method: 4-bit LoRA (4 rounds)

Date: 2023-12-15

Authors: Aashiq Muhamed, Kevin Kuo

Affiliation: Carnegie Mellon University

Email: kkuo2@andrew.cmu.edu

Description: Our method reaches the target ANLS with 16 MB of communication (4 rounds, 4 MB per round). This is a 750x improvement over the baseline, which reaches the target ANLS using 12 GB of communication (3 rounds, 4 GB per round).

Our method combines the following techniques:
1. We use LoRA (rank=8) to reduce the number of trainable parameters to ~3.6M (14.4 MB in fp32).
2. We apply NormalFloat-4 (NF4) quantization with a block size of 64, which reduces the payload from (3.6M) * (4 bytes/param) = 14.4 MB to 14.4 * (4/32 + 1/64) ≈ 2 MB: each parameter shrinks from 32 bits to 4 bits (the 4/32 term), and each block of 64 parameters carries one fp32 quantization constant (the 1/64 term). A size-accounting sketch follows this list.
3. At each round, we train on a single client for 12 local epochs.
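
The arithmetic in step 2 can be sanity-checked with a small stand-alone sketch. The snippet below is illustrative only (it is not our training code): a uniform 16-level grid stands in for the true NF4 quantile grid, we assume one fp32 absmax constant per block of 64 values, and the helper names are ours.

```python
# Illustrative sketch only (not our training code): a uniform 16-level grid
# stands in for the true NF4 quantile grid, and we assume one fp32 absmax
# constant per block of 64 values.
import torch

LEVELS_4BIT = torch.linspace(-1.0, 1.0, 16)  # stand-in for the NF4 levels

def blockwise_4bit_quantize(x: torch.Tensor, block_size: int = 64):
    """Quantize a tensor to 4-bit level indices with one fp32 scale per block."""
    flat = x.flatten()
    flat = torch.cat([flat, flat.new_zeros((-flat.numel()) % block_size)])  # pad to full blocks
    blocks = flat.view(-1, block_size)
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)  # one fp32 scale per block
    idx = ((blocks / absmax).unsqueeze(-1) - LEVELS_4BIT).abs().argmin(dim=-1)  # nearest level
    return idx.to(torch.uint8), absmax.squeeze(1)

def payload_megabytes(num_params: int, block_size: int = 64) -> float:
    """4 bits per parameter plus one 4-byte absmax per block of parameters."""
    return (num_params * 0.5 + (num_params / block_size) * 4) / 1e6

idx, scales = blockwise_4bit_quantize(torch.randn(4096))  # tiny demo tensor
print(payload_megabytes(3_600_000))                       # ~2.0 MB, vs 14.4 MB in fp32
```

Running the last line reproduces the roughly 2 MB per-round payload quoted in step 2.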


Additional details:
*To be consistent with the behavior of the provided code, we do not log communication of the frozen weights (which remain in fp32).
*We quantize and send the delta from the initial LoRA weights. This does not affect the communication cost because all clients can generate the initialization locally.
*For the first round, we log the same size as for the other rounds, although in practice we would only need to send a random seed from which to generate the initialization. A sketch of this delta-from-seed scheme follows these notes.
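
The last two notes describe a simple delta-from-seed protocol. The sketch below shows the idea under our own hypothetical helper names; a generic random initialization stands in for the actual LoRA initialization (which sets B to zero and A to a small random matrix), and the quantize/dequantize arguments are placeholders for the NF4 step above.

```python
# Hypothetical helpers illustrating the delta-from-seed idea described above;
# a generic random init stands in for the actual LoRA initialization.
import torch

def init_lora_weights(shapes: dict, seed: int) -> dict:
    """All clients and the server derive the same initialization from a shared seed."""
    gen = torch.Generator().manual_seed(seed)
    return {name: 0.01 * torch.randn(*shape, generator=gen) for name, shape in shapes.items()}

def client_payload(local_lora: dict, init_lora: dict, quantize) -> dict:
    """Upload only the (quantized) difference from the shared initialization."""
    return {name: quantize(local_lora[name] - init_lora[name]) for name in local_lora}

def apply_payload(init_lora: dict, payload: dict, dequantize) -> dict:
    """Reconstruct the client's LoRA weights from the seed-derived init plus the delta."""
    return {name: init_lora[name] + dequantize(q) for name, q in payload.items()}
```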

Authors: Aashiq Muhamed, Kevin Kuo

Affiliation: Carnegie Mellon University


Correction:
After applying LoRA, `encoder.embed_tokens.weight` and `decoder.embed_tokens.weight` show up in the state_dict but not in model.parameters(), because they are aliases of the tied shared embedding and parameters() de-duplicates shared tensors. The current code excludes a tensor only if it has requires_grad=False, but it still counts tensors that do not appear in model.parameters() at all. We had to correct this to get a true estimate of the LoRA communication size; a sketch of the corrected accounting is below.
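
A minimal sketch of the corrected accounting, assuming the goal is to count only tensors that are actually trainable; the helper name and the megabyte conversion are ours, not the challenge code.

```python
# Sketch of the corrected accounting (our helper, not the challenge code):
# count only tensors that correspond to a trainable parameter, and count each
# underlying tensor once even if it appears under several state_dict aliases.
import torch.nn as nn

def communicated_megabytes(model: nn.Module) -> float:
    trainable_ids = {id(p) for p in model.parameters() if p.requires_grad}
    counted, total_bytes = set(), 0
    # keep_vars=True keeps the actual Parameter objects, so id() matching works
    for name, tensor in model.state_dict(keep_vars=True).items():
        if id(tensor) in trainable_ids and id(tensor) not in counted:
            counted.add(id(tensor))
            total_bytes += tensor.numel() * tensor.element_size()
    return total_bytes / 1e6
```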

method: LoRA baseline

Date: 2023-10-23

Authors: Aashiq Muhamed

Affiliation: Carnegie Mellon University

Email: amuhamed@andrew.cmu.edu

Description: The current baseline uses LoRA in the T5 language encoder alone; the weights of the visual and spatial encoders are frozen. We use LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1) in the language model. The model achieves 82.81 val ANLS in 7 rounds. We attach the model__7 checkpoint and the corresponding communication log. Because LoRA introduces low-rank matrices in parallel to the existing weight matrices, they can be merged into those matrices at inference time, so the net forward inference cost is unchanged from the original model. A configuration sketch follows the reference below.

https://arxiv.org/abs/2106.09685
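
For reference, a rough sketch of this kind of LoRA setup with the Hugging Face peft API is below. The plain t5-base checkpoint and the "q"/"v" target modules are our simplifications (the actual baseline uses the challenge's multimodal document model and applies LoRA to the language encoder alone, which would additionally require filtering module names); they are not details from the submission.

```python
# Illustrative sketch only: a plain T5 stands in for the challenge's multimodal
# document model, and the "q"/"v" target modules are an assumption.
from transformers import T5ForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = T5ForConditionalGeneration.from_pretrained("t5-base")

lora_config = LoraConfig(
    r=8,              # rank of the low-rank update
    lora_alpha=32,    # scaling factor
    lora_dropout=0.1,
    target_modules=["q", "v"],  # assumed: T5 attention query/value projections
)

model = get_peft_model(base, lora_config)  # freezes base weights, injects LoRA adapters
model.print_trainable_parameters()         # only the low-rank A/B matrices remain trainable
```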

Ranking Table

Date       | Method                                                       | Total GB (Communication) | FL Rounds | ANLS   | Accuracy
2023-12-15 | 4-bit LoRA (4 rounds)                                        | 0.0153                   | 4         | 0.8687 | 77.0870
2023-10-26 | Communication Tuned Low-Rank Adaptation of Document Encoder  | 0.3797                   | 7         | 0.8566 | 76.2199
2023-10-23 | LoRA baseline                                                | 5.5272                   | 7         | 0.8566 | 76.2199
2023-10-27 | FedShampoo                                                   | 10.0174                  | 3         | 0.8891 | 79.4751
2023-10-26 | (Baseline) FedAvg Baseline                                   | 44.6561                  | 10        | 0.8873 | 79.3054

[Ranking graphic: communication cost (GB)]