method: 4-bit LoRA (4 rounds) 2023-12-15
Authors: Aashiq Muhamed, Kevin Kuo
Affiliation: Carnegie Mellon University
Email: kkuo2@andrew.cmu.edu
Description: Our method reaches the target ANLS with 16 MB of communication (4 rounds, 4 MB per round). This is a 750x improvement over the baseline, which reaches the target ANLS using 12 GB of communication (3 rounds, 4 GB per round).
Our method combines the following techniques:
1. We use LoRA (rank=8) to reduce the number of trainable parameters to ~3.6M (14.4 MB in fp32).
2. We apply NormalFloat-4 quantization with a block size of 64, which reduces the size from (3.6M)*(4 bytes/param) = 14.4 MB to 14.4*(4/32 + 1/64) ≈ 2 MB.
3. At each round, we train on a single client for 12 local epochs.
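The size arithmetic in the list above can be checked with a short sketch. It assumes that the 1/64 term corresponds to one 32-bit scale stored per block of 64 weights, and that 1 GB = 1000 MB; the function name `lora_payload_mb` is hypothetical and not part of the submitted code.

```python
def lora_payload_mb(num_params, bits_per_weight=4, block_size=64, scale_bits=32):
    """Return (fp32 size, quantized size) in MB for a given parameter count.

    Assumption: each block of `block_size` weights carries one `scale_bits`-bit
    scale, which is one way to read the 1/64 overhead term in the description.
    """
    fp32_mb = num_params * 4 / 1e6  # 4 bytes per fp32 parameter
    # Fraction of the fp32 size kept after quantization: 4/32 + 32/(32*64) = 4/32 + 1/64
    ratio = bits_per_weight / 32 + scale_bits / (32 * block_size)
    return fp32_mb, fp32_mb * ratio


fp32_mb, nf4_mb = lora_payload_mb(3_600_000)

# Totals quoted in the description: 12 GB for the baseline vs. 4 rounds * 4 MB.
improvement = (12 * 1000) / (4 * 4)

print(f"fp32: {fp32_mb:.1f} MB, 4-bit: {nf4_mb:.3f} MB, improvement: {improvement:.0f}x")
```

The quantized payload comes out slightly above 2 MB (2.025 MB), which the description rounds to 2 MB.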
Additional details:
* To be consistent with the behavior of the provided code, we do not log communication of the frozen weights (which remain in fp32).
* We quantize and send the delta from the initial LoRA weights. This does not affect communication because all clients can locally generate the initialization.
* For the first round, we log the same size as the other rounds. However, in practice we only need to send a random seed to generate the initialization.
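A minimal sketch of the delta-plus-seed idea described above, not the authors' implementation. It assumes a uniform 16-level codebook as a stand-in for the NormalFloat-4 levels (which are spaced by normal quantiles), a Gaussian initialization, and hypothetical helper names (`init_lora`, `quantize_delta`, `dequantize_delta`).

```python
import random

BLOCK = 64
# Stand-in 16-level codebook (uniform in [-1, 1]); this is NOT the NF4 codebook.
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]


def init_lora(seed, n):
    """Regenerate the same initial weights anywhere from a shared seed (assumed Gaussian)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.01) for _ in range(n)]


def quantize_delta(delta):
    """Blockwise absmax quantization: one float scale per block, one 4-bit code per weight."""
    codes, scales = [], []
    for start in range(0, len(delta), BLOCK):
        block = delta[start:start + BLOCK]
        scale = max(abs(x) for x in block) or 1.0  # avoid dividing by zero
        scales.append(scale)
        codes.extend(
            min(range(16), key=lambda i: abs(x / scale - CODEBOOK[i])) for x in block
        )
    return codes, scales


def dequantize_delta(codes, scales):
    """Map each code back to its codebook value and rescale by its block's scale."""
    return [CODEBOOK[c] * scales[i // BLOCK] for i, c in enumerate(codes)]


# Sender: train from the seeded init, then transmit only (codes, scales).
seed, n = 0, 128
init = init_lora(seed, n)
trained = [w + 0.001 * ((i % 7) - 3) for i, w in enumerate(init)]  # fake training update
codes, scales = quantize_delta([t - w for t, w in zip(trained, init)])

# Receiver: rebuild the init from the seed and add the dequantized delta.
recon = [w + d for w, d in zip(init_lora(seed, n), dequantize_delta(codes, scales))]
```

Because the receiver regenerates the initialization from the seed, only the 4-bit codes and per-block scales of the delta need to be transmitted.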
method: Communication Tuned Low-Rank Adaptation of Document Encoder 2023-10-26
Authors: Aashiq Muhamed, Kevin Kuo
Affiliation: Carnegie Mellon University
Description: The current baseline uses LoRA in the T5 language encoder alone; the weights of the visual and spatial encoders are frozen. We use LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1) in the language model. The model achieves 82.81 val ANLS in 7 rounds. We attach the model__7 checkpoint and the corresponding communication log. Because LoRA adds a low-rank matrix in parallel to each existing weight matrix, it can be merged into those matrices at inference time, so the forward inference cost is unchanged from the original model.
Correction:
After applying LoRA, `encoder.embed_tokens.weight` and `encoder.embed_tokens.weight` show up in state_dict but not in model.parameters(). The current code excludes a tensor if it has requires_grad=False but counts it if it doesn't show up in model.parameters(). We had to correct this to get a true estimate of the LoRA communication size.
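The counting rule described in the correction can be illustrated with a framework-free sketch. It assumes the inputs are plain dictionaries: element counts keyed by state_dict name, and requires_grad flags keyed by parameter name (in PyTorch these could be built from `model.state_dict()` and `model.named_parameters()`). The function names, the tensor names other than `encoder.embed_tokens.weight`, and the sizes are hypothetical.

```python
def count_bytes_original(state_dict_numel, requires_grad, bytes_per_elem=4):
    """Behavior described in the correction: skip a tensor only when it is
    explicitly marked requires_grad=False, so tensors that are absent from the
    parameter list are still counted."""
    return sum(
        numel * bytes_per_elem
        for name, numel in state_dict_numel.items()
        if requires_grad.get(name) is not False
    )


def count_bytes_corrected(state_dict_numel, requires_grad, bytes_per_elem=4):
    """Count a tensor only when it is a registered parameter AND trainable."""
    return sum(
        numel * bytes_per_elem
        for name, numel in state_dict_numel.items()
        if requires_grad.get(name) is True
    )


# Toy model: a frozen shared embedding, an alias of it that appears only in the
# state dict, and two trainable LoRA matrices (illustrative names and sizes).
state_dict_numel = {
    "shared.weight": 1000,
    "encoder.embed_tokens.weight": 1000,
    "lora_A.weight": 64,
    "lora_B.weight": 64,
}
requires_grad = {
    "shared.weight": False,
    "lora_A.weight": True,
    "lora_B.weight": True,
}

original_bytes = count_bytes_original(state_dict_numel, requires_grad)
corrected_bytes = count_bytes_corrected(state_dict_numel, requires_grad)
```

In this toy case the alias inflates the original count to 4512 bytes, while the corrected count covers only the 512 bytes of trainable LoRA weights.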
method: LoRA baseline 2023-10-23
Authors: Aashiq Muhamed
Affiliation: Carnegie Mellon University
Email: amuhamed@andrew.cmu.edu
Description: The current baseline uses LoRA in the T5 language encoder alone; the weights of the visual and spatial encoders are frozen. We use LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1) in the language model. The model achieves 82.81 val ANLS in 7 rounds. We attach the model__7 checkpoint and the corresponding communication log. Because LoRA adds a low-rank matrix in parallel to each existing weight matrix, it can be merged into those matrices at inference time, so the forward inference cost is unchanged from the original model.
https://arxiv.org/abs/2106.09685
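The merging claim can be checked on a toy example. The sketch below assumes the standard LoRA update W + (lora_alpha / r) * B A and uses small hand-written matrices; the helpers `matmul`, `add`, and `merge_lora` are hypothetical and unrelated to the submitted code. The toy uses r=1 and lora_alpha=4, which gives the same scale of 4 as r=8, lora_alpha=32.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def merge_lora(w, lora_b, lora_a, r, lora_alpha):
    """Fold the low-rank update into the frozen weight: W + (lora_alpha / r) * B @ A."""
    return add(w, matmul(lora_b, lora_a), scale=lora_alpha / r)


w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # frozen weight, 2 x 3
lora_b = [[1.0], [2.0]]                  # 2 x r
lora_a = [[0.5, 0.0, -1.0]]              # r x 3
x = [[1.0], [1.0], [1.0]]                # input column vector
r, lora_alpha = 1, 4

# Parallel path used during training: W x + (alpha / r) * B (A x)
parallel_out = add(matmul(w, x), matmul(lora_b, matmul(lora_a, x)), scale=lora_alpha / r)

# Merged path at inference: a single matrix product with the folded weight
merged_out = matmul(merge_lora(w, lora_b, lora_a, r, lora_alpha), x)
```

Both paths produce the same output, and the merged path needs only one matrix product per layer, matching the cost of the original model.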
Date | Method | Communication: Total GB | Communication: FL Rounds | Question-Answering: ANLS | Question-Answering: Accuracy | OT
---|---|---|---|---|---|---
2023-12-15 | 4-bit LoRA (4 rounds) | 0.0153 | 4 | 0.8687 | 77.0870 | T
2023-10-26 | Communication Tuned Low-Rank Adaptation of Document Encoder | 0.3797 | 7 | 0.8566 | 76.2199 | T
2023-10-23 | LoRA baseline | 5.5272 | 7 | 0.8566 | 76.2199 | T
2023-10-27 | FedShampoo | 10.0174 | 3 | 0.8891 | 79.4751 | T
2023-10-26 | (Baseline) FedAvg Baseline | 44.6561 | 10 | 0.8873 | 79.3054 | T