Participation Instructions - Document Intelligence
Track 1 - Federated Learning only
The methods must be trained on Document Visual Question Answering (DocVQA) within a federated learning framework, simulating the need for cooperation between different entities to achieve the best-performing model in the most efficient way. Track 1 participants' objective is to reduce the communication used (#bytes) while achieving performance comparable to the baseline. More specifically, their methods can't score below 0.8242 ANLS (a maximum 5% drop w.r.t. the baseline).
Track 2 - Federated Learning + Privacy-preserving
In addition to training over distributed data, we seek to protect the identity of the invoice providers, which could be exposed through textual (provider company name) or visual (logo, presentation) information. The goal is that a malicious competitor (adversary) cannot infer information about a company's providers, since such a leak would have a direct impact on the company's business.
Track 2 participants' objective is to achieve the best performance on question-answering while complying with a privacy budget of ε = 1, 4, and 8 in the Federated Learning set-up.
We have defined a series of rules to ensure a fair comparison between participating teams and to focus efforts on developing and improving Federated Learning (FL) and Differentially Private (DP) algorithms applied to the Document Visual Question Answering task.
- Participants can only use the provided VT5 model to design their training method. Consequently, the model architecture can't be modified.
- The proposed methods must use the initial pre-trained weights provided by the organizers for the model. Furthermore, only the provided PFL-DocVQA dataset can be used to fine-tune these weights, and no other public or private data can be used to either pre-train or fine-tune the model.
- The training data is split into 10 different clients. This distribution can't be modified; consequently, each client must be treated independently, without sharing its data with other clients or the central server.
- Methods must be trained adhering to the provided Federated Learning set-up.
- For Track 2 participants (FL + DP): Methods must satisfy a privacy guarantee at least as strong as the pre-defined (ε, δ) requirement.
- No variants of the same method (with different hyperparameters) will be permitted. At the competition closure date, each team must have one single method, or otherwise clearly explain how submitted methods are different.
To participate in any of the tracks, first download the starting kit and the dataset. You can set up the baseline code by following the provided instructions of the framework. Remember that neither the architecture of the model nor the provided data can be modified. Instead, you will need to focus on the implementation of efficient and effective Federated Learning and Differentially Private algorithms to reduce the communication overhead (Tracks 1 and 2) and keep the required privacy level (Track 2) while achieving the best question-answering performance. Basic FL and DP algorithms are already implemented in the provided starting kit, so you are expected to outperform the baseline scores.
To officially become a part of the competition, you need to submit the results of your training. Track 1 participants will need to submit the weights of the learned model and the FL communication log. Track 2 participants will need to submit the weights of their models complying with a privacy budget of ε = 1, 4, and 8; that is, 3 different sets of weights corresponding to the different privacy levels, with the corresponding 3 FL communication logs. After the end of the competition (October 27th), you will have until November 1st to submit the formal privacy proof that supports the model's privacy preservation. You can find more information about the submission format and requirements in the corresponding section at the end.
We provide a starting kit with a baseline architecture implemented within a framework to perform Visual Question Answering on documents following a Federated Learning setup, where Differential Privacy can be optionally applied.
The model used is Visual T5 (VT5), a version of the Hi-VT5 described in the MP-DocVQA paper, arranged in a non-hierarchical paradigm (using only one page per question-answer pair). We start from the pre-trained t5-base language backbone and the pre-trained DiT-base to embed visual features (which we keep frozen during the fine-tuning phase). We then fine-tune the model on the Single-Page DocVQA (SP-DocVQA) dataset using the MP-DocVQA framework.
Track 1: Federated Learning
The framework currently implements a basic Federated Average aggregation scheme for Federated Learning. The training data is divided into 10 different clients. In each FL Round, K clients are randomly selected and the model is trained using all the question-answer pairs of those clients. After completing the training for that round, each client sends their respective updated model to the server. The server computes the average of the received updated models, and sends it back to the next set of K randomly sampled clients for the subsequent round.
This baseline achieves 0.8676 ANLS and 77.41 accuracy on the validation set after 10 FL Rounds. Each communication stream transmits a constant 1.12 GB, which results in a total of 44.66 GB during the entire training process.
- K = 2
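The round described above can be sketched as follows. This is a minimal sketch, not the starting kit's API: the `Client` interface with a `train` method is a hypothetical stand-in for local fine-tuning on a client's question-answer pairs.

```python
import copy
import random

import torch


def fedavg_round(global_model, clients, K=2):
    """One FL round: sample K clients, train locally, average the updates."""
    sampled = random.sample(clients, K)
    client_states = []
    for client in sampled:
        local_model = copy.deepcopy(global_model)
        client.train(local_model)  # fine-tune on this client's QA pairs
        client_states.append(local_model.state_dict())
    # Server side: element-wise average of the K updated models.
    avg_state = {
        key: torch.stack([s[key].float() for s in client_states]).mean(dim=0)
        for key in client_states[0]
    }
    global_model.load_state_dict(avg_state)
    return global_model
```

The averaged model then becomes the starting point sent to the next set of K sampled clients.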
Track 2: Federated Learning + Differential Privacy
The framework currently implements a basic Federated Average aggregation scheme for Federated Learning, with Differential Privacy based on Gaussian noise. The training data is divided into 10 different clients. In each FL Round, K clients are randomly selected, and M providers are further randomly selected for each of those clients. After training the model with all the question-answer pairs of a provider, the update is clipped according to the sensitivity; this limits the influence that any single provider can have on the model update. Once the client has computed the updates for all its providers, the updates are aggregated (summed up) and Gaussian noise is added to preserve privacy. Finally, the model is updated with the new noisy update and sent back to the server. The server aggregates all the received updated models and sends the result back to the next set of K randomly selected clients for the subsequent round.
This baseline achieves on the validation set after 5 FL Rounds:
- ε: 1: 0.4620 of ANLS and 38.70 accuracy.
- ε: 4: 0.4837 of ANLS and 41.24 accuracy.
- ε: 8: 0.5030 of ANLS and 43.19 accuracy.
Each communication stream transmits a constant 1.12 GB, which results in a total of 22.32 GB during the entire training process.
- δ = 10⁻⁵
- K = 2, M = 50
- Sensitivity = 0.5
- Gaussian Noise: Normal distribution noise with mean = 0, and std = sensitivity * noise_multiplier
- Noise multiplier: [1.145, 0.637, 0.468] for ε: 1, 4, and 8 respectively.
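The per-provider clipping and noise step described above can be sketched as follows. This is an illustrative sketch, not the framework's implementation: flat tensors stand in for model updates, and the function name is hypothetical. It uses the baseline's convention that the noise std is sensitivity × noise multiplier.

```python
import torch


def dp_aggregate(provider_updates, sensitivity=0.5, noise_multiplier=1.145):
    """Clip each provider's update, sum them, and add Gaussian noise."""
    clipped = []
    for update in provider_updates:
        norm = update.norm(2)
        # Scale down any update whose L2 norm exceeds the sensitivity;
        # updates already within the bound are left unchanged.
        clipped.append(update * min(1.0, sensitivity / (norm + 1e-12)))
    total = torch.stack(clipped).sum(dim=0)
    # Gaussian noise with mean 0 and std = sensitivity * noise_multiplier.
    noise = torch.randn(total.shape) * (sensitivity * noise_multiplier)
    return total + noise
```

The noisy sum is what the client applies to the model before sending it to the server, so no individual provider's exact contribution is ever transmitted.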
Submission Format and Requirements:
As described before, Track 1 participants will need to submit the weights of the model and the FL communication log. Track 2 participants are required to submit the weights of the models complying with a privacy budget of ε = 1, 4, and 8. Therefore, Track 2 participants will submit 3 different sets of weights, together with their corresponding 3 FL communication logs. After the end of the competition (October 27th), you will have until November 1st to submit the privacy proof that supports the model's privacy preservation.
- Model weights: You can find the weights of your trained model under the directory save/checkpoints/ in the provided baseline framework.
- How to compute the ε budget: Track 2 participants are required to comply with a privacy budget of ε = 1, 4, and 8. We provide a script within the framework in /differential_privacy/privacy_calculator.py to compute this. However, this script is specific to the baseline setup and you may need to adapt it to your algorithm.
- Federated Learning Communication log: In the provided framework, at the end of each round a CSV file with a log of the communications employed is stored under the save/communication_logs/ directory.
- Privacy-preserving proof: The submitted solutions for Track 2 must be accompanied by a theoretical proof that ensures that the sensitive information in the training set remains confidential throughout the training process. We provide an example of such a privacy proof here. Track 2 participants will have 3 extra days after the end of the competition to submit their formal proofs.
Although we strongly recommend using the provided PFL-DocVQA framework released within the starting kit, it is still possible to use your preferred framework to train the model. Just keep in mind that the model architecture can't be modified and that you can't use any other data to pre-train or fine-tune your model. Also, you will need to provide exactly the same files in the submission form. We provide a guide in this section on how to create the required files with the proper format.
During the competition period, once you submit your method you will only see whether the submission has been processed correctly. You won't be able to access the results, as they will remain hidden until the end of the competition. This is intended to prevent cherry-picking between variants of the same method with different modifications or hyperparameters. Hence, you are expected to find your best method by evaluating on the validation set, and then submit it before the competition ends (October 27). At the end, the results and leaderboard will be made public (November 15).
- VQA Performance: To evaluate the method's performance on question-answering, we will use the Average Normalized Levenshtein Similarity (ANLS), formally introduced during the ST-VQA Challenge. We will also report accuracy as a secondary VQA metric. In Track 1, participants are required to achieve scores of at least 0.8242 and 0.8430 ANLS on the validation and test sets respectively (a maximum 5% performance drop w.r.t. the baseline). In Track 2, this will be the primary metric used to rank the methods; more specifically, we will use the average of the ANLS of the three models corresponding to ε = 1, 4, and 8.
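As an illustration, ANLS can be computed as below. This is a sketch of the common ST-VQA formulation with a 0.5 threshold; the official evaluation script may differ in normalization details (e.g. answer pre-processing).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions, ground_truths, threshold=0.5):
    """predictions: list[str]; ground_truths: list[list[str]] (valid answers)."""
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            nl = levenshtein(p, a) / max(len(p), len(a), 1)
            best = max(best, 1.0 - nl)
        # Scores below the threshold are zeroed out (wrong answer).
        total += best if best >= threshold else 0.0
    return total / len(predictions)
```

For example, a prediction that differs from the ground truth by one character out of six still scores about 0.83, while a completely different answer scores 0.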
- Communication efficiency: We will measure the amount of information exchanged between the clients and the server during training, in bytes. More specifically, the communication metric is the sum of bytes communicated between the clients and the server in both directions. Participants will be required to use the provided framework and functions to report the amount of communication used during the training process. This will be the main metric for Track 1, while it will only be an informative metric for Track 2.
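For intuition, the per-stream payload and the resulting total can be estimated as follows. These helper names are illustrative, not the framework's logging functions; the framework's own reporting must still be used for the official metric.

```python
import torch.nn as nn


def payload_bytes(model: nn.Module) -> int:
    """Byte size of one model transfer: sum over all state_dict tensors."""
    return sum(t.numel() * t.element_size() for t in model.state_dict().values())


def total_communication(bytes_per_stream: int, rounds: int, clients_per_round: int) -> int:
    """Each sampled client both receives and sends the model once per round."""
    return bytes_per_stream * rounds * clients_per_round * 2
```

With the baseline's roughly 1.12 GB streams, 10 rounds and K = 2 give about 44.8 GB, in line with the reported total for Track 1.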
- Privacy-preserving: Participants in Track 2 will be required to keep their privacy budget within the pre-defined ε budgets: 1, 4, and 8. We provide a script within the framework in /differential_privacy/privacy_calculator.py to compute the ε budget your method is spending. However, this script is specific to the baseline setup and you may need to adapt it to your algorithm.
ELSA sponsored prizes for winners announced
Workshop at NeurIPS 2023
Communications log fixed in baseline code
Final version of the PFL-DocVQA framework released
Release of training and validation splits.
November 15, 2023: Winning teams announced.
November 1, 2023: Privacy proof reports due for Track 2 participant teams.
October 27, 2023: End of the competition. Submission data deadline.
June 30, 2023: Release of training and validation splits.
June 15, 2023: Competition registration opens.