Red Team Participation Instructions - Health
Task Definition
- Participants are tasked with launching membership inference attacks (MIA) on synthetic versions of two RNA sequencing datasets, TCGA-BRCA and TCGA-COMBINED, generated with baseline generative methods.
  - TCGA-BRCA, <1,089 individuals x 978 genes>
    - Five subtypes with an imbalanced distribution
    - Suitable for the subtype prediction task
  - TCGA-COMBINED, <4,323 individuals x 978 genes>
    - 10 cancer tissues (Breast, Colorectal, Esophagus, Kidney, Liver, Lung, Ovarian, Pancreatic, Prostate, and Skin) with an imbalanced distribution
    - Suitable for the cancer tissue-of-origin prediction task
- The red teams will use each whole dataset as the test set and identify which data points were in the training set used to generate the provided synthetic dataset.
- In the first phase, red teams will be provided with three synthetic datasets per RNA-seq dataset, generated by the baseline generative methods.
  - The code for the baseline generative methods is publicly available in the GitHub Starter Package Repo as part of the Blue Team materials. These methods use the default values stated in the config.yaml file.
  - Each synthetic dataset download includes a text file stating which method generated it. The file also notes any parameter whose value was changed from the one in config.yaml.
- In the second phase, red teams will get access to the synthetic datasets generated by a selected set of blue teams, together with their white-box code.
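To make the task concrete, here is a minimal sketch of a distance-based membership score, in the spirit of nearest-neighbour attacks such as GAN-leaks. It is an illustration only, not the code from the Starter Package: the function names, the toy data, and the choice of Euclidean distance are all assumptions. The idea is that a real record lying unusually close to some synthetic record is more likely to have been in the generator's training set.

```python
import math

def nearest_distance(record, synthetic_rows):
    """Euclidean distance from one real record to its closest synthetic record."""
    return min(math.dist(record, s) for s in synthetic_rows)

def membership_scores(real_rows, synthetic_rows):
    """Return one score per real record; a higher score means 'more likely a member'."""
    return [-nearest_distance(r, synthetic_rows) for r in real_rows]

# Toy data (hypothetical): 3 features per individual instead of 978 genes.
real = [[0.0, 1.0, 2.0], [5.0, 5.0, 5.0], [0.1, 0.9, 2.1]]
synthetic = [[0.0, 1.0, 2.1], [0.2, 1.1, 1.9]]

scores = membership_scores(real, synthetic)

# Indices of the real records, from most to least likely member.
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

On the real data, each row of the whole dataset would take the place of `real`, and one of the provided synthetic datasets would take the place of `synthetic`.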
Baseline methods
The following baseline methods and their respective performance on MIA are provided as part of the Github Starter Package Repo for participants.
We use some of the baseline methods provided in the DOMIAS package:
- DOMIAS KDE (Van Breugel et al., 2023)
- GAN-leaks and GAN-leaks calibrated (Chen et al., 2020)
- LOGAN (Hayes et al., 2019)
- Monte Carlo (MC) (Hilprecht et al., 2019)
DOMIAS, GAN-leaks calibrated, and LOGAN require an external reference dataset that reflects the true data distribution and is not used during generation or testing.
We provide a reference dataset for the TCGA-COMBINED dataset only, of <824 individuals x 978 genes>. This dataset is not shared with the Blue Team.
You are free to use relevant public datasets as a reference set if your method depends on one.
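The role of a reference dataset can be sketched as follows. This is a simplified illustration of reference calibration, not the implementation in the DOMIAS package: the score formula, function names, and toy data are assumptions. A record that is close to the synthetic data only because it sits in a dense region of the true distribution will also be close to the reference data, so subtracting the two distances discounts that effect.

```python
import math

def nearest_distance(record, rows):
    """Euclidean distance from a record to its closest row in `rows`."""
    return min(math.dist(record, r) for r in rows)

def calibrated_scores(real_rows, synthetic_rows, reference_rows):
    """Score = distance to reference minus distance to synthetic.

    Positive: the record is closer to the synthetic data than to the
    reference data, which this sketch treats as evidence of membership.
    """
    return [
        nearest_distance(r, reference_rows) - nearest_distance(r, synthetic_rows)
        for r in real_rows
    ]

# Toy data (hypothetical): 3 features per individual instead of 978 genes.
real = [[0.0, 1.0, 2.0], [5.0, 5.0, 5.0]]
synthetic = [[0.0, 1.0, 2.1], [0.2, 1.1, 1.9]]
reference = [[5.0, 5.0, 5.1], [4.8, 5.0, 5.2]]

calibrated = calibrated_scores(real, synthetic, reference)
```

In this toy example the first record gets a positive score and the second a negative one, because the second record is explained by the reference data rather than by the synthetic data.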
📈 Evaluation
We provide the classification performance metrics accuracy, AUC, and AUPR for the baseline methods in the GitHub Starter Package Repo.
We strongly encourage you to explore additional metrics that could provide better insights and to include them in your extended abstracts for CAMDA.
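For readers who want to check their own attack locally, the three metrics can be computed from ground-truth membership labels and attack scores as in the sketch below. This is a dependency-free illustration, not the evaluation code from the Starter Package: the function names, the 0.5 threshold, and the toy labels are assumptions, and AUPR is approximated here by average precision with no special handling of tied scores.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of records whose thresholded score matches the true label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def roc_auc(labels, scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of precision values taken at the rank of each true member."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy example (hypothetical): 1 = member of the training set, 0 = non-member.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

acc = accuracy(labels, scores)
auc = roc_auc(labels, scores)
aupr = average_precision(labels, scores)
```

The pairwise AUC loop is quadratic in the number of records, which is still manageable for datasets of a few thousand individuals.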
✅ Submissions checklist
The following files are required for benchmark method submission, compressed in a zip file, for each dataset:
- config.yaml: Config file with attack model configurations
- Prediction files: CSV files with a single column named membership_label, without an index
  - synthetic_data_1_predictions.csv
  - synthetic_data_2_predictions.csv
  - synthetic_data_3_predictions.csv
- White-box code: Modified red_team.py and other necessary .py files
- environment.yaml: Environment file to run and reproduce the results
We expect two files from red teams during each submission period, named in the following format:
- redteam_{teamname}_TCGA-BRCA.zip
- redteam_{teamname}_TCGA-COMBINED.zip
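The packaging step can be sketched with the Python standard library as shown below. The helper names `write_predictions` and `build_submission`, the team name, the 0/1 label values, and the flat layout inside the zip are assumptions for illustration; config.yaml, environment.yaml, and the white-box code would be passed through the hypothetical `extra_files` argument.

```python
import csv
import tempfile
import zipfile
from pathlib import Path

def write_predictions(path, labels):
    """Write one CSV with a single membership_label column and no index column."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["membership_label"])
        writer.writerows([label] for label in labels)

def build_submission(out_dir, team_name, dataset, predictions, extra_files=()):
    """Create redteam_{teamname}_{dataset}.zip with one prediction CSV per synthetic dataset."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    zip_path = out_dir / f"redteam_{team_name}_{dataset}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for idx, labels in enumerate(predictions, start=1):
            csv_path = out_dir / f"synthetic_data_{idx}_predictions.csv"
            write_predictions(csv_path, labels)
            zf.write(csv_path, arcname=csv_path.name)
        # e.g. config.yaml, environment.yaml, red_team.py (paths supplied by the caller)
        for extra in extra_files:
            zf.write(extra, arcname=Path(extra).name)
    return zip_path

# Usage with toy labels (hypothetical), written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = build_submission(
        tmp, "myteam", "TCGA-BRCA", [[1, 0, 1], [0, 0, 1], [1, 1, 0]]
    )
    with zipfile.ZipFile(zip_path) as zf:
        names = sorted(zf.namelist())
        first_csv = zf.read("synthetic_data_1_predictions.csv").decode()
```

Calling `build_submission` once per dataset, with "TCGA-BRCA" and "TCGA-COMBINED", yields the two expected archive names.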
Good luck! 🍀
References
Please make sure to cite the following papers if any of the baseline methods and evaluation metrics are mentioned/utilized in your CAMDA extended abstracts.
Competition related
- CAMDA 2025 Health Privacy Challenge
Dataset sources
- Genomic Data Commons (GDC), https://gdc.cancer.gov/, https://portal.gdc.cancer.gov/, accessed on Nov 1, 2024
Dataset preprocessing
- Chen, Dingfan, Marie Oestreich, Tejumade Afonja, Raouf Kerkouche, Matthias Becker, and Mario Fritz. "Towards biologically plausible and private gene expression data generation." arXiv preprint. (2024)
- Love, Michael I., Wolfgang Huber, and Simon Anders. "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2." Genome biology. (2014)
- Colaprico, Antonio, Tiago C. Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S. Sabedot et al. "TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data." Nucleic acids research. (2016)
- Subramanian, A., Narayan, R., Corsello, S.M., Peck, D.D., Natoli, T.E., Lu, X., Gould, J., Davis, J.F., Tubelli, A.A., Asiedu, J.K. and Lahr, D.L. "A next generation connectivity map: L1000 platform and the first 1,000,000 profiles." Cell. (2017)
- Landmark genes, https://clue.io/command?q=/gene-space%20lm, accessed on Nov 1, 2024
Generative models and evaluations
- Sohn, Kihyuk, Honglak Lee, and Xinchen Yan. "Learning structured output representation using deep conditional generative models." Advances in neural information processing systems. (2015)
- Xu, Lei, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. "Modeling tabular data using conditional gan." Advances in neural information processing systems. (2019)
- Holsten, L., Dahm, K., Oestreich, M., Becker, M., & Ulas, T. "hCoCena: A toolbox for network-based co-expression analysis and horizontal integration of transcriptomic datasets." STAR protocols. (2024)
- Lun ATL, McCarthy DJ, Marioni JC. "A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor." F1000Res. (2016)
Membership inference attack models
- Van Breugel, B., Sun, H., Qian, Z., & van der Schaar, M. "Membership inference attacks against synthetic data through overfitting detection." arXiv preprint. (2023)
- Chen, D., Yu, N., Zhang, Y., and Fritz, M. "Gan-leaks: A taxonomy of membership inference attacks against generative models." In Proceedings of the 2020 ACM SIGSAC conference on computer and communications security. (2020)
- Hilprecht, B., Härterich, M., & Bernau, D. "Monte carlo and reconstruction membership inference attacks against generative models." Proceedings on Privacy Enhancing Technologies. (2019)
- Hayes, J., Melis, L., Danezis, G. & De Cristofaro, E. "Logan: Membership inference attacks against generative models." arXiv preprint. (2019)
Challenge News
Important Dates
Jan 14, 2025: Benchmark method submissions open.
Mar 15, 2025: Phase 1 benchmark submission deadline for blue and red teams.
Mar 22, 2025: Leaderboard announcement.
May 12, 2025: Phase 2 benchmark submission deadline for red teams.
May 15, 2025: CAMDA 2025 extended abstract submission deadline.
May 22, 2025: CAMDA 2025 acceptance notifications are sent.