Red Team Participation Instructions - Health

🍅 Red Team Home Page

Log in or register on ELSA to join or create a team

Red Teams can now join Phase 2, even without Phase 1 participation!
To access the relevant data, please register on the ELSA Benchmark platform and create a Red Team. We will reach out to you!

Task Definition

  • Participants are tasked with launching membership inference attacks (MIAs) on synthetic RNA sequencing datasets, TCGA-BRCA and TCGA-COMBINED, generated with baseline generative methods. 

    • TCGA-BRCA, <1,089 individuals x 978 genes> 

      • Five subtypes with an imbalanced distribution

      • Suitable for subtype prediction task 

    • TCGA COMBINED, <4,323 individuals x 978 genes>

      • 10 cancer tissues (Breast, Colorectal, Esophagus, Kidney, Liver, Lung, Ovarian, Pancreatic, Prostate, and Skin), with an imbalanced distribution 

      • Suitable for cancer tissue-of-origin prediction task 

  • The red teams will use each full dataset as the test set and identify which of its data points were in the training set used to generate the provided synthetic dataset. 

  • In the first phase, they will be provided with three synthetic datasets per RNA-seq dataset, each generated by one of the baseline generative methods.

    • The baseline generative methods' code is publicly available in the GitHub Starter Package Repo as part of the Blue Team materials. These generative methods utilise the default values stated in the config.yaml file.

    • Each synthetic dataset download includes a text file stating which method was used to generate the corresponding synthetic data, and noting any parameter values that were changed from the defaults in config.yaml. 

  • In the second phase, they will have access to the synthetic datasets generated by a selected set of Blue Teams, along with their white-box code. 
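As a concrete illustration of the task, here is a minimal nearest-neighbour membership-inference sketch (the intuition behind GAN-leaks-style attacks). It uses small random stand-in matrices rather than the actual TCGA files; substitute the files provided by the organisers in practice:

```python
import numpy as np
import pandas as pd

def distance_mia_scores(real, synthetic):
    """Score each real record by the distance to its nearest synthetic record;
    smaller distances suggest the record was a training member."""
    return np.array([np.min(np.linalg.norm(synthetic - x, axis=1)) for x in real])

# Toy stand-ins: 20 "real" records, of which the first 10 leak into the
# synthetic data as near-copies (simulated training members).
rng = np.random.default_rng(0)
real = rng.normal(size=(20, 5))
synthetic = real[:10] + rng.normal(scale=0.01, size=(10, 5))

scores = distance_mia_scores(real, synthetic)
labels = (scores < np.median(scores)).astype(int)  # threshold at the median

# The challenge expects a single-column CSV named membership_label, no index.
pd.DataFrame({"membership_label": labels}).to_csv(
    "synthetic_data_1_predictions.csv", index=False
)
```

Real attacks on the <individuals x genes> matrices will need more care (normalisation, calibration, a principled threshold), but the input/output shape is the same.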

 

Phase 2 participation instructions  

In Phase 2, we will share four Blue Team models with you. These solutions are numbered (Model_1 through Model_4), and each one has two corresponding synthetic datasets: TCGA-BRCA and TCGA-COMBINED. 

To access the Blue Team solutions and their synthetic datasets, please register on the ELSA Benchmark system and create a team. We will send you the relevant files via email. If you do not receive them within 24 hours, please contact us through the CAMDA Health Privacy Challenge Forum.

 

Attack Guidelines:

  • You can choose any solution to attack. However, you must provide predictions for both datasets associated with the selected solution.

  • You can attack multiple solutions to improve your chances of ranking higher on the leaderboard.

  • For solutions you do not attack, your score will default to the baseline attack score using the MC algorithm (Hilprecht et al., 2019).

 

Submission Requirements:

Each submission must include four prediction CSVs per dataset (eight in total).

  • For solutions you did not attack, submit an empty CSV file (0 bytes) with the correct naming structure.

For example, if attacks were launched on Model_1 and Model_4 only, the submission for each of the two datasets should look like this:

  • synthetic_data_1_predictions.csv  (15KB)  

  • synthetic_data_2_predictions.csv  (Zero Bytes)  

  • synthetic_data_3_predictions.csv  (Zero Bytes)  

  • synthetic_data_4_predictions.csv  (12KB)  
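The placeholder convention above is easy to script. A minimal sketch, assuming four models and a hypothetical choice of which ones were attacked (the attacked models' prediction files are assumed to be written by your attack code):

```python
from pathlib import Path

attacked = {1, 4}  # hypothetical: the models you actually attacked

for i in range(1, 5):
    if i not in attacked:
        # Required naming, zero bytes, for models you did not attack.
        Path(f"synthetic_data_{i}_predictions.csv").write_bytes(b"")
```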

 

For other submission requirements, refer to the Submission Checklist below.

 

Phase 1 baseline methods 

The following baseline methods, along with their respective MIA performance, are provided as part of the GitHub Starter Package Repo for participants. 

We utilised some of the baseline methods provided in the DOMIAS package:

  • DOMIAS KDE (Van Breugel et al., 2023) 

  • GAN-leaks and GAN-leaks calibrated (Chen et al., 2020)

  • LOGAN (Hayes et al., 2019) 

  • Monte Carlo (MC)  (Hilprecht et al., 2019)

DOMIAS, GAN-leaks calibrated, and LOGAN require an external reference dataset that reflects the true data distribution and is not used during the generative or test processes. 

We provide a reference dataset for the TCGA-COMBINED dataset only, of <824 individuals x 978 genes>. This dataset is not shared with Blue Teams. 

 

You are free to use relevant public datasets as a reference set if your method depends on one.
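If your method does use a reference set, here is a minimal density-ratio sketch in the spirit of DOMIAS KDE (Van Breugel et al., 2023), with random stand-in arrays in place of the actual synthetic and reference matrices:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Random stand-ins; replace with the provided synthetic and reference data.
rng = np.random.default_rng(0)
synthetic = rng.normal(0.0, 1.0, size=(500, 2))
reference = rng.normal(0.5, 1.0, size=(500, 2))
test_points = rng.normal(0.0, 1.0, size=(50, 2))

# DOMIAS intuition: records with high density under the synthetic data,
# relative to the reference (true) distribution, are likely training members.
p_syn = gaussian_kde(synthetic.T)  # gaussian_kde expects shape (dims, n)
p_ref = gaussian_kde(reference.T)
ratio = p_syn(test_points.T) / p_ref(test_points.T)
labels = (ratio > np.median(ratio)).astype(int)
```

For the 978-gene matrices a plain KDE will struggle with dimensionality; the DOMIAS package applies dimensionality reduction and more robust density estimators.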

 

📈 Evaluation 

We provide classification performance metrics (AUC, AUPR, and TPR @ FPR = [0.01, 0.1]) for the baseline methods in the GitHub Starter Package Repo.

We strongly encourage you to explore additional metrics that could provide better insights, and to include them in your extended abstracts for CAMDA. 
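The listed metrics can be computed with scikit-learn; a sketch with random stand-in labels and scores (replace these with your ground truth and your attack's membership scores):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

# Stand-in membership labels and attack scores for demonstration only.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
scores = y_true + rng.normal(scale=0.8, size=200)

auc = roc_auc_score(y_true, scores)            # AUC
aupr = average_precision_score(y_true, scores)  # AUPR

def tpr_at_fpr(y, s, target_fpr):
    """True-positive rate at a fixed false-positive rate, read off the ROC curve."""
    fpr, tpr, _ = roc_curve(y, s)
    return float(np.interp(target_fpr, fpr, tpr))

tpr_001 = tpr_at_fpr(y_true, scores, 0.01)  # TPR @ FPR = 0.01
tpr_010 = tpr_at_fpr(y_true, scores, 0.10)  # TPR @ FPR = 0.1
```

TPR at a low fixed FPR is especially informative for privacy attacks, since it measures how many members an attacker identifies while making almost no false accusations.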

 

✅ Submissions checklist 

The following files are required for a benchmark method submission, compressed into one zip file per dataset:

  • config.yaml: Config file with the attack model configurations

  • Prediction files: CSV files with a single column named membership_label and no index column:

    • synthetic_data_1_predictions.csv

    • synthetic_data_2_predictions.csv

    • synthetic_data_3_predictions.csv

  • White-box code: The modified red_team.py and any other necessary .py files

  • environment.yaml: Environment file to run and reproduce the results. 

We expect two files from each red team during each submission period, named in the following format: 

  1. redteam_{teamname}_TCGA-BRCA.zip

  2. redteam_{teamname}_TCGA-COMBINED.zip
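Packaging the checklist into the expected zip files can be scripted; a sketch with a hypothetical team name and empty placeholder files standing in for the real config, predictions, code, and environment:

```python
import zipfile
from pathlib import Path

teamname = "example_team"  # hypothetical team name
files = [
    "config.yaml",
    "synthetic_data_1_predictions.csv",
    "synthetic_data_2_predictions.csv",
    "synthetic_data_3_predictions.csv",
    "red_team.py",
    "environment.yaml",
]

# Empty placeholders for this demonstration only.
for name in files:
    Path(name).touch()

with zipfile.ZipFile(f"redteam_{teamname}_TCGA-BRCA.zip", "w") as zf:
    for name in files:
        zf.write(name)
```

Repeat with the TCGA-COMBINED files to produce the second zip.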

 

Good luck! 🍀

 

 

References 

Please make sure to cite the following papers if any of the baseline methods or evaluation metrics are mentioned or used in your CAMDA extended abstracts.
Competition related
  • CAMDA 2025 Health Privacy Challenge
Dataset sources
  • Genomic Data Commons (GDC), https://gdc.cancer.gov/, https://portal.gdc.cancer.gov/, accessed on Nov 1, 2024
Dataset preprocessing
  • Chen, Dingfan, Marie Oestreich, Tejumade Afonja, Raouf Kerkouche, Matthias Becker, and Mario Fritz. "Towards biologically plausible and private gene expression data generation." arXiv preprint. (2024)
  • Love, Michael I., Wolfgang Huber, and Simon Anders. "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2." Genome biology. (2014)
  • Colaprico, Antonio, Tiago C. Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S. Sabedot et al. "TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data." Nucleic acids research. (2016)
  • Subramanian, A., Narayan, R., Corsello, S.M., Peck, D.D., Natoli, T.E., Lu, X., Gould, J., Davis, J.F., Tubelli, A.A., Asiedu, J.K. and Lahr, D.L. "A next generation connectivity map: L1000 platform and the first 1,000,000 profiles." Cell. (2017)
  • Landmark genes, https://clue.io/command?q=/gene-space%20lm, accessed on Nov 1, 2024
Generative models and evaluations
  • Sohn, Kihyuk, Honglak Lee, and Xinchen Yan. "Learning structured output representation using deep conditional generative models." Advances in neural information processing systems. (2015)
  • Xu, Lei, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. "Modeling tabular data using conditional gan." Advances in neural information processing systems. (2019)
  • Holsten, L., Dahm, K., Oestreich, M., Becker, M., & Ulas, T. "hCoCena: A toolbox for network-based co-expression analysis and horizontal integration of transcriptomic datasets. STAR protocols." (2024)
  • Lun ATL, McCarthy DJ, Marioni JC. “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res. (2016)
Membership inference attack models
  • Van Breugel, B., Sun, H., Qian, Z., & van der Schaar, M. "Membership inference attacks against synthetic data through overfitting detection." arXiv preprint. (2023)
  • Chen, D., Yu, N., Zhang, Y., and Fritz, M. "Gan-leaks: A taxonomy of membership inference attacks against generative models." In Proceedings of the 2020 ACM SIGSAC conference on computer and communications security. (2020)
  • Hilprecht, B., Härterich, M., & Bernau, D. "Monte carlo and reconstruction membership inference attacks against generative models." Proceedings on Privacy Enhancing Technologies. (2019)
  • Hayes, J., Melis, L., Danezis, G. & De Cristofaro, E. "Logan: Membership inference attacks against generative models." arXiv preprint. (2019)

Important Dates

Jan 14, 2025: Benchmark method submissions open. 

Mar 15, 2025: Track I Phase 1 benchmark submission deadline for blue and red teams. 

Mar 22, 2025: Leaderboard announcement. 

May 12, 2025: Track I Phase 2 (for red teams) & Track II benchmark submission deadline. 

May 15, 2025: CAMDA 2025 extended abstract submission deadline. 

May 22, 2025: CAMDA 2025 acceptance notifications are sent.

Jul 23-24, 2025: CAMDA Conference @ ISMB/ECCB 2025 in Liverpool, UK.