Participation Instructions - Health
🫐 Blue Team Home Page
Task Definition
-
Participants are tasked with developing a method that generates synthetic copies of the two RNA sequencing datasets, TCGA-BRCA and TCGA COMBINED.
-
TCGA-BRCA, <1,089 individuals x 978 genes>
-
Five subtypes with an imbalanced distribution
-
Suitable for subtype prediction task
-
-
TCGA COMBINED, <4,323 individuals x 978 genes>
-
10 cancer tissues, including Breast, Colorectal, Esophagus, Kidney, Liver, Lung, Ovarian, Pancreatic, Prostate, and Skin, with an imbalanced distribution
-
Suitable for cancer tissue-of-origin prediction task
-
-
The blue teams will use the scripts provided in the Starter Package Github Repo to generate stratified five-fold dataset splits. Each method will be evaluated on its performance averaged across the five folds.
-
Each blue team must use a different random seed to generate their dataset splits and run their method.
-
For example, the config.yaml provided in the examples assigns the value 42 to the random seed in both the dataset and the generator method configurations, and the example submission zip uses the value 41. You are therefore not allowed to use either of these values.
-
Blue teams must run their experiments with a unique random seed consistent throughout dataset and generator configurations in the config.yaml file.
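The splitting rule above can be illustrated with a minimal sketch. It is not the Starter Package script, which remains the required way to produce the splits. The function name `stratified_kfold`, the `RESERVED_SEEDS` set, and the toy labels are hypothetical and only show how a team-specific seed drives a stratified five-fold split.

```python
import random
from collections import defaultdict

# Seeds used by the provided examples (42 in config.yaml, 41 in the example zip).
RESERVED_SEEDS = {41, 42}


def stratified_kfold(labels, n_splits=5, seed=1234):
    """Return (train_idx, test_idx) pairs that keep class proportions per fold."""
    if seed in RESERVED_SEEDS:
        raise ValueError(f"seed {seed} is used by the provided examples; pick another")
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle each class with the seeded generator, then deal it across folds.
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % n_splits].append(idx)

    splits = []
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train_idx, test_idx))
    return splits


# Toy, imbalanced labels standing in for five subtypes (hypothetical names).
labels = ["A"] * 50 + ["B"] * 20 + ["C"] * 15 + ["D"] * 10 + ["E"] * 5
splits = stratified_kfold(labels, n_splits=5, seed=1234)
```

Because the shuffle depends only on the seed, two teams with different seeds obtain different train/test partitions, while rerunning with the same seed reproduces the same splits.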
-
Baseline methods
The following baseline methods, together with their performance on the selected evaluation metrics, are provided to participants in the Starter Package Github Repo.
-
Multivariate normal sampling from average gene expression levels
-
Conditional Variational Autoencoder (CVAE, Sohn et al., 2015) without and with Differential Privacy
-
Conditional Generative Adversarial Networks (CTGAN, Xu et al., 2019) without and with Differential Privacy
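As a rough illustration of the first baseline, the sketch below fits a mean vector and covariance matrix to toy expression data and samples from the resulting multivariate normal. This is one plausible reading of the baseline, not the repository's implementation. The helper names `fit_mvn` and `sample_mvn`, the jitter term, and the toy data are assumptions. It uses only the standard library; in practice NumPy's `Generator.multivariate_normal` would do the sampling, and a conditional variant would fit one distribution per class label.

```python
import math
import random


def fit_mvn(rows, jitter=1e-6):
    """Estimate the mean vector and the Cholesky factor of the covariance."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    for i in range(d):
        cov[i][i] += jitter  # small diagonal term keeps the matrix positive definite

    # Cholesky decomposition: cov = chol * chol^T, chol lower triangular.
    chol = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][j] = math.sqrt(cov[i][i] - s)
            else:
                chol[i][j] = (cov[i][j] - s) / chol[j][j]
    return mean, chol


def sample_mvn(mean, chol, n_samples, rng):
    """Draw samples as mean + chol * z, with z standard normal."""
    d = len(mean)
    samples = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        samples.append([mean[i] + sum(chol[i][k] * z[k] for k in range(i + 1))
                        for i in range(d)])
    return samples


# Toy "real" data: 200 samples x 3 correlated genes (hypothetical values).
rng = random.Random(1234)
real = []
for _ in range(200):
    base = rng.gauss(0.0, 1.0)
    real.append([5.0 + base, 3.0 + 0.5 * base + rng.gauss(0.0, 0.5), 1.0 + rng.gauss(0.0, 1.0)])

mean, chol = fit_mvn(real)
synthetic = sample_mvn(mean, chol, 500, rng)
```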
Evaluation
The evaluation will be conducted along multiple dimensions:
-
Utility: Downstream utility based on feature performance and feature importance,
-
cancer subtype prediction for TCGA-BRCA,
-
cancer tissue-of-origin for TCGA COMBINED
-
-
Fidelity: Measurement on how well statistical properties of the original data are preserved in the synthetic data.
-
Biological plausibility: Conservation of correlation between features (genes). Detecting sets of co-expressed genes is a common step in transcriptomics data analysis.
-
Privacy: Assessment of privacy preservation of the generated synthetic datasets.
We provide several ready-to-use metrics in the Starter Package Github Repo. We strongly encourage you to explore additional metrics that could provide interesting insights into the preservation of biological signal, and to include these in your extended abstracts for CAMDA.
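To make the biological plausibility dimension concrete, the sketch below compares gene-gene Pearson correlations between a real and a synthetic matrix. It is an illustrative additional metric, not one of the metrics shipped in the repository. The function names and toy matrices are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(rows):
    """Gene-by-gene correlation matrix for a samples x genes table."""
    cols = list(zip(*rows))
    d = len(cols)
    return [[pearson(cols[i], cols[j]) for j in range(d)] for i in range(d)]


def mean_abs_correlation_gap(real_rows, synth_rows):
    """Mean absolute difference between real and synthetic gene-pair correlations."""
    cr = correlation_matrix(real_rows)
    cs = correlation_matrix(synth_rows)
    d = len(cr)
    gaps = [abs(cr[i][j] - cs[i][j]) for i in range(d) for j in range(i + 1, d)]
    return sum(gaps) / len(gaps)


# Toy data: genes 0 and 1 are perfectly co-expressed in the real table.
real = [[1, 2, 3], [2, 4, 1], [3, 6, 2], [4, 8, 5]]
# A synthetic table that reverses gene 1 and so breaks the co-expression.
broken = [[1, 8, 3], [2, 6, 1], [3, 4, 2], [4, 2, 5]]

gap_identical = mean_abs_correlation_gap(real, real)
gap_broken = mean_abs_correlation_gap(real, broken)
```

A value near 0 indicates that pairwise co-expression is conserved. Larger values indicate that the generator distorts the correlation structure.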
✅ Submissions checklist
The following files are required for benchmark method submission, compressed in a zip file, for each dataset:
-
config.yaml: Config file containing your generator configurations, and a team-specific random seed value, consistent through dataset config and your generator config. Please refer to Github repo for details.
-
❗️Please note that a team-specific random seed means that each blue team works with different real and synthetic datasets, because the dataset splits are generated from this value.
-
-
{dataset_name}_split.yaml: Data split yaml file generated with the team-specific random seed value,
-
Five synthetic datasets and their corresponding subtype/type labels (a single column named Subtype, without an index), saved with the save_synthetic_data() function:
-
e.g. synthetic_data_split_{split_no}.csv, synthetic_labels_split_{split_no}.csv etc.,
-
-
White-box code: Modified blue_team.py with your generator class included, following the instructions in Github, and other necessary .py files to run your code.
-
environment.yaml: Environment file to run and reproduce the results.
This means we expect two zip files from you, named as follows:
-
blueteam_{yourteamname}_TCGA-BRCA.zip
-
blueteam_{yourteamname}_TCGA-COMBINED.zip
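The checklist above can be turned into a small packaging helper. The sketch below is hypothetical and not part of the Starter Package. It assumes a flat zip layout, split numbers 1 to 5, and that the same dataset name is used in the split file name and the zip name. Check the example submission zip for the actual layout and numbering.

```python
import tempfile
import zipfile
from pathlib import Path


def required_files(dataset_name, split_numbers=range(1, 6)):
    """List the file names from the checklist (split numbering is an assumption)."""
    names = ["config.yaml", f"{dataset_name}_split.yaml", "blue_team.py", "environment.yaml"]
    for k in split_numbers:
        names.append(f"synthetic_data_split_{k}.csv")
        names.append(f"synthetic_labels_split_{k}.csv")
    return names


def build_submission(src_dir, team_name, dataset_name, out_dir):
    """Check that every required file exists, then write the submission zip."""
    src = Path(src_dir)
    names = required_files(dataset_name)
    missing = [n for n in names if not (src / n).is_file()]
    if missing:
        raise FileNotFoundError(f"missing required files: {missing}")

    zip_path = Path(out_dir) / f"blueteam_{team_name}_{dataset_name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            zf.write(src / name, arcname=name)
        # Include any other .py files needed to run the generator.
        for extra in sorted(src.glob("*.py")):
            if extra.name not in names:
                zf.write(extra, arcname=extra.name)
    return zip_path


# Demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "work"
    work.mkdir()
    for name in required_files("TCGA-BRCA") + ["my_generator.py"]:
        (work / name).write_text("placeholder\n")
    zip_path = build_submission(work, "myteam", "TCGA-BRCA", tmp)
    zip_name = zip_path.name
    with zipfile.ZipFile(zip_path) as zf:
        zip_members = sorted(zf.namelist())
```

Running the same helper with the dataset name TCGA-COMBINED would produce the second zip file.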
Please take a look at the example submission zip file in the Starter Package Github Repo to double-check the formatting.
Good luck! 🍀
References
Please make sure to cite the following papers if any of the baseline methods and evaluation metrics are mentioned/utilized in your CAMDA extended abstracts.
Competition related
-
CAMDA 2025 Health Privacy Challenge
Dataset sources
-
Genomic Data Commons (GDC), https://gdc.cancer.gov/, https://portal.gdc.cancer.gov/, accessed on Nov 1, 2024
Dataset preprocessing
-
Chen, Dingfan, Marie Oestreich, Tejumade Afonja, Raouf Kerkouche, Matthias Becker, and Mario Fritz. "Towards biologically plausible and private gene expression data generation." arXiv preprint. (2024)
-
Love, Michael I., Wolfgang Huber, and Simon Anders. "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2." Genome biology. (2014)
-
Colaprico, Antonio, Tiago C. Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S. Sabedot et al. "TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data." Nucleic acids research. (2016)
-
Subramanian, A., Narayan, R., Corsello, S.M., Peck, D.D., Natoli, T.E., Lu, X., Gould, J., Davis, J.F., Tubelli, A.A., Asiedu, J.K. and Lahr, D.L. "A next generation connectivity map: L1000 platform and the first 1,000,000 profiles." Cell. (2017)
-
Landmark genes, https://clue.io/command?q=/gene-space%20lm, accessed on Nov 1, 2024
Generative models and evaluations
-
Sohn, Kihyuk, Honglak Lee, and Xinchen Yan. "Learning structured output representation using deep conditional generative models." Advances in neural information processing systems. (2015)
-
Xu, Lei, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. "Modeling tabular data using conditional gan." Advances in neural information processing systems. (2019)
-
Holsten, L., Dahm, K., Oestreich, M., Becker, M., & Ulas, T. "hCoCena: A toolbox for network-based co-expression analysis and horizontal integration of transcriptomic datasets." STAR protocols. (2024)
-
Lun, A.T.L., McCarthy, D.J., & Marioni, J.C. "A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor." F1000Res. (2016)
Membership inference attack models
-
Van Breugel, B., Sun, H., Qian, Z., & van der Schaar, M. "Membership inference attacks against synthetic data through overfitting detection." arXiv preprint. (2023)
-
Chen, D., Yu, N., Zhang, Y., & Fritz, M. "Gan-leaks: A taxonomy of membership inference attacks against generative models." Proceedings of the 2020 ACM SIGSAC conference on computer and communications security. (2020)
-
Hilprecht, B., Härterich, M., & Bernau, D. "Monte carlo and reconstruction membership inference attacks against generative models." Proceedings on Privacy Enhancing Technologies. (2019)
-
Hayes, J., Melis, L., Danezis, G. & De Cristofaro, E. "Logan: Membership inference attacks against generative models." arXiv preprint. (2019)
Challenge News
Important Dates
Jan 14, 2025: Benchmark method submissions open.
Mar 15, 2025: Phase 1 benchmark submission deadline for blue and red teams.
Mar 22, 2025: Leaderboard announcement.
May 12, 2025: Phase 2 benchmark submission deadline for red teams.
May 15, 2025: CAMDA 2025 extended abstract submission deadline.
May 22, 2025: CAMDA 2025 acceptance notifications are sent.