Participation Instructions - Health (2026)
🫐 Blue Team Home Page
Log in or register on Elsa to join or create a team.
Task Definition
Participants are tasked with developing a method that generates synthetic copies of two RNA sequencing datasets, TCGA-BRCA and TCGA COMBINED.
- TCGA-BRCA, <1,089 individuals x 978 genes>
  - Five subtypes with an imbalanced distribution
  - Suitable for subtype prediction task
- TCGA COMBINED, <4,323 individuals x 978 genes>
  - 10 cancer tissues, including Breast, Colorectal, Esophagus, Kidney, Liver, Lung, Ovarian, Pancreatic, Prostate, and Skin, with an imbalanced distribution
  - Suitable for cancer tissue-of-origin prediction task
Blue teams will use the scripts provided in the Starter Package Github Repo to generate stratified five-fold dataset splits, and each method will be evaluated on its average performance across the splits.
Each blue team must use a different random seed to generate its dataset splits and run its method.
- For example, the config.yaml provided in the examples assigns the value 42 to the random seed in both the dataset and the generator configurations, and the submission zip examples use the value 41. You are therefore not allowed to use either of these values.
- Blue teams must run their experiments with a unique random seed that is used consistently in both the dataset and the generator configurations in the config.yaml file.
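The official splits must come from the Starter Package scripts. The sketch below only illustrates what a seeded, stratified five-fold split does, using the Python standard library. The function name `stratified_folds`, the toy labels, and the seed value 1234 are hypothetical and not taken from the repo.

```python
import random
from collections import defaultdict

# Seed values used in the provided examples; teams may not reuse them.
RESERVED_SEEDS = {41, 42}


def stratified_folds(labels, n_folds=5, seed=1234):
    """Assign sample indices to n_folds folds with similar label proportions."""
    if seed in RESERVED_SEEDS:
        raise ValueError(f"seed {seed} is used in the provided examples; pick another")
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a near-equal share of this label.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


# Toy, imbalanced label vector (hypothetical class names).
labels = ["A"] * 50 + ["B"] * 30 + ["C"] * 10
folds = stratified_folds(labels, n_folds=5, seed=1234)

for k, test_idx in enumerate(folds):
    # Fold k is held out; the remaining folds form the training set.
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    print(k, len(train_idx), len(test_idx))
```

Because the random generator is seeded, rerunning with the same seed reproduces the same folds, which is why the same seed has to appear in both the dataset and the generator configuration.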
Baseline methods
The following baseline methods and their respective performance on the selected evaluation metrics are provided as part of the Starter Package Github Repo for participants.
- Multivariate normal sampling (MVN) from average gene expression levels
- Conditional Variational Autoencoder (CVAE, Sohn et al., 2015) without and with Differential Privacy
- Conditional Tabular Generative Adversarial Network (CTGAN, Xu et al., 2019) without and with Differential Privacy
- Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP, Arjovsky et al., 2017, Gulrajani et al., 2017)
- Conditional Variational Autoencoder with Gaussian Mixture Models (CVAE-GMM, Apellániz et al., 2024)
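To give a feel for the simplest baseline, the sketch below draws synthetic samples from a multivariate normal distribution with NumPy. It is an illustration under assumptions, not the Starter Package implementation: fitting one distribution per class, the function name `sample_mvn_per_class`, the toy data, and the seed are all hypothetical.

```python
import numpy as np


def sample_mvn_per_class(X, y, seed=1234):
    """Fit a mean and covariance per class and draw as many samples as the class has."""
    rng = np.random.default_rng(seed)
    synth_X, synth_y = [], []
    for label in np.unique(y):
        X_c = X[y == label]
        mean = X_c.mean(axis=0)          # average expression level per gene
        cov = np.cov(X_c, rowvar=False)  # gene-by-gene covariance
        samples = rng.multivariate_normal(mean, cov, size=X_c.shape[0])
        synth_X.append(samples)
        synth_y.extend([label] * X_c.shape[0])
    return np.vstack(synth_X), np.array(synth_y)


# Toy data: 80 samples x 4 genes with two imbalanced classes (hypothetical names).
toy_rng = np.random.default_rng(0)
X = np.vstack([
    toy_rng.normal(0.0, 1.0, size=(60, 4)),
    toy_rng.normal(3.0, 1.0, size=(20, 4)),
])
y = np.array(["subtype_a"] * 60 + ["subtype_b"] * 20)

synth_X, synth_y = sample_mvn_per_class(X, y, seed=1234)
print(synth_X.shape, synth_y.shape)
```

With 978 genes and only a few samples in the smallest classes, the estimated covariance matrix is rank-deficient, so a real implementation would likely need some form of regularisation.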
Please also take a look at the solutions of the Blue Team participants in the Health Privacy Challenge 2025, which feature models such as diffusion models, probabilistic graphical models (PGMs), and non-negative matrix factorization (NMF).
Evaluation
Submissions will be evaluated along multiple dimensions:
- Utility: Downstream utility based on feature performance and feature importance:
  - cancer subtype prediction for TCGA-BRCA
  - cancer tissue-of-origin prediction for TCGA COMBINED
- Fidelity: Measurement of how well the statistical properties of the original data are preserved in the synthetic data.
- Biological plausibility: Conservation of correlation between features (genes). Detecting sets of co-expressed genes is a common step in transcriptomics data analysis.
- Privacy: Assessment of privacy preservation of the generated synthetic datasets.
We provide several ready-to-use metrics in the Starter Package Github Repo. We strongly encourage you to explore additional metrics that could provide interesting insights into biological preservation, and to include these in your extended abstracts for CAMDA.
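As one possible starting point for the biological plausibility dimension, the sketch below compares gene-gene Pearson correlations between a real and a synthetic expression matrix. It is not one of the metrics shipped in the Starter Package: the function name `correlation_difference` and the toy data are hypothetical.

```python
import numpy as np


def correlation_difference(real, synthetic):
    """Mean absolute difference between gene-gene Pearson correlations.

    Both inputs are (samples x genes) arrays with the same gene order.
    """
    corr_real = np.corrcoef(real, rowvar=False)
    corr_synth = np.corrcoef(synthetic, rowvar=False)
    # Compare each gene pair once by using the upper triangle without the diagonal.
    upper = np.triu_indices_from(corr_real, k=1)
    return float(np.mean(np.abs(corr_real[upper] - corr_synth[upper])))


# Toy data: five genes driven by one shared latent factor, so they are co-expressed.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
real = latent + 0.3 * rng.normal(size=(200, 5))

# A "good" synthetic set keeps the correlation structure (real data plus small noise).
good = real + 0.05 * rng.normal(size=real.shape)
# A "bad" synthetic set keeps each gene's marginal but breaks co-expression.
shuffled = np.column_stack(
    [rng.permutation(real[:, j]) for j in range(real.shape[1])]
)

print(correlation_difference(real, good))
print(correlation_difference(real, shuffled))
```

A lower value means the co-expression structure is better conserved. The shuffled set shows why matching per-gene distributions alone is not enough. Genes with constant expression would yield undefined correlations and need to be filtered out first.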
✅ Submissions checklist
The following files are required for each benchmark method submission, compressed into one zip file per dataset:
- config.yaml: Config file containing your generator configuration and a team-specific random seed value that is consistent between the dataset config and your generator config. Please refer to the Github repo for details.
  - ❗️Please note that a new, team-specific random seed value means that each blue team has different real and synthetic datasets, because the dataset splits are generated from this value.
- {dataset_name}_split.yaml: Data split YAML file generated with the team-specific random seed value.
- Five synthetic datasets and the corresponding subtype/type labels (a single column named Subtype, without index), saved with the save_synthetic_data() function:
  - e.g. synthetic_data_split_{split_no}.csv, synthetic_labels_split_{split_no}.csv, etc.
- White-box code: Modified blue_team.py with your generator class included, following the instructions on Github, plus any other .py files needed to run your code.
- environment.yaml: Environment file to run and reproduce the results.
This means we expect two zip files from you, named as follows:
- blueteam_{yourteamname}_TCGA-BRCA.zip
- blueteam_{yourteamname}_TCGA-COMBINED.zip

Please take a look at the example submission zip file in the Starter Package Github Repo to double-check the formatting.
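The checklist can be verified before zipping with a small script such as the sketch below. The example zip in the Starter Package remains the authoritative layout. Several details here are assumptions: that split numbers run from 1 to 5, that all files sit at the top level of the zip, that the dataset name in the split file matches the one in the zip name, and the helper names `expected_files` and `build_submission`.

```python
import tempfile
import zipfile
from pathlib import Path


def expected_files(dataset_name, split_numbers):
    """List the file names named in the checklist for one dataset."""
    files = [
        "config.yaml",
        f"{dataset_name}_split.yaml",
        "blue_team.py",
        "environment.yaml",
    ]
    for n in split_numbers:
        files.append(f"synthetic_data_split_{n}.csv")
        files.append(f"synthetic_labels_split_{n}.csv")
    return files


def build_submission(folder, team_name, dataset_name, split_numbers):
    """Check that required files exist, then zip every file in the folder."""
    folder = Path(folder)
    missing = [
        name for name in expected_files(dataset_name, split_numbers)
        if not (folder / name).is_file()
    ]
    if missing:
        raise FileNotFoundError(f"missing required files: {missing}")

    # The zip is written next to the folder so it is not included in itself.
    zip_path = folder.parent / f"blueteam_{team_name}_{dataset_name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.iterdir()):
            if path.is_file():  # also picks up any extra .py files
                zf.write(path, arcname=path.name)
    return zip_path


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "submission"
    work.mkdir()
    for name in expected_files("TCGA-BRCA", range(1, 6)):
        (work / name).touch()
    zip_path = build_submission(work, "myteam", "TCGA-BRCA", range(1, 6))
    zip_name = zip_path.name
    with zipfile.ZipFile(zip_path) as zf:
        zipped = sorted(zf.namelist())

print(zip_name, len(zipped))
```

Running the same function once per dataset produces the two expected archives.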
Good luck! 🍀
References
Please make sure to cite the following papers if any of the baseline methods and evaluation metrics are mentioned/utilized in your CAMDA extended abstracts.
Competition related
- CAMDA 2026 Health Privacy Challenge
Dataset sources
- Genomic Data Commons (GDC), https://gdc.cancer.gov/, https://portal.gdc.cancer.gov/, accessed on Nov 1, 2024
Dataset preprocessing
- Chen, Dingfan, Marie Oestreich, Tejumade Afonja, Raouf Kerkouche, Matthias Becker, and Mario Fritz. "Towards biologically plausible and private gene expression data generation." arXiv preprint. (2024)
- Love, Michael I., Wolfgang Huber, and Simon Anders. "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2." Genome biology. (2014)
- Colaprico, Antonio, Tiago C. Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S. Sabedot et al. "TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data." Nucleic acids research. (2016)
- Subramanian, A., Narayan, R., Corsello, S.M., Peck, D.D., Natoli, T.E., Lu, X., Gould, J., Davis, J.F., Tubelli, A.A., Asiedu, J.K. and Lahr, D.L. "A next generation connectivity map: L1000 platform and the first 1,000,000 profiles." Cell. (2017)
- Landmark genes, https://clue.io/command?q=/gene-space%20lm, accessed on Nov 1, 2024
Generative models and evaluations
- Sohn, Kihyuk, Honglak Lee, and Xinchen Yan. "Learning structured output representation using deep conditional generative models." Advances in neural information processing systems. (2015)
- Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International conference on machine learning. PMLR. (2017)
- Gulrajani, Ishaan, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. "Improved training of wasserstein gans." Advances in neural information processing systems. (2017)
- Xu, Lei, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. "Modeling tabular data using conditional gan." Advances in neural information processing systems. (2019)
- Holsten, L., Dahm, K., Oestreich, M., Becker, M., & Ulas, T. "hCoCena: A toolbox for network-based co-expression analysis and horizontal integration of transcriptomic datasets." STAR protocols. (2024)
- Apellániz, Patricia A., Juan Parras, and Santiago Zazo. "An improved tabular data generator with VAE-GMM integration." 2024 32nd European Signal Processing Conference (EUSIPCO). IEEE. (2024)
Membership inference attack models
- Van Breugel, B., Sun, H., Qian, Z., & van der Schaar, M. "Membership inference attacks against synthetic data through overfitting detection." arXiv preprint. (2023)
- Chen, D., Yu, N., Zhang, Y., & Fritz, M. "GAN-Leaks: A taxonomy of membership inference attacks against generative models." Proceedings of the 2020 ACM SIGSAC conference on computer and communications security. (2020)
- Hilprecht, B., Härterich, M., & Bernau, D. "Monte carlo and reconstruction membership inference attacks against generative models." Proceedings on Privacy Enhancing Technologies. (2019)
- Hayes, J., Melis, L., Danezis, G., & De Cristofaro, E. "LOGAN: Membership inference attacks against generative models." arXiv preprint. (2019)
Challenge News
Important Dates
Jan, 2025: Benchmark method submissions open.
May, 2025: Track I & Track II benchmark submission deadline.
May, 2025: CAMDA 2026 extended abstract submission deadline.
Jul 12-16, 2026: CAMDA Conference @ ISMB 2026 in Washington D.C., USA.