Participation Instructions - Health

🫐 Blue Team Home Page

Task Definition

Participants are tasked with developing a method that generates synthetic copies of two RNA sequencing datasets, TCGA-BRCA and TCGA COMBINED.
TCGA-BRCA, <1,089 individuals x 978 genes>
- Five subtypes with an imbalanced distribution
- Suitable for subtype prediction task
TCGA COMBINED, <4,323 individuals x 978 genes>
- 10 cancer tissues, including Breast, Colorectal, Esophagus, Kidney, Liver, Lung, Ovarian, Pancreatic, Prostate, and Skin, with an imbalanced distribution
- Suitable for cancer tissue-of-origin prediction task
The blue teams will use the scripts provided in the Starter Package Github Repo to generate stratified five-fold dataset splits. Each method will be evaluated on its average performance across the five splits.
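To make the idea of a seeded, stratified five-fold split concrete, here is a minimal standard-library sketch. It is not the script from the Starter Package; the function name `stratified_kfold` and the toy labels are hypothetical, and the official scripts should be used for actual submissions.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, n_splits=5, seed=0):
    """Return (train_idx, test_idx) pairs that keep class proportions per fold."""
    rng = random.Random(seed)  # the seed fully determines the split
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal each class round-robin so every fold gets a near-equal share.
        for pos, idx in enumerate(members):
            folds[pos % n_splits].append(idx)

    splits = []
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train_idx, test_idx))
    return splits

# Toy imbalanced labels (hypothetical, not the real subtype counts).
labels = ["A"] * 50 + ["B"] * 25 + ["C"] * 10
splits = stratified_kfold(labels, n_splits=5, seed=2025)
```

Because the shuffle is driven only by the seed, two teams with different seeds obtain different folds, while a single team can regenerate its own folds exactly.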
Each blue team must use a different random seed to generate their dataset splits and run their method.
- For example, the config.yaml provided in the examples assigns the value 42 to the random seed in both the dataset and generator method configurations, and the submission zip examples use the value 41. You are therefore not allowed to use either of these values.
- Blue teams must run their experiments with a unique random seed that is consistent across the dataset and generator configurations in the config.yaml file.
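The seed rules above can be checked mechanically before submitting. The sketch below assumes the config has already been loaded into nested Python dictionaries and that the seed is stored under a key called `random_seed`; both the key name and the example layout are assumptions for illustration, so adapt them to the actual config.yaml in the Github repo.

```python
RESERVED_SEEDS = {41, 42}  # values already used by the provided examples

def collect_seeds(node, key="random_seed", path=""):
    """Recursively yield (path, value) for every occurrence of `key`."""
    if isinstance(node, dict):
        for k, v in node.items():
            child = f"{path}.{k}" if path else str(k)
            if k == key:
                yield child, v
            else:
                yield from collect_seeds(v, key, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from collect_seeds(item, key, f"{path}[{i}]")

def check_seeds(config, key="random_seed"):
    """Return the single seed in use, or raise ValueError if a rule is broken."""
    found = list(collect_seeds(config, key))
    if not found:
        raise ValueError(f"no '{key}' entries found")
    values = {value for _, value in found}
    if len(values) != 1:
        raise ValueError(f"inconsistent seeds: {found}")
    seed = values.pop()
    if seed in RESERVED_SEEDS:
        raise ValueError(f"seed {seed} is used by the examples; pick another")
    return seed

# Hypothetical config layout, for illustration only.
config = {
    "dataset": {"name": "TCGA-BRCA", "random_seed": 1234},
    "generator": {"name": "my_generator", "random_seed": 1234},
}
seed = check_seeds(config)
```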

Baseline methods

The following baseline methods and their respective performance on the selected evaluation metrics are provided as part of the Starter Package Github Repo for participants.

Multivariate normal sampling from average gene expression levels
Conditional Variational Autoencoder (CVAE, Sohn et al., 2015) without and with Differential Privacy
Conditional Generative Adversarial Networks (CTGAN, Xu et al., 2019) without and with Differential Privacy

Evaluation

The evaluation will be conducted based on multiple evaluation dimensions:

Utility: Downstream utility based on feature performance and feature importance:
- cancer subtype prediction for TCGA-BRCA
- cancer tissue-of-origin prediction for TCGA COMBINED
Fidelity: Measurement of how well the statistical properties of the original data are preserved in the synthetic data.
Biological plausibility: Conservation of correlation between features (genes). Detecting sets of co-expressed genes is a common step in transcriptomics data analysis.
Privacy: Assessment of privacy preservation of the generated synthetic datasets.

We provide several ready-to-use metrics in the Starter Package Github Repo. We strongly encourage you to explore additional metrics that could provide interesting insights into biological preservation, and to include these in your extended abstracts for CAMDA.
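As one illustration of what a correlation-conservation check could look like, the sketch below compares pairwise Pearson correlations between genes in the real and synthetic data and reports the mean absolute difference (lower means better conserved). This is not one of the metrics shipped in the Starter Package; the function names and toy matrices are hypothetical, and for 978 genes a vectorised library would be far faster than this pure-Python version.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: treat correlation as 0 (a convention)
    return cov / math.sqrt(vx * vy)

def gene_correlations(rows):
    """rows: list of samples, each a list with one expression value per gene."""
    columns = list(zip(*rows))
    return {(i, j): pearson(columns[i], columns[j])
            for i, j in combinations(range(len(columns)), 2)}

def mean_abs_corr_diff(real_rows, synth_rows):
    """Mean absolute difference between real and synthetic gene-gene correlations."""
    real = gene_correlations(real_rows)
    synth = gene_correlations(synth_rows)
    return sum(abs(real[pair] - synth[pair]) for pair in real) / len(real)

# Toy data: 4 samples x 3 genes (made up for illustration).
real = [[1.0, 2.0, 9.0], [2.0, 4.1, 7.0], [3.0, 6.2, 5.5], [4.0, 7.9, 3.0]]
synthetic_good = [[1.1, 2.3, 8.5], [2.2, 4.0, 7.2], [2.9, 6.0, 5.0], [4.1, 8.2, 3.1]]
synthetic_bad = [[1.0, 8.0, 9.0], [2.0, 2.0, 3.0], [3.0, 7.0, 5.0], [4.0, 1.0, 8.0]]

good_score = mean_abs_corr_diff(real, synthetic_good)
bad_score = mean_abs_corr_diff(real, synthetic_bad)
```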

✅ Submissions checklist

The following files are required for benchmark method submission, compressed in a zip file, for each dataset:

config.yaml: Config file containing your generator configurations and a team-specific random seed value that is consistent across the dataset config and your generator config. Please refer to the Github repo for details.
- ❗️Please note that because each team uses its own random seed value, each blue team has different real and synthetic datasets, determined by the dataset splits generated from that value.
{dataset_name}_split.yaml: Data split YAML file generated with the team-specific random seed value.
Five synthetic datasets and their corresponding subtype/type labels (a single column named Subtype, without an index), saved with the save_synthetic_data() function:
- e.g. synthetic_data_split_{split_no}.csv, synthetic_labels_split_{split_no}.csv, etc.
White-box code: Modified blue_team.py with your generator class included, following the instructions on Github, and any other .py files necessary to run your code.
environment.yaml: Environment file to run and reproduce the results.
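The label-file requirement (a single column named Subtype, without an index) is easy to get wrong when exporting by hand. The sketch below is a hypothetical pre-submission check, not part of the Starter Package and not a replacement for save_synthetic_data(); the helper name `check_labels_csv`, the placeholder class names, and the split number in the demo file name are assumptions.

```python
import csv
import os
import tempfile

def check_labels_csv(path, expected_header="Subtype"):
    """Verify a labels file has exactly one column named `expected_header`.

    Returns the number of label rows; raises ValueError on a malformed file.
    """
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    if not rows or rows[0] != [expected_header]:
        raise ValueError(
            f"{path}: header must be exactly {expected_header!r}, got {rows[:1]}"
        )
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != 1 or not row[0].strip():
            raise ValueError(f"{path}: line {line_no} should hold one label, got {row}")
    return len(rows) - 1

# Demo with placeholder class names (not the real subtype labels).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "synthetic_labels_split_1.csv")
    with open(good, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Subtype"])
        writer.writerows([["class_a"], ["class_b"], ["class_a"]])
    n_labels = check_labels_csv(good)
```

A file that was exported with an extra index column would have a two-field header and fail the first check.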

This means we expect two zip files from you, named in the following format:

blueteam_{yourteamname}_TCGA-BRCA.zip
blueteam_{yourteamname}_TCGA-COMBINED.zip
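The checklist and naming format can be turned into a quick self-check. The sketch below is hypothetical and rests on several assumptions that should be verified against the example submission zip: that {dataset_name} in the split file matches the dataset part of the zip name, that split numbers run from 1 to 5, and that files are matched by base name regardless of folders inside the zip.

```python
import re

ZIP_NAME = re.compile(
    r"^blueteam_(?P<team>.+)_(?P<dataset>TCGA-BRCA|TCGA-COMBINED)\.zip$"
)

def expected_files(dataset_name, split_numbers):
    """Build the set of file names the checklist asks for."""
    files = {
        "config.yaml",
        f"{dataset_name}_split.yaml",  # assumes dataset_name matches the zip name
        "blue_team.py",
        "environment.yaml",
    }
    for n in split_numbers:
        files.add(f"synthetic_data_split_{n}.csv")
        files.add(f"synthetic_labels_split_{n}.csv")
    return files

def missing_files(zip_name, members, split_numbers=range(1, 6)):
    """Return the sorted checklist files that are absent from `members`."""
    match = ZIP_NAME.match(zip_name)
    if match is None:
        raise ValueError(f"unexpected zip name: {zip_name}")
    present = {name.rsplit("/", 1)[-1] for name in members}
    return sorted(expected_files(match["dataset"], split_numbers) - present)

# Demo: a member list with one label file forgotten.
members = ["config.yaml", "TCGA-BRCA_split.yaml", "blue_team.py", "environment.yaml"]
members += [f"synthetic_data_split_{n}.csv" for n in range(1, 6)]
members += [f"synthetic_labels_split_{n}.csv" for n in range(1, 5)]
missing = missing_files("blueteam_myteam_TCGA-BRCA.zip", members)
```

For a real archive, the member list can be obtained with `zipfile.ZipFile(path).namelist()` from the standard library.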

Please take a look at the example submission zip file in the Starter Package Github Repo to double-check the correct formatting.

Good luck! 🍀

References

Please make sure to cite the following papers if any of the baseline methods and evaluation metrics are mentioned/utilized in your CAMDA extended abstracts.

Competition related

CAMDA 2025 Health Privacy Challenge

Dataset sources

Genomic Data Commons (GDC), https://gdc.cancer.gov/, https://portal.gdc.cancer.gov/, accessed on Nov 1, 2024

Dataset preprocessing

Chen, Dingfan, Marie Oestreich, Tejumade Afonja, Raouf Kerkouche, Matthias Becker, and Mario Fritz. "Towards biologically plausible and private gene expression data generation." arXiv preprint. (2024)
Love, Michael I., Wolfgang Huber, and Simon Anders. "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2." Genome biology. (2014)
Colaprico, Antonio, Tiago C. Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S. Sabedot et al. "TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data." Nucleic acids research. (2016)
Subramanian, A., Narayan, R., Corsello, S.M., Peck, D.D., Natoli, T.E., Lu, X., Gould, J., Davis, J.F., Tubelli, A.A., Asiedu, J.K. and Lahr, D.L. "A next generation connectivity map: L1000 platform and the first 1,000,000 profiles." Cell. (2017)
Landmark genes, https://clue.io/command?q=/gene-space%20lm, accessed on Nov 1, 2024

Generative models and evaluations

Sohn, Kihyuk, Honglak Lee, and Xinchen Yan. "Learning structured output representation using deep conditional generative models." Advances in neural information processing systems. (2015)
Xu, Lei, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. "Modeling tabular data using conditional gan." Advances in neural information processing systems. (2019)
Holsten, L., Dahm, K., Oestreich, M., Becker, M., & Ulas, T. "hCoCena: A toolbox for network-based co-expression analysis and horizontal integration of transcriptomic datasets." STAR protocols. (2024)
Lun ATL, McCarthy DJ, Marioni JC. “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res. (2016)

Membership inference attack models

Van Breugel, B., Sun, H., Qian, Z., & van der Schaar, M. "Membership inference attacks against synthetic data through overfitting detection." arXiv preprint. (2023)
Chen, D., Yu, N., Zhang, Y., and Fritz, M. "Gan-leaks: A taxonomy of membership inference attacks against generative models." Proceedings of the 2020 ACM SIGSAC conference on computer and communications security. (2020)
Hilprecht, B., Härterich, M., & Bernau, D. "Monte carlo and reconstruction membership inference attacks against generative models." Proceedings on Privacy Enhancing Technologies. (2019)
Hayes, J., Melis, L., Danezis, G. & De Cristofaro, E. "Logan: Membership inference attacks against generative models." arXiv preprint. (2019)

Challenge News

04/29/2025
Reminder: Upcoming submission deadlines.
04/07/2025
Attention Red Teams: The challenge is ready for your solutions!
04/01/2025
Leaderboards for Track I Phase 1 are online.
03/13/2025
Track II is now open for submissions!

Important Dates

Jan 14, 2025: Benchmark method submissions open.

Mar 15, 2025: Track I Phase 1 benchmark submission deadline for blue and red teams.

Mar 22, 2025: Leaderboard announcement.

May 12, 2025: Track I Phase 2 (for red teams) & Track II benchmark submission deadline.

May 15, 2025: CAMDA 2025 extended abstract submission deadline.

May 22, 2025: CAMDA 2025 acceptance notifications are sent.

Jul 23-24, 2025: CAMDA Conference @ ISMB/ECCB 2025 in Liverpool, UK.
