Participation Instructions - Health

General rules

The competition will be run in both a private and a non-private regime. Participants need to provide results for common differential privacy (DP) levels. The evaluation will be conducted along three dimensions:

  • Downstream utility: Classification accuracy when predicting cancer subtypes. The classifier is trained on the synthesized data.
  • Statistical properties: A measure of how well the statistical properties of the original data are preserved in the synthetic data. In particular, the marginal distributions and the joint distribution (based on kernel density estimates, KDE) will be assessed.
  • Biological plausibility: Conservation of the correlation between features (genes). Detecting sets of co-expressed genes is a common step in transcriptomics data analysis.

Track 1 - Transcriptome AML synthesization

Participants are tasked with submitting code that generates a synthesised copy of the TCGA AML cancer RNA sequencing dataset (2,300 individuals). This track focuses on RNA abundance estimates only. Further details and download information for this dataset are available here: TCGA TARGET-AML. The data can be downloaded with the GDC Data Transfer Tool using the following data manifest.

Track 2 - Multi-omics Breast Cancer synthesization 

Participants are tasked with submitting code that generates a synthesised copy of the TCGA BRCA cancer RNA and methylation sequencing dataset (1,098 individuals). This track focuses on the synthetic generation of a multi-modal dataset comprising both RNA and methylation information across a patient cohort. Further details and download information for this dataset are available here: TCGA-BRCA. The data can be downloaded with the GDC Data Transfer Tool using the following data manifest.

Submission file format

Submissions must be provided as software code that reads the input data for Track 1 or Track 2 and creates a synthesised dataset. Reference code that reads the source files for Tracks 1 and 2 and creates the respective output files will be provided.

Metrics

Downstream utility: Classification accuracy on a leukemia classification task. The classifier is trained on the synthesized data.
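
One way to read this metric is as a train-on-synthetic, test-on-real protocol. The sketch below assumes that the classifier is scored on held-out real samples, which is not stated above, and it uses a nearest-centroid classifier purely as a stand-in. The function names and the toy data are hypothetical and are not part of the challenge's reference code.

```python
from collections import defaultdict
from math import dist


def fit_centroids(features, labels):
    """Return the mean feature vector of each class (stand-in classifier)."""
    by_class = defaultdict(list)
    for row, label in zip(features, labels):
        by_class[label].append(row)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in by_class.items()
    }


def predict(centroids, row):
    """Assign a sample to the class with the closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], row))


def downstream_accuracy(syn_x, syn_y, real_x, real_y):
    """Train on synthetic data, then score on real data (assumed protocol)."""
    centroids = fit_centroids(syn_x, syn_y)
    correct = sum(
        predict(centroids, row) == label for row, label in zip(real_x, real_y)
    )
    return correct / len(real_y)


# Toy expression values for two genes; real data would have thousands of features.
syn_x = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0]]
syn_y = ["subtype_a", "subtype_a", "subtype_b", "subtype_b"]
real_x = [[0.1, 0.2], [5.2, 4.8], [0.3, 0.1]]
real_y = ["subtype_a", "subtype_b", "subtype_a"]

accuracy = downstream_accuracy(syn_x, syn_y, real_x, real_y)
print(f"accuracy on real samples: {accuracy:.2f}")
```

Scoring on real samples means that a generator which only memorises noise, or which collapses the subtypes together, is penalised even if its output looks plausible in isolation.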

Statistical properties: A measure of how well the statistical properties of the original data are preserved in the synthetic data. In particular, the marginal distributions and the joint distribution (based on kernel density estimates, KDE) will be assessed.
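
As a rough illustration of a KDE-based comparison, the sketch below fits a Gaussian kernel density estimate to a single feature in the real and in the synthetic data and integrates the absolute difference between the two densities. The choice of discrepancy, the fixed bandwidth, the grid size, and all names are assumptions made for illustration; the organisers' actual measure is not specified here, and the joint distribution would need a multivariate estimate.

```python
from math import exp, pi, sqrt


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of 1-D samples."""
    norm = len(samples) * bandwidth * sqrt(2 * pi)

    def density(x):
        return sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density


def marginal_discrepancy(real, synthetic, bandwidth=0.5, grid_size=200):
    """Approximate the integral of |f_real - f_syn| with a Riemann sum.

    The result lies between 0 (identical estimates) and roughly 2
    (estimates with no overlap). Bandwidth and grid size are assumed values.
    """
    low = min(min(real), min(synthetic)) - 3 * bandwidth
    high = max(max(real), max(synthetic)) + 3 * bandwidth
    step = (high - low) / (grid_size - 1)
    f_real = gaussian_kde(real, bandwidth)
    f_syn = gaussian_kde(synthetic, bandwidth)
    total = sum(
        abs(f_real(low + i * step) - f_syn(low + i * step))
        for i in range(grid_size)
    )
    return total * step


# Toy values of one gene across four samples.
real_gene = [0.0, 1.0, 2.0, 3.0]
close_copy = [0.1, 1.1, 2.1, 3.1]
poor_copy = [10.0, 11.0, 12.0, 13.0]

print(marginal_discrepancy(real_gene, close_copy))
print(marginal_discrepancy(real_gene, poor_copy))
```

Applied per feature, a score like this can be averaged over all genes to summarise how well the marginals are preserved.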

Biological plausibility: Genes that form a functional group, i.e. that are involved in the same biological pathways under a given condition, are often jointly regulated in their expression. Detecting sets of co-expressed genes is a common step in transcriptomics data analysis. To detect co-expressed genes, one computes the correlation between their expression values across samples. The evaluation then measures how well these correlations are conserved in the synthetic data after filtering at different minimum correlation values, e.g. to focus only on stronger co-expressions.
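
The co-expression check described above can be sketched as follows. The code computes Pearson correlations between gene pairs, keeps the pairs at or above a minimum correlation, and compares the resulting sets for real and synthetic data. Using the Jaccard overlap as the score, the particular thresholds, and every name below are assumptions for illustration only, not the organisers' definition.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of values."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def coexpressed_pairs(expression, min_corr):
    """Gene pairs whose correlation across samples is at least min_corr.

    `expression` maps a gene name to its expression values across samples.
    """
    return {
        (g1, g2)
        for g1, g2 in combinations(sorted(expression), 2)
        if pearson(expression[g1], expression[g2]) >= min_corr
    }


def coexpression_overlap(real, synthetic, thresholds=(0.5, 0.9)):
    """Jaccard overlap of co-expressed pairs at each threshold (assumed score)."""
    scores = {}
    for threshold in thresholds:
        real_pairs = coexpressed_pairs(real, threshold)
        syn_pairs = coexpressed_pairs(synthetic, threshold)
        union = real_pairs | syn_pairs
        scores[threshold] = len(real_pairs & syn_pairs) / len(union) if union else 1.0
    return scores


# Toy data: three genes measured in five samples.
real_expr = {
    "G1": [1, 2, 3, 4, 5],
    "G2": [2, 4, 6, 8, 10.5],
    "G3": [5, 3, 4, 1, 2],
}
syn_expr = {
    "G1": [1, 2, 3, 4, 5],
    "G2": [1, 3, 2, 5, 4],
    "G3": [2, 1, 4, 3, 5],
}

overlap = coexpression_overlap(real_expr, syn_expr)
print(overlap)
```

Raising the threshold restricts the comparison to the strongest co-expressions, so a generator that only reproduces weak correlations scores well at low thresholds and poorly at high ones.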

 

Challenge News

Important Dates

TBA