Track II - Health

🫐 Track II: Featuring Single-Cell Gene Expression 


Task definition 

Track II runs in a Blue Team (🫐) scheme, in which participants develop privacy-preserving methods to generate synthetic single-cell gene expression datasets. Unlike Track I, which is currently running, Track II focuses on a multi-sample-per-donor setting, where privacy is considered at the donor level.

We invite participants to explore the privacy and utility of synthetic single-cell RNA-seq data, for example by:

  • developing novel privacy-preserving generative methods that balance privacy and utility,

  • investigating and uncovering potential privacy risks associated with generating synthetic single-cell gene expression datasets,

  • proposing novel evaluation metrics and strategies that can assess utility and privacy preservation in a multi-sample setting.

 

Criteria for winner selection

Track II follows a free-form structure, and we will not use leaderboards for evaluation. The teams with the best solutions will be selected based on multiple criteria, such as:

  • 💡 the novelty of methods,

  • 🌱 the generation of novel insights into privacy-preservation in single-cell datasets. 

The winners will be invited to present their methods at the CAMDA Conference at ISMB 2025 in Liverpool. Travel fellowships, sponsored by ELSA (https://elsa-ai.eu), will be granted to participants.

 

🎒 Participation

To successfully participate in Track II, participants must:

  • Register through the ELSA benchmark platform to access the challenge dataset. We recommend registering with an organizational email if possible.

  • Submit their methods (code and relevant files) through the ELSA Benchmark Website.

  • Submit a CAMDA extended abstract detailing their submitted method through the ISMB/ECCB 2025 submission system.

We provide a GitHub Starter Package repository that includes a baseline method and a set of evaluation metrics to help you get started. Please connect with us through the CAMDA Health Privacy Challenge Google Group for questions, discussions, and upcoming announcements.

 

📊 Datasets

We redistribute raw counts of the OneK1K single-cell RNA-seq dataset, a cohort containing 1.26 million peripheral blood mononuclear cells (PBMCs) from 981 donors (Yazar et al., 2022). After initial filtering, the dataset is split by donor into train and test sets with approximately equal numbers of cells and similar cell-type distributions.

  • Train dataset: 633,711 cells from 490 donors, 25,834 genes

  • Test dataset: 634,022 cells from 491 donors, 25,834 genes

We share these datasets in AnnData format with the following annotations: individual, barcode_col, cell_type, cell_label.

The train dataset should be used to train the generator, and evaluation should be performed against the test set. The datasets are available for download on the ELSA Benchmark Website after registration.
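
As a minimal sketch, the training split can be loaded and its annotations inspected with the anndata package (the file name below is an assumption; use the name of the file you actually download):

    import anndata as ad

    # Hypothetical file name -- substitute the actual downloaded file.
    adata = ad.read_h5ad("onek1k_train.h5ad")

    print(adata)                                  # AnnData: n_obs (cells) x n_vars (genes)
    print(adata.obs["individual"].nunique())      # number of donors (expected: 490)
    print(adata.obs["cell_type"].value_counts())  # cell-type distribution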

We gratefully acknowledge the authors for granting permission to redistribute this valuable dataset for the challenge.

 

Baseline methods 

We include a Poisson model as a baseline synthetic data generator and report its performance on the selected evaluation metrics; both are provided in the Starter Package GitHub repo.
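
As a rough, unofficial sketch of what a per-gene Poisson generator can look like (the actual baseline in the Starter Package may differ, e.g. by conditioning on cell type; the function names here are ours):

    import numpy as np

    def fit_poisson_rates(counts: np.ndarray) -> np.ndarray:
        # Per-gene Poisson rate: mean raw count across training cells.
        return counts.mean(axis=0)

    def sample_cells(rates: np.ndarray, n_cells: int, seed: int = 0) -> np.ndarray:
        # Each synthetic cell draws every gene independently from Poisson(rate).
        rng = np.random.default_rng(seed)
        return rng.poisson(lam=rates, size=(n_cells, rates.shape[0]))

    # Toy usage with stand-in data shaped like a counts matrix (cells x genes).
    train_counts = np.random.default_rng(1).poisson(2.0, size=(100, 50))
    synthetic = sample_cells(fit_poisson_rates(train_counts), n_cells=200)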

We encourage participants to explore and compare their generation methods to more advanced synthetic data generation techniques, such as scDesign2 (Sun et al., 2021), scDiffusion (Luo et al., 2024), and others, for further benchmarking and evaluation.

 

📈 Evaluation metrics

To assess the similarity between synthetic and real cells, we evaluate the generated synthetic data using several metrics adopted from scDiffusion. For computational efficiency, we restrict these analyses to highly variable genes (HVGs). For MMD, we also introduce subsampling to avoid memory-related issues.

We provide these metrics to help kickstart your analysis, and we strongly encourage you to modify them to your needs and to explore additional or novel metrics that could offer deeper insights.
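
As an unofficial sketch of the two tricks mentioned above, the snippet below computes a subsampled RBF-kernel MMD on an HVG-reduced matrix; the kernel choice, bandwidth gamma, subsample size n, and function names are our assumptions, and the Starter Package contains the actual implementation:

    import numpy as np
    # HVG reduction can be done beforehand with scanpy's standard helper, e.g.
    # sc.pp.highly_variable_genes(adata, n_top_genes=2000).

    def rbf_mmd2(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
        # Biased estimate of squared MMD with k(a, b) = exp(-gamma * ||a - b||^2).
        def k(a, b):
            d2 = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
            return np.exp(-gamma * np.maximum(d2, 0.0))
        return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

    def subsampled_mmd2(real: np.ndarray, synth: np.ndarray,
                        n: int = 1000, seed: int = 0) -> float:
        # Subsample both sets so the O(n^2) kernel matrices fit in memory.
        rng = np.random.default_rng(seed)
        r = real[rng.choice(len(real), min(n, len(real)), replace=False)]
        s = synth[rng.choice(len(synth), min(n, len(synth)), replace=False)]
        return rbf_mmd2(r, s)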

 

1. Statistical Indicators

  • Spearman Correlation Coefficient (SCC), measures how well gene rankings correlate between the real and synthetic datasets, focusing on the most highly variable genes (see the sketch after this list).

  • Maximum Mean Discrepancy (MMD) (Gretton et al., 2012), measures distributional similarity. 

  • Local Inverse Simpson’s Index (LISI) (Korsunsky et al., 2019), measures how well real and synthetic cells mix together in a shared latent space. 

  • Adjusted Rand Index (ARI), measures how well real and synthetic cells cluster together. 

    • ARI (real vs. synthetic clusters) measures how well synthetic cells cluster similarly to real cells.

    • ARI (ground truth vs. combined clusters) checks whether synthetic data maintains biological structure.
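
The Starter Package defines the authoritative versions of these indicators; as one plausible reading, SCC can be computed as the Spearman correlation between per-gene mean expression of real and synthetic cells, and the second ARI variant as the agreement between ground-truth cell types and clusters of the combined data (k-means here stands in for whatever clustering the actual code uses):

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    def scc(real: np.ndarray, synth: np.ndarray) -> float:
        # Spearman correlation of per-gene mean expression (rank-based).
        rho, _ = spearmanr(real.mean(axis=0), synth.mean(axis=0))
        return rho

    def ari_ground_truth(combined: np.ndarray, cell_types: np.ndarray,
                         n_clusters: int, seed: int = 0) -> float:
        # High ARI: clusters of real + synthetic cells still track cell types.
        pred = KMeans(n_clusters=n_clusters, random_state=seed,
                      n_init=10).fit_predict(combined)
        return adjusted_rand_score(cell_types, pred)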

2. Non-Statistical Metrics

  • Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018), visualizes the structure of the synthetic and real cells in 2D. 

  • CellTypist classification (Domínguez Conde et al., 2022), measures whether cell type identity is retained.

    • We used the Immune_All_High model from CellTypist for inference; feel free to explore other existing CellTypist models.

  • Random forest evaluation measures whether synthetic and real cells can be distinguished (a minimal sketch follows this list).
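
A minimal sketch of such a discriminator (names and hyperparameters are our assumptions): cross-validated accuracy near 0.5 means the forest cannot separate real from synthetic cells, while accuracy near 1.0 signals the synthetic data is easy to tell apart.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def discriminator_accuracy(real: np.ndarray, synth: np.ndarray,
                               seed: int = 0) -> float:
        # Label real cells 0 and synthetic cells 1, then cross-validate a forest.
        X = np.vstack([real, synth])
        y = np.concatenate([np.zeros(len(real)), np.ones(len(synth))])
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()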

 

✅ Submissions checklist

The following files are required for benchmark method submission, compressed in a zip file, e.g. trackii_{yourteamname}_onek1k.zip:

  • config.yaml: Config file containing your generator configurations.

  • White-box code: A modified blue_team.py with your generator class included, following the instructions on GitHub, plus any other .py files needed to run your code (a hypothetical skeleton follows this list).

  • environment.yaml: Environment file to run and reproduce the results.
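
Purely as a hypothetical illustration of the kind of generator class a white-box submission wraps (the authoritative interface is defined in the Starter Package's blue_team.py; follow the instructions there rather than this sketch):

    import numpy as np

    class MyGenerator:
        # Hypothetical skeleton only -- method names and signatures are ours,
        # not the official blue_team.py interface.

        def __init__(self, config: dict):
            self.config = config   # e.g. parsed from your config.yaml
            self.rates = None

        def fit(self, train_counts: np.ndarray) -> None:
            # Toy fit: per-gene mean counts, as in the Poisson baseline sketch.
            self.rates = train_counts.mean(axis=0)

        def generate(self, n_cells: int) -> np.ndarray:
            rng = np.random.default_rng(self.config.get("seed", 0))
            return rng.poisson(self.rates, size=(n_cells, self.rates.shape[0]))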

 

Good luck! 🍀

 

References 

Please make sure to cite the following papers if any of the baseline methods or evaluation metrics are mentioned or utilized in your CAMDA extended abstract.

Competition related
  • CAMDA 2025 Health Privacy Challenge
Dataset sources
  • Yazar S., Alquicira-Hernández J., Wing K., Senabouth A., Gordon G., Andersen S., Lu Q., Rowson A., Taylor T., Clarke L., Maccora L., Chen C., Cook A., Ye J., Fairfax K., Hewitt A., Powell J. "Single-cell eQTL mapping identifies cell type-specific genetic control of autoimmune disease." Science. (2022) (https://onek1k.org)
Dataset preprocessing
  • Lun ATL, McCarthy DJ, Marioni JC. “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res. (2016)
  • Wolf, F. Alexander, Philipp Angerer, and Fabian J. Theis. "SCANPY: large-scale single-cell gene expression data analysis." Genome biology. (2018)
Evaluations
  • Luo, E., Hao, M., Wei, L., & Zhang, X. "scDiffusion: conditional generation of high-quality single-cell data using diffusion model." arXiv preprint. (2024)
  • Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., & Sriperumbudur, B. K. "Optimal kernel choice for large-scale two-sample tests." Advances in neural information processing systems. (2012)
  • Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., ... & Raychaudhuri, S. "Fast, sensitive and accurate integration of single-cell data with Harmony." Nature methods. (2019)
  • Luecken, M.D., Büttner, M., Chaichoompu, K. et al. "Benchmarking atlas-level data integration in single-cell genomics." Nat Methods. (2022)
  • McInnes, Leland, John Healy, and James Melville. "UMAP: Uniform manifold approximation and projection for dimension reduction." arXiv preprint. (2018)
  • Domínguez Conde, C., Xu, C., Jarvis, L. B., Rainbow, D. B., Wells, S. B., Gomes, T., ... & Teichmann, S. A. "Cross-tissue immune cell analysis reveals tissue-specific features in humans." Science. (2022)
Membership inference attack models
  • Van Breugel, B., Sun, H., Qian, Z., & van der Schaar, M. "Membership inference attacks against synthetic data through overfitting detection." arXiv preprint. (2023)
  • Chen, D, Yu, N., Zhang, Y., and Fritz, M. "Gan-leaks: A taxonomy of membership inference attacks against generative models." In Proceedings of the 2020 ACM SIGSAC conference on computer and communications security (2020)

Challenge News

Important Dates

Jan 14, 2025: Benchmark method submissions open. 

Mar 15, 2025: Phase 1 benchmark submission deadline for blue and red teams. 

Mar 22, 2025: Leaderboard announcement. 

May 12, 2025: Phase 2 benchmark submission deadline for red teams. 

May 15, 2025: CAMDA 2025 extended abstract submission deadline. 

May 22, 2025: CAMDA 2025 acceptance notifications are sent.