Overview - Health

HEALTH PRIVACY CHALLENGE - CAMDA 2025


Participate in the new Track II (Blue Team 🫐), featuring a single-cell gene expression dataset!

 

Introduction

Computational health research is centered on sensitive health-care data, including genomic, medical and phenotypic data. Progress in the field hinges on the ability to access these data to advance health care using analytical innovations, while simultaneously ensuring that sensitive information of data subjects is not disclosed. 

 

Privacy preservation is a broad term that encapsulates different approaches that aim to ensure the protection of sensitive information while enabling state-of-the-art solutions to extract insights from data, and can be implemented through mechanisms such as Differential Privacy (DP), federated learning (FL) and synthetic data generation.

 

Synthetic data generation supports privacy preservation by producing data points that are consistent with the distribution of the real data. Generative models, such as Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN), can be used for this purpose: they generate synthetic data that aims to maintain the utility of the original data while protecting privacy. However, the effectiveness of synthetic data generators in biology, and the extent to which they protect against adversarial attacks such as membership inference, remain underexplored.
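To make the membership inference risk concrete, here is a minimal sketch of one simple attack heuristic: score each candidate record by its distance to the closest synthetic record, on the assumption that a generator which memorizes training samples places synthetic points unusually close to them. This is an illustration only, not one of the challenge baselines; the toy 3-gene profiles and the decision threshold are hypothetical.

```python
import math

def min_distance(record, synthetic):
    """Euclidean distance from `record` to its nearest synthetic record."""
    return min(math.dist(record, s) for s in synthetic)

def membership_scores(candidates, synthetic):
    """Higher score = closer to the synthetic data = more likely a training member."""
    return [-min_distance(c, synthetic) for c in candidates]

# Toy data: hypothetical 3-gene expression profiles.
synthetic = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
candidates = [
    [1.0, 2.0, 3.1],  # almost reproduced by the generator
    [9.0, 0.0, 7.0],  # far from every synthetic record
]

scores = membership_scores(candidates, synthetic)
# Arbitrary threshold for illustration: flag records within distance 1.0.
predicted_member = [s > -1.0 for s in scores]
```

In practice the threshold would be calibrated on reference data, and the attack would be judged by how well its scores separate known members from non-members.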

 

📢 Challenge definition 

The Health Privacy Challenge runs within the Conference on Critical Assessment of Massive Data Analysis (CAMDA) @ ISMB / ECCB 2025. The challenge consists of two tracks.

Track I: Featuring bulk gene expression data  

Track I is designed in a “Blue Team (🫐) vs Red Team (🍅)” scheme: 

  • The blue teams develop privacy-preserving methods to generate synthetic gene expression datasets, 

  • The red teams launch membership inference attacks (MIA) against blue teams’ solutions.


 

While we encourage implementing Differential Privacy (DP), it is not required in order to participate in the challenge. We invite the participants to advance the field by developing:

  • 🚀 novel privacy-preserving generative methods that can mitigate privacy risks while preserving biological insights,

  • 🚀 trustworthy and realistic membership inference methods that can inform the community of the potential privacy risks.

Track I consists of two phases:

  • Phase 1: 

    • (🫐) Blue teams work towards developing methods that improve on the baseline methods and generate novel insights into privacy preservation in biological datasets,

    • (🍅) Red teams launch membership inference attacks (MIA) on the synthetic datasets generated by the baseline methods. 

  • Phase 2: 

    • After the end of Phase 1, a set of Blue team solutions will be selected based on their leaderboard performance as well as the novelty of their methods.

    • (🍅) During Phase 2, in which only Red teams participate, Red teams will launch MIA on the selected Blue team solutions. 

 

Track II: Featuring single-cell gene expression data  

Track II follows a Blue Team (🫐) scheme but maintains a free-form structure. Participants are encouraged to explore both the privacy and utility of synthetic single-cell RNA-seq data. This may include:

  • developing privacy-preserving generative methods that balance privacy and utility,

  • investigating privacy risks within synthetic scRNA-seq datasets,

  • proposing evaluation metrics to assess utility and privacy in a multi-sample setting.

While the Blue Team structure provides a foundation, participants have the freedom to probe privacy risks in synthetic single-cell RNA-seq datasets and innovate methods for safeguarding donor-level privacy.

 

We invite contributions from members of both the computational biology and the privacy communities, aiming to create new bridges between these fields. We look forward to engaging with you and working together to deepen our understanding of privacy in health care. 🤗

 

🎢 Participation 

To successfully participate in the challenge, participants must:

  • Register through the ELSA Benchmark Platform to access the challenge datasets. We recommend registering with an organizational email if possible. 

    • After registering, navigate to either the Blue Team or the Red Team page to create a team as a member. Once the team is created, you can use the team-specific invite link to invite other members to join. 

  • Submit their methods (code and relevant files) through the ELSA Benchmark Website, under the respective team page. 

    • (🫐) Track I Blue teams must make a benchmark submission by the Phase 1 deadline. They will have one benchmark submission overall.

    • (🍅) Track I Red teams must make two benchmark submissions, one by the Phase 1 deadline and one by the Phase 2 deadline. They will have two benchmark submissions overall.

    • (🫐) Track II participants must submit their method and relevant files by the Track II deadline.

  • Submit a CAMDA extended abstract that details the submitted method through the ISMB submission system. 

    • (🫐,🍅) Both Track I and Track II participants must submit an abstract by the CAMDA extended abstract submission deadline. 

We provide a GitHub Starter Package Repo that includes baseline methods and evaluation metrics for both teams. More details about the baseline methods, participation rules, submission files and timeline can be found in the repo.

Connect with us through CAMDA Health Privacy Challenge Google Groups for questions, discussions and to follow upcoming announcements.

 

📊 Datasets

Track I: Featuring bulk gene expression data  

We re-distribute two open access TCGA bulk RNA-seq datasets, which can be accessed from the GDC portal (portal.gdc.cancer.gov/), in the pre-processed form:

  • TCGA-BRCA <1,089 individuals x 978 genes> 

    • Five subtypes with an imbalanced distribution

    • Suitable for subtype prediction task 

  • TCGA COMBINED <4,323 individuals x 978 genes>

    • 10 cancer tissues, including Breast, Colorectal, Esophagus, Kidney, Liver, Lung, Ovarian, Pancreatic, Prostate, and Skin,  with an imbalanced distribution 

    • Suitable for cancer tissue-of-origin prediction task 

The datasets were downloaded using the TCGAbiolinks R package (Colaprico et al., 2016). For each dataset, low-count genes are removed and expression counts are normalized using the DESeq2 R package (Love et al., 2014). Following Chen, Oestreich, Afonja et al. (2024), the final datasets are reduced to the landmark genes (n=978) from LINCS L1000 (Subramanian et al., 2017), which were identified as representative genes that allow the inference of around 20K other genes. Each dataset consists of a single sample per donor.

More details about preprocessing are available in the GitHub Starter Package Repo.
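The filtering and landmark-gene reduction described above can be sketched in a few lines. This is a simplified illustration and not the organizers' pipeline, which uses R packages. The DESeq2 normalization step is omitted, and the function name `filter_and_subset`, the `min_total` threshold, and the gene names are all hypothetical.

```python
def filter_and_subset(counts, genes, landmark_genes, min_total=10):
    """Drop low-count genes, then keep only landmark genes.

    counts: list of per-sample rows of raw counts (one value per gene).
    genes: column names, aligned with each row of `counts`.
    min_total: hypothetical cutoff on a gene's summed count across samples.
    """
    totals = [sum(row[j] for row in counts) for j in range(len(genes))]
    landmark = set(landmark_genes)
    keep = [
        j for j, g in enumerate(genes)
        if totals[j] >= min_total and g in landmark
    ]
    kept_genes = [genes[j] for j in keep]
    reduced = [[row[j] for j in keep] for row in counts]
    return reduced, kept_genes

# Toy example: 2 samples x 4 genes.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
counts = [
    [50, 0, 12, 7],
    [43, 1, 30, 9],
]
landmark_genes = ["GENE_A", "GENE_B", "GENE_D"]

reduced, kept = filter_and_subset(counts, genes, landmark_genes)
# GENE_B is dropped as low-count; GENE_C is dropped as non-landmark.
```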

 

Track II: Featuring single-cell gene expression data  

We re-distribute raw counts of the OneK1K single-cell RNA-seq dataset, a cohort containing 1.26 million peripheral blood mononuclear cells (PBMCs) from 981 donors (Yazar et al., 2022). After an initial filtering, the dataset is split by donor into train and test sets with roughly equal numbers of cells and similar cell-type distributions.

  • Train dataset: <633,711 cells from 490 donors, 25,834 genes> 

  • Test dataset: <634,022 cells from 491 donors, 25,834 genes> 
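A donor-based split means that all cells from a given donor land in the same set, so no individual appears in both train and test. The following is a minimal sketch of that idea under simplifying assumptions: it halves the donor list at random and does not balance cell counts or cell-type distributions as the organizers' split does. The function name and donor IDs are hypothetical.

```python
import random

def donor_split(cell_donors, seed=0):
    """Split donors (not cells) into two disjoint sets.

    cell_donors: list holding the donor ID of each cell.
    Returns (train_donors, test_donors) as sets.
    """
    donors = sorted(set(cell_donors))
    rng = random.Random(seed)
    rng.shuffle(donors)
    half = len(donors) // 2
    return set(donors[:half]), set(donors[half:])

# Toy example: 8 cells from 5 donors.
cell_donors = ["d1", "d1", "d2", "d3", "d3", "d3", "d4", "d5"]
train_donors, test_donors = donor_split(cell_donors)

# Each cell follows its donor into exactly one of the two sets.
train_cells = [i for i, d in enumerate(cell_donors) if d in train_donors]
test_cells = [i for i, d in enumerate(cell_donors) if d in test_donors]
```

With an odd number of donors the integer halving puts the extra donor in the test set, so 981 donors would divide into 490 and 491, the same sizes as the sets listed above.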

We share these datasets in AnnData format with the following annotations: individual, barcode_col, cell_type, cell_label. We gratefully acknowledge the authors for granting permission to redistribute this valuable dataset for the challenge.

 

The datasets are available for download on the ELSA Benchmark Platform after registration.

 

🏆 Evaluation

The teams with the best solutions will be determined based on multiple criteria, including:

  • 🎯 leaderboard ranking (Track I only),

  • 💡 novelty of methods,

  • 🌱 generation of novel insights into privacy preservation in biology. 

Therefore, we strongly encourage the participants to submit their extended abstracts to be evaluated even if they might not have achieved a high ranking on the leaderboards.

 

The winning blue and red teams will be invited to present their methods at the CAMDA Conference at ISMB 2025 in Liverpool, and will be awarded travel fellowships sponsored by ELSA.


⏳ Timeline


📍January 14: Submissions open for Track I

  • Please adhere to the guidelines carefully to avoid invalidating your submission.

    • Both blue and red teams must submit a CAMDA extended abstract by the CAMDA submission deadline.

    • Blue teams must have one benchmark method submission by the Phase 1 deadline. 

    • Red teams must have two benchmark method submissions by the Phase 1 and Phase 2 deadlines, respectively. 

  • The teams that do not follow the above criteria will not be considered in the final evaluation. 

📍February 27: Submissions open for Track II 

  • Please adhere to the guidelines carefully to avoid invalidating your submission.

    • A CAMDA extended abstract must be submitted by the CAMDA submission deadline.

    • One benchmark method must be submitted by the Track II method submission deadline. 

  • The teams that do not follow the above criteria will not be considered in the final evaluation. 

📍March 15: First phase deadline, benchmark submissions for both teams.

  • Both teams must complete their benchmark method submission as explained in the submission rules. 

  • This will be the first and final submission for blue teams. 

📍March 22: Announcing leaderboards. 

  • Leaderboards for both teams will be announced based on the Phase 1 submissions. 

  • Red teams will be able to download synthetic datasets provided by the blue teams to work against.  

📍May 12: Track I second phase deadline, benchmark submissions for red teams. 

  • Red teams must submit their attacks against the blue team solutions. This will be the second and final submission for red teams. 

 📍May 12: Track II benchmark submission deadline. 

  • Track II participants must submit their solutions through the ELSA Benchmark submission system. 

 📍May 15: CAMDA extended abstract submission deadline for both Track I and II (ALL TEAMS!).  

  • To be considered for an oral presentation, teams must submit a CAMDA extended abstract describing the methods from their benchmark submission through the ISMB 2025 submission system. 

 

👥 Organization Team  

The Health Privacy Challenge is a collaborative effort between the European Molecular Biology Laboratory (EMBL), the CISPA Helmholtz Center for Information Security, and the University of Helsinki, with the support of the Barcelona Computer Vision Center (CVC), within the context of the ELSA Project.  

 

 

Good luck🌟


Important Dates

Jan 14, 2025: Benchmark method submissions open. 

Mar 15, 2025: Phase 1 benchmark submission deadline for blue and red teams. 

Mar 22, 2025: Leaderboard announcement. 

May 12, 2025: Phase 2 benchmark submission deadline for red teams. 

May 15, 2025: CAMDA 2025 extended abstract submission deadline. 

May 22, 2025: CAMDA 2025 acceptance notifications are sent.