method: Class-conditional diffusion model with differential privacy
Date: 2025-03-15
Authors: Oleksandr Husak, ...
Affiliation: NCT/DKFZ
Description:
## Model Architecture and Approach
Our implementation leverages a class-conditional diffusion model with differential privacy.
### Core Components
- Denoising Neural Network: A multi-layer architecture with residual connections
- Adaptive Group Normalization (AdaGN): Custom normalization that dynamically adjusts based on both temporal and class conditioning signals
- Embedding Systems (combined in the sketch after this list):
  - Sinusoidal time embeddings for diffusion timestep encoding
  - Learned class embeddings for condition-guided generation
  - Combined embedding projections to integrate multiple conditioning signals
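The submission does not include code, so the following is a minimal sketch of how AdaGN-style conditioning with combined sinusoidal time and learned class embeddings can be wired together. The module names, dimensions, and the additive combination of the two embeddings are illustrative assumptions, not the authors' exact implementation.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_time_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding of diffusion timesteps t (shape [B])."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class AdaGN(nn.Module):
    """GroupNorm whose scale and shift are predicted from a conditioning vector."""
    def __init__(self, num_groups: int, num_channels: int, cond_dim: int):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, num_channels, affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: [B, C]; cond: [B, cond_dim] carries time + class information
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift

# Illustrative combination of the two conditioning signals (assumed additive):
cond_dim, num_classes = 128, 10
class_emb = nn.Embedding(num_classes, cond_dim)
t = torch.randint(0, 1000, (16,))
labels = torch.randint(0, num_classes, (16,))
cond = sinusoidal_time_embedding(t, cond_dim) + class_emb(labels)
x = torch.randn(16, 64)  # a batch of 64-dimensional features
out = AdaGN(num_groups=8, num_channels=64, cond_dim=cond_dim)(x, cond)
```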
## Diffusion Process Design
- Noise Schedule: Implements the improved cosine beta schedule (DDPM++), providing several advantages (a sketch of the schedule and forward step follows this list):
  - Smoother noise progression throughout the diffusion process
  - Better sampling stability during the generative process
- Forward Process: Gradually adds noise according to the schedule, transforming real data into pure noise
- Reverse Process: Iteratively denoises from random noise to synthetic samples, guided by class conditioning
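As a concrete reference, here is a sketch of the cosine beta schedule and the forward noising step in the standard DDPM formulation; hyperparameters such as the offset s = 0.008 follow the published schedule, not necessarily the team's code.

```python
import torch

def cosine_beta_schedule(T: int, s: float = 0.008) -> torch.Tensor:
    """Cosine schedule: betas derived from
    alpha_bar(t) = cos^2(((t / T + s) / (1 + s)) * pi / 2)."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    alpha_bar = torch.cos(((steps / T + s) / (1 + s)) * torch.pi / 2) ** 2
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()

betas = cosine_beta_schedule(T=1000)
alpha_bar = torch.cumprod(1 - betas, dim=0)

def q_sample(x0: torch.Tensor, t: torch.Tensor):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    noise = torch.randn_like(x0)
    ab = alpha_bar[t][:, None]  # cumulative alpha at each sample's timestep
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise, noise
```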
## Privacy Preservation Mechanisms
- Differential Privacy Integration: Implemented via Opacus (see the sketch after this list)
- Privacy-Utility Tradeoff: Optimized through multi-objective hyperparameter tuning with the MOASHA algorithm
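A minimal sketch of the Opacus integration, using the library's `PrivacyEngine.make_private_with_epsilon` API; the model, data, and all numeric values below are placeholders, and the actual budget was selected by the hyperparameter search.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Stand-ins for the denoiser, optimizer, and data; all numbers are placeholders.
model = torch.nn.Linear(100, 100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = TensorDataset(torch.randn(256, 100), torch.randint(0, 10, (256,)))
train_loader = DataLoader(data, batch_size=32)

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    target_epsilon=10.0,   # placeholder budget; the tuned value comes from HPO
    target_delta=1e-5,     # placeholder delta
    epochs=100,            # planned number of training epochs
    max_grad_norm=1.0,     # per-sample gradient clipping norm
)
# After training, the spent budget can be queried:
# epsilon = privacy_engine.get_epsilon(delta=1e-5)
```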
## Training Methodology
- Class-Weighted Loss Function: Addresses class imbalance with inverse square-root weighting, preventing majority classes from dominating the loss while keeping training stable (see the sketch after this list)
- Learning Rate Scheduling: Tested various schedulers (OneCycleLR, ReduceLROnPlateau, etc.) and selected ReduceLROnPlateau.
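A sketch of inverse square-root class weighting applied to a per-sample noise-prediction loss; the normalization of the weights to mean 1 is an assumption made for readability, not a stated detail of the submission.

```python
import torch
import torch.nn.functional as F

def class_weights(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Weights proportional to 1/sqrt(n_c), normalized to mean 1."""
    counts = torch.bincount(labels, minlength=num_classes).float().clamp(min=1)
    w = counts.rsqrt()
    return w * (num_classes / w.sum())

def weighted_diffusion_loss(eps_pred, eps_true, labels, weights):
    """Per-sample MSE on the predicted noise, reweighted by class."""
    per_sample = F.mse_loss(eps_pred, eps_true, reduction="none").mean(dim=-1)
    return (weights[labels] * per_sample).mean()
```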
## Evaluation Framework
- Validation during training:
  - Real-to-synthetic validation (train on real, test on synthetic)
  - Synthetic-to-real validation (train on synthetic, test on real)
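Both validation directions can be realized with any probe classifier; below is a sketch using scikit-learn logistic regression, which is an assumed choice rather than the authors' documented one, with toy stand-in data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def cross_domain_accuracy(X_train, y_train, X_test, y_test) -> float:
    """Train a probe classifier on one domain and score it on the other."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

# Toy stand-ins for real and synthetic expression matrices:
X_real, y_real = np.random.rand(200, 50), np.random.randint(0, 3, 200)
X_syn, y_syn = np.random.rand(200, 50), np.random.randint(0, 3, 200)

rts = cross_domain_accuracy(X_real, y_real, X_syn, y_syn)      # real-to-synthetic
str_acc = cross_domain_accuracy(X_syn, y_syn, X_real, y_real)  # synthetic-to-real
```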
## Implementation Details
- PyTorch Lightning: Structured training and evaluation loops
- Hyperparameter Optimization: Multi-objective tuning balancing privacy budget (epsilon) and model utility (val_loss)
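MOASHA (multi-objective asynchronous successive halving, available e.g. in Syne Tune) is not reproduced here; the sketch below only illustrates the Pareto-dominance criterion such a search optimizes over (epsilon, val_loss), using made-up trial values.

```python
def dominates(a: dict, b: dict) -> bool:
    """a dominates b if it is no worse on both objectives and better on one."""
    return (a["epsilon"] <= b["epsilon"] and a["val_loss"] <= b["val_loss"]
            and (a["epsilon"] < b["epsilon"] or a["val_loss"] < b["val_loss"]))

def pareto_front(trials: list[dict]) -> list[dict]:
    """Trials not dominated by any other trial; both objectives are minimized."""
    return [t for t in trials if not any(dominates(o, t) for o in trials)]

trials = [  # made-up illustration values
    {"epsilon": 3.0, "val_loss": 0.41},
    {"epsilon": 7.0, "val_loss": 0.28},
    {"epsilon": 7.0, "val_loss": 0.35},  # dominated by the previous trial
]
print(pareto_front(trials))  # keeps the first two trials
```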
method: NoisyDiffusion
Date: 2025-03-15
Authors: Jules Kreuer, Sofiane Ouaari
Affiliation: Methods in Medical Informatics - University of Tübingen
Description: The model implements a diffusion-based generative approach to the provided gene expression data with privacy considerations. At its core, the architecture uses residual linear blocks with group normalisation.
The diffusion process progressively adds Gaussian noise to the data according to one of several noise scheduling schemes (linear, cosine, or power-based; sketched below). During training, the model learns to predict and remove this noise. For reverse sampling, the model iteratively denoises random Gaussian samples, guided by class conditioning, to generate synthetic cancer data.
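A sketch of the three schedule families follows; the exact endpoints and the power exponent are assumptions, since the submission does not specify them.

```python
import torch

def beta_schedule(kind: str, T: int) -> torch.Tensor:
    """Illustrative linear, cosine, and power-based beta schedules."""
    if kind == "linear":
        return torch.linspace(1e-4, 0.02, T)
    if kind == "power":  # assumed form: linear ramp warped by an exponent
        return 1e-4 + (0.02 - 1e-4) * torch.linspace(0, 1, T) ** 2
    if kind == "cosine":  # Nichol & Dhariwal-style cosine schedule
        steps = torch.arange(T + 1, dtype=torch.float64)
        alpha_bar = torch.cos((steps / T + 0.008) / 1.008 * torch.pi / 2) ** 2
        return (1 - alpha_bar[1:] / alpha_bar[:-1]).clamp(max=0.999).float()
    raise ValueError(f"unknown schedule: {kind}")
```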
Key technical features include sinusoidal position embeddings for time-step encoding, attention blocks for capturing complex relationships within the data, and a privacy approach that combines differential privacy (by adding calibrated noise to gradients) and strong regularisation.
Additional improvements come from early stopping and learning-rate scheduling via OneCycleLR.
We explored attention and post-processing techniques such as outlier clipping, but discarded them as too computationally expensive for the benefit they provided.
The training process also uses gradient clipping to improve numerical stability; a toy loop combining these measures is sketched below.
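This sketch puts the stability measures together: OneCycleLR scheduling, gradient clipping, and patience-based early stopping. The model, data, objective, and all thresholds are stand-ins, not the team's configuration.

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Linear(100, 100)                         # stand-in for the denoiser
train_loader = [torch.randn(32, 100) for _ in range(50)]  # toy batches
epochs, patience = 200, 10                                # placeholder values
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = OneCycleLR(optimizer, max_lr=1e-3, total_steps=epochs * len(train_loader))

best_val, bad_epochs = float("inf"), 0
for epoch in range(epochs):
    for x in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), x)  # stand-in objective
        loss.backward()
        # Gradient clipping for numerical stability:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
    val_loss = loss.item()  # stand-in for a proper validation pass
    if val_loss < best_val - 1e-4:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break
```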
method: Synthetic RNA-seq Data Generation with Private-PGM
Date: 2025-03-14
Authors: Shane Menzies, Sikha Pentyala, Daniil Filienko, Jineta Banerjee, Martine De Cock
Affiliation: University of Washington Tacoma
Description: We adopt the Private-PGM method implemented by Chen et al. in "Towards Biologically Plausible and Private Gene Expression Data Generation" (PETS 2024), available from https://github.com/MarieOestreich/PRO-GENE-GEN. We tuned its parameters for RNA-seq data, using quantile binning with 4 bins per feature (gene) and a differential privacy budget of epsilon = 7. Experiments were run on the TACC Frontera / NAIRR Pilot.
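Quantile binning into 4 bins per gene can be done with pandas, as sketched below; this illustrates the preprocessing step only, not the PRO-GENE-GEN code itself, and the example data is synthetic.

```python
import numpy as np
import pandas as pd

def quantile_bin(df: pd.DataFrame, n_bins: int = 4) -> pd.DataFrame:
    """Discretize each gene (column) into quantile bins, as preprocessing
    for Private-PGM, which operates on discrete domains."""
    return df.apply(lambda col: pd.qcut(col, q=n_bins, labels=False,
                                        duplicates="drop"))

expr = pd.DataFrame(np.random.lognormal(size=(100, 5)),
                    columns=[f"gene_{i}" for i in range(5)])
binned = quantile_bin(expr)  # integer bin labels 0..3 per gene
```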
Columns are grouped as Utility (Accuracy through number of overlapping important features), Fidelity (MMD score through distance to the closest), and Privacy (the MIA columns).

| Date | Method | Accuracy (real) | Accuracy (synthetic) | AUPR (real) | AUPR (synthetic) | Number of overlapping important features | MMD score | Discriminative score | Distance to the closest (real) | Distance to the closest (synthetic) | MC MIA AUC | GAN-leaks MIA AUC | MC MIA PR AUC | GAN-leaks MIA PR AUC | MC MIA TPR@FPR=0.01 | GAN-leaks MIA TPR@FPR=0.01 | MC MIA TPR@FPR=0.1 | GAN-leaks MIA TPR@FPR=0.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-03-15 | Class-conditional diffusion model with differential privacy | 87.33% | 9.91% | 87.14% | 24.44% | 1.8 | 0.2697 | 99.80% | 24.0301 | 9,887,871.1730 | 49.37% | 50.00% | 84.77% | 90.00% | 49.70% | 100.00% | 49.70% | 100.00% |
| 2025-03-15 | NoisyDiffusion | 87.05% | 77.50% | 85.70% | 75.96% | 18.2 | 0.0109 | 60.39% | 24.0183 | 25.1569 | 52.33% | 53.95% | 80.97% | 82.86% | 1.56% | 3.44% | 10.56% | 13.87% |
| 2025-03-14 | Synthetic RNA-seq Data Generation with Private-PGM | 86.23% | 81.54% | 87.10% | 76.30% | 15.6 | 0.0086 | 86.38% | 24.0422 | 27.2186 | 50.16% | 50.52% | 80.16% | 80.60% | 1.40% | 1.38% | 10.67% | 10.56% |
| 2025-03-08 | Non-negative matrix factorization distorted input for CVAE | 85.67% | 82.09% | 86.00% | 74.57% | 15.4 | 0.0228 | 83.43% | 24.0493 | 26.4565 | 50.62% | 51.34% | 80.27% | 81.03% | 1.33% | 1.79% | 10.65% | 11.36% |
| 2025-03-16 | Baseline (Multivariate) | 86.41% | 82.09% | 85.77% | 83.42% | 20.6 | 0.0166 | 54.36% | 24.0532 | 28.3770 | 52.33% | 52.79% | 81.15% | 81.75% | 1.58% | 2.34% | 11.34% | 11.92% |
| 2025-03-16 | Synthetic RNA-seq Data Generation with Private-PGM (e = 10) | 86.23% | 82.92% | 87.10% | 77.58% | 15 | 0.0074 | 77.69% | 24.0422 | 27.0686 | 50.42% | 50.18% | 80.22% | 80.16% | 1.29% | 1.01% | 10.19% | 9.94% |