Authors: Deborah Boyenval

Affiliation: University of Helsinki

Email: deborah.boyenval@helsinki.fi

Description: We use TabPFN v2.5, a published tabular foundation model developed by Prior Labs (https://doi.org/10.1038/s41586-024-08328-6). TabPFN was originally designed for supervised tasks such as classification and regression, but data generation is supported through the official tabpfn-extensions package, which provides a model named `TabPFNUnsupervisedModel` that estimates an approximate joint distribution of the features. Pretrained weights and implementation details are available at https://huggingface.co/Prior-Labs/tabpfn_2_5; running version 2.5 requires exporting an HF_TOKEN because access to the repository is gated.
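A minimal sketch of how this pipeline can be driven from Python is shown below. The import paths, constructor arguments (`tabpfn_clf`, `tabpfn_reg`) and the `generate_synthetic_data` call reflect our reading of the tabpfn-extensions documentation and should be checked against the installed version; the expression matrix here is a toy stand-in.

```python
import os
import numpy as np

from tabpfn import TabPFNClassifier, TabPFNRegressor
from tabpfn_extensions.unsupervised import TabPFNUnsupervisedModel

# TabPFN 2.5 weights live in a gated Hugging Face repository, so a token
# must be available before the model is instantiated.
os.environ.setdefault("HF_TOKEN", "<your-hugging-face-token>")

# Toy stand-in for a samples x genes expression matrix.
X = np.random.rand(500, 200)

# The unsupervised wrapper combines a classifier and a regressor to model
# an approximate joint distribution over the input features.
model = TabPFNUnsupervisedModel(
    tabpfn_clf=TabPFNClassifier(),
    tabpfn_reg=TabPFNRegressor(),
)

# "Fitting" is in-context learning: the table becomes the context of a
# forward pass; no weights are updated.
model.fit(X)

# Sample synthetic rows from the estimated joint distribution.
X_synth = model.generate_synthetic_data(n_samples=1000)
```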

TabPFN follows the foundation model paradigm: it is pretrained on over one hundred million synthetic datasets and can be used directly for downstream tasks without fine-tuning. Architecturally, TabPFN is a transformer trained to approximate the posterior predictive distribution for tabular inputs. During inference, it performs in-context learning: the "fit" stage corresponds to a single forward pass rather than parameter updates. During the pretraining stage, the model learns to handle a broad range of relationships among features. This setup behaves like amortized Bayesian inference, where the computational cost is paid during pretraining, enabling immediate, out-of-the-box use.

According to a recent report from Prior Labs, TabPFN-2.5 is designed to handle datasets of up to approximately 50K rows and 2K features, a major scalability improvement over TabPFN-v2, which was limited to around 10K samples and 500 features. While TabPFN is extremely efficient at inference, the `TabPFNUnsupervisedModel` has a practical limit of roughly 200 features: its autoregressive factorization of the joint distribution causes the conditioning context to grow rapidly, leading to instability and occasional segmentation faults. To remain within stable bounds, we restricted each TCGA dataset to the 200 most hypervariable genes. Synthetic samples are generated by sampling from the conditional distributions produced at each autoregressive step. This sampling procedure has its own hyperparameters, such as a temperature parameter controlling sample diversity, that do not modify the model weights. The temperature is in principle eligible for optimisation, but we used default values and performed no hyperparameter optimisation.
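For concreteness, a simple variance-based selection of the 200 most variable genes could look like the sketch below. The exact hypervariability criterion (e.g. dispersion on log-normalised counts) and the name of the temperature argument (`t`) are assumptions, not part of the description above.

```python
import numpy as np
import pandas as pd

def top_variable_genes(expr: pd.DataFrame, n_genes: int = 200) -> pd.DataFrame:
    """Keep the n_genes columns (genes) with the highest variance across samples."""
    variances = expr.var(axis=0)
    keep = variances.sort_values(ascending=False).index[:n_genes]
    return expr.loc[:, keep]

# Toy samples x genes matrix standing in for a TCGA expression table.
expr = pd.DataFrame(np.random.rand(300, 5000))

# Stay within the stable ~200-feature range of TabPFNUnsupervisedModel.
expr_200 = top_variable_genes(expr, n_genes=200)

# Sampling diversity is controlled by a temperature-like parameter; we kept
# the default and did no hyperparameter tuning (argument name is illustrative).
# X_synth = model.generate_synthetic_data(n_samples=1000, t=1.0)
```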

Authors: Oleksandr Husak, ...

Affiliation: NCT/DKFZ

Description:

## Model Architecture and Approach
Our implementation leverages a class-conditional diffusion model with differential privacy.

### Core Components
- Denoising Neural Network: A multi-layer architecture with residual connections
- Adaptive Group Normalization (AdaGN): Custom normalization that dynamically adjusts based on both temporal and class conditioning signals
- Embedding Systems (see the sketch after this list):
  - Sinusoidal time embeddings for diffusion timestep encoding
  - Learned class embeddings for condition-guided generation
  - Combined embedding projections to integrate multiple conditioning signals
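A minimal PyTorch sketch of the two conditioning building blocks named above, written for 1-D tabular features. The layer sizes, the class-embedding dimension, and the exact way scale/shift are derived from the combined conditioning vector are assumptions, not the submission's actual code.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_time_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding of integer diffusion timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    angles = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class AdaGN(nn.Module):
    """Group norm whose scale/shift are predicted from a conditioning vector
    (time embedding + class embedding)."""

    def __init__(self, num_channels: int, cond_dim: int, num_groups: int = 8):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, num_channels, affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift

# Combine time and class conditioning, then modulate a feature block.
t_emb = sinusoidal_time_embedding(torch.randint(0, 1000, (16,)), dim=128)
class_emb = nn.Embedding(10, 128)(torch.randint(0, 10, (16,)))
cond = torch.cat([t_emb, class_emb], dim=-1)
h = AdaGN(num_channels=256, cond_dim=256)(torch.randn(16, 256), cond)
```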

## Diffusion Process Design
- Noise Schedule: Implements the improved cosine beta schedule (DDPM++), providing several advantages (see the schedule sketch after this list):
  - Smoother noise progression throughout the diffusion process
  - Better sampling stability during the generative process
- Forward Process: Gradually adds noise according to the schedule, transforming real data into pure noise
- Reverse Process: Iteratively denoises from random noise to synthetic samples, guided by class conditioning
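A sketch of the cosine beta schedule referenced above, following the usual Nichol & Dhariwal formulation; the offset `s` and the clipping bound are the commonly used defaults, not values taken from this submission.

```python
import torch

def cosine_beta_schedule(timesteps: int, s: float = 0.008) -> torch.Tensor:
    """Cosine noise schedule: betas derived from a squared-cosine cumulative
    alpha curve, clipped for numerical stability."""
    steps = torch.arange(timesteps + 1, dtype=torch.float64)
    alphas_cumprod = torch.cos(((steps / timesteps) + s) / (1 + s) * torch.pi / 2) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0, 0.999)

betas = cosine_beta_schedule(timesteps=1000)
```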

## Privacy Preservation Mechanisms
- Differential Privacy Integration: Implemented via Opacus (a minimal usage sketch follows this list)
- Privacy-Utility Tradeoff: Optimized through multi-objective hyperparameter tuning (MOASHA algorithm)
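A minimal sketch of wrapping the training setup with Opacus; the model, optimizer, data loader and the epsilon/delta/epochs values are illustrative placeholders, since the real budget was chosen by the multi-objective tuner.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy denoiser, optimizer and loader standing in for the real training setup.
model = torch.nn.Sequential(torch.nn.Linear(200, 256), torch.nn.ReLU(), torch.nn.Linear(256, 200))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data_loader = DataLoader(TensorDataset(torch.randn(1024, 200)), batch_size=64)

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    target_epsilon=8.0,   # illustrative privacy budget; the real value was tuned
    target_delta=1e-5,
    epochs=100,
    max_grad_norm=1.0,    # per-sample gradient clipping bound
)
```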

## Training Methodology
- Class-Weighted Loss Function: Addresses class imbalance using inverse square-root weighting (sketched after this list) to prevent over-representation of majority classes while stabilizing training
- Learning Rate Scheduling: Tested various schedulers (OneCycleLR, ReduceLROnPlateau, etc.) and selected ReduceLROnPlateau.
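A short sketch of the inverse square-root weighting mentioned above; the label vector is a toy example and the final normalisation of the weights is an assumption.

```python
import numpy as np
import torch
import torch.nn as nn

# Toy label vector with an imbalanced class distribution.
y_train = np.repeat([0, 1, 2], [900, 90, 10])

# Inverse square-root weighting: softer than inverse-frequency weighting,
# so majority classes are down-weighted without destabilising the loss.
class_counts = np.bincount(y_train)
weights = 1.0 / np.sqrt(class_counts)
weights = weights / weights.sum() * len(class_counts)   # normalisation is an assumption

criterion = nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))
```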

## Evaluation Framework
- Validation during training (sketched after this list):
  - Real-to-synthetic validation (train on real, test on synthetic)
  - Synthetic-to-real validation (train on synthetic, test on real)
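A sketch of the two validation directions using a simple proxy classifier; logistic regression and the toy data here are assumptions, and any downstream classifier could be substituted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def cross_domain_score(X_fit, y_fit, X_eval, y_eval) -> float:
    """Fit a proxy classifier on one domain and score it on the other."""
    clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
    return accuracy_score(y_eval, clf.predict(X_eval))

# Toy stand-ins for real and synthetic expression matrices with class labels.
X_real, y_real = np.random.rand(400, 50), np.random.randint(0, 3, 400)
X_synth, y_synth = np.random.rand(400, 50), np.random.randint(0, 3, 400)

real_to_synth = cross_domain_score(X_real, y_real, X_synth, y_synth)   # train real, test synthetic
synth_to_real = cross_domain_score(X_synth, y_synth, X_real, y_real)   # train synthetic, test real
```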

## Implementation Details
- PyTorch Lightning: Structured training and evaluation loops
- Hyperparameter Optimization: Multi-objective tuning balancing privacy budget (epsilon) and model utility (val_loss); a minimal logging skeleton follows this list
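A skeleton of how the two tuning objectives could be exposed from a LightningModule; the `denoiser.loss` call and the way the Opacus engine reports epsilon are illustrative, not the submission's actual code.

```python
import pytorch_lightning as pl

class DiffusionModule(pl.LightningModule):
    """Sketch: log the two objectives consumed by the multi-objective tuner."""

    def __init__(self, denoiser, privacy_engine=None, delta: float = 1e-5):
        super().__init__()
        self.denoiser = denoiser
        self.privacy_engine = privacy_engine
        self.delta = delta

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = self.denoiser.loss(x, y)     # hypothetical noise-prediction loss
        self.log("val_loss", loss)          # utility objective
        if self.privacy_engine is not None:
            # Privacy objective: epsilon spent so far at a fixed delta.
            self.log("epsilon", self.privacy_engine.get_epsilon(self.delta))
        return loss
```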

method: NoisyDiffusion (2025-03-15)

Authors: Jules Kreuer, Sofiane Ouaari

Affiliation: Methods in Medical Informatics - University of Tübingen

Description: The model applies a diffusion-based generative approach to the provided gene expression data with privacy considerations. At its core, the architecture is a diffusion model built from residual linear blocks with group normalisation.
In the forward process, Gaussian noise is progressively added to the data according to one of several noise schedules (linear, cosine, or power-based); during training, the model learns to predict and remove this noise (a minimal sketch of the forward step follows this paragraph). For reverse sampling, the model iteratively denoises random Gaussian samples, guided by class conditioning, to generate synthetic cancer data.
Key technical features include sinusoidal position embeddings for time-step encoding, attention blocks for capturing complex relationships within the data, and a privacy approach that combines differential privacy (adding calibrated noise to gradients) with strong regularisation.
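A compact sketch of the forward (noising) step under one of the schedules mentioned above; the linear schedule endpoints are the standard DDPM defaults, and the cosine or power variants would only change how `betas` is built.

```python
import torch

def linear_beta_schedule(timesteps: int) -> torch.Tensor:
    """One of the explored schedules; cosine and power variants are analogous."""
    return torch.linspace(1e-4, 0.02, timesteps)

def q_sample(x0: torch.Tensor, t: torch.Tensor, alphas_bar: torch.Tensor) -> torch.Tensor:
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alphas_bar[t].unsqueeze(-1)                  # (batch, 1)
    noise = torch.randn_like(x0)
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

betas = linear_beta_schedule(1000)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)
x0 = torch.randn(16, 200)                            # toy batch of expression vectors
t = torch.randint(0, 1000, (16,))
xt = q_sample(x0, t, alphas_bar)                     # noised batch the model learns to denoise
```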

Additional improvements come from early stopping and learning rate scheduling via OneCycleLR.
We explored attention and post-processing techniques such as outlier clipping, but discarded them as too computationally expensive and not beneficial enough.
The training process also includes gradient clipping to improve numerical stability (see the sketch below).
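An illustrative training loop combining the OneCycleLR schedule and gradient-norm clipping described above; the model, data loader and loss are placeholders rather than the submission's actual implementation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(200, 200)                 # stand-in for the denoiser
loader = DataLoader(TensorDataset(torch.randn(256, 200)), batch_size=32)
num_steps = 10 * len(loader)                      # 10 illustrative epochs

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1e-3, total_steps=num_steps)

for epoch in range(10):
    for (x,) in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), x)   # placeholder for the denoising loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # stability
        optimizer.step()
        scheduler.step()
```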

Ranking Table

Metrics are grouped into utility, fidelity, and privacy categories.

| Date | Method | Accuracy (real) | Accuracy (synthetic) | AUPR (real) | AUPR (synthetic) | Number of overlapping important features | MMD score | Discriminative score | Distance to the closest (real) | Distance to the closest (synthetic) | MC MIA AUC | GAN-leaks MIA AUC | MC MIA PR AUC | GAN-leaks MIA PR AUC | MC MIA TPR@FPR=0.01 | GAN-leaks MIA TPR@FPR=0.01 | MC MIA TPR@FPR=0.1 | GAN-leaks MIA TPR@FPR=0.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-11-27 | Synthetic RNA-seq data generation with a foundation model: TabPFN version 2.5 | 0.00% | 0.00% | 0.00% | 0.00% | 0.0 | 0.0000 | 0.00% | 0.0000 | 0.0000 | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| 2025-03-15 | Class-conditional diffusion model with differential privacy | 87.33% | 9.91% | 87.14% | 24.44% | 1.8 | 0.2697 | 99.80% | 24.0301 | 9,887,871.1730 | 49.37% | 50.00% | 84.77% | 90.00% | 49.70% | 100.00% | 49.70% | 100.00% |
| 2025-03-15 | NoisyDiffusion | 87.05% | 77.50% | 85.70% | 75.96% | 18.2 | 0.0109 | 60.39% | 24.0183 | 25.1569 | 52.33% | 53.95% | 80.97% | 82.86% | 1.56% | 3.44% | 10.56% | 13.87% |
| 2025-03-14 | Synthetic RNA-seq Data Generation with Private-PGM | 86.23% | 81.54% | 87.10% | 76.30% | 15.6 | 0.0086 | 86.38% | 24.0422 | 27.2186 | 50.16% | 50.52% | 80.16% | 80.60% | 1.40% | 1.38% | 10.67% | 10.56% |
| 2025-03-08 | Non-negative matrix factorization distorted input for CVAE | 85.67% | 82.09% | 86.00% | 74.57% | 15.4 | 0.0228 | 83.43% | 24.0493 | 26.4565 | 50.62% | 51.34% | 80.27% | 81.03% | 1.33% | 1.79% | 10.65% | 11.36% |
| 2025-03-16 | Baseline (Multivariate) | 86.41% | 82.09% | 85.77% | 83.42% | 20.6 | 0.0166 | 54.36% | 24.0532 | 28.3770 | 52.33% | 52.79% | 81.15% | 81.75% | 1.58% | 2.34% | 11.34% | 11.92% |
| 2025-03-16 | Synthetic RNA-seq Data Generation with Private-PGM (e = 10) | 86.23% | 82.92% | 87.10% | 77.58% | 15 | 0.0074 | 77.69% | 24.0422 | 27.0686 | 50.42% | 50.18% | 80.22% | 80.16% | 1.29% | 1.01% | 10.19% | 9.94% |

Ranking Graphic
