method: VineNMF: Privacy-Preserving Synthetic Single-Cell RNA-seq via NMF-Compressed Truncated Vine Copulas (2026-04-26)
Authors: Andrew Wicks
Affiliation: DKFZ
Description: We present Vine-NMF, a three-stage generative model for privacy-preserving synthetic single-cell RNA-seq (scRNA-seq) data.
Raw counts from 490 training donors (634k cells, OneK1K cohort) are normalized, log-transformed, and reduced to 3,000 highly variable genes (HVGs). NMF (k=75 components, GPU-accelerated) compresses the HVG matrix into a biologically interpretable factor space. A truncated C-vine copula with Frank pair copulas (truncation level T=3) is fitted per cell type in this factor space. New cells are generated by sampling from the vine and projecting back through the NMF factor-gene matrix; counts are drawn from per-gene per-cell-type Negative Binomial distributions with overdispersion estimated by Method of Moments (θ∈[5,1000]).
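The final count-sampling step can be illustrated with a minimal Python sketch. It assumes the common Negative Binomial parameterization Var = μ + μ²/θ, since the description does not spell out the estimator. The names `mom_theta` and `sample_nb` and the toy data are hypothetical and not taken from the Vine-NMF code. In the actual pipeline the mean would come from the NMF back-projection rather than from the raw gene means used here.

```python
import numpy as np

def mom_theta(counts, lo=5.0, hi=1000.0):
    """Method-of-moments NB dispersion per gene, clipped to [lo, hi].

    Assumes Var = mu + mu^2 / theta, hence theta = mu^2 / (Var - mu).
    `counts` is a (cells x genes) array for a single cell type.
    """
    mu = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    excess = var - mu
    # Genes without overdispersion (Var <= mu) get the upper bound,
    # which makes the NB nearly Poisson.
    theta = np.full_like(mu, hi, dtype=float)
    pos = excess > 0
    theta[pos] = mu[pos] ** 2 / excess[pos]
    return np.clip(theta, lo, hi)

def sample_nb(mu, theta, rng):
    """Draw NB counts with mean `mu` and dispersion `theta` (broadcastable)."""
    p = theta / (theta + mu)  # NumPy's (n, p) form with n = theta
    return rng.negative_binomial(theta, p)

# Toy data standing in for one cell type: true mean 10, true theta 10.
rng = np.random.default_rng(0)
real = rng.negative_binomial(10.0, 0.5, size=(5000, 3))
theta = mom_theta(real)
mu_cells = np.tile(real.mean(axis=0), (1000, 1))  # hypothetical per-cell means
synthetic = sample_nb(mu_cells, theta, rng)
```

Clipping keeps genes with unstable variance estimates from producing extreme dispersions, which matches the θ∈[5,1000] range stated above.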
Architectural choices were determined by component-wise ablation across copula family, vine structure, truncation level, NMF rank, and HVG count. Frank copulas, C-vine topology, T=3 truncation, k=75 NMF rank, and 3,000 HVGs were identified as optimal.
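As an illustration of the pair-copula building block named above, the following Python sketch samples from a single bivariate Frank copula by conditional inversion. It is not the Vine-NMF implementation: a full truncated C-vine chains many such pair copulas, which is not shown, and the function name `sample_frank` and the parameter name `alpha` (the Frank dependence parameter, unrelated to the NB θ) are assumptions made for this example.

```python
import numpy as np

def sample_frank(alpha, n, rng):
    """Draw n pairs (u, v) from a Frank copula with parameter alpha != 0.

    Uses conditional inversion: draw u and a quantile level w uniformly,
    then solve dC(u, v)/du = w for v in closed form.
    """
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)
    a = np.exp(-alpha * u)
    g = np.expm1(-alpha)                 # e^{-alpha} - 1
    b = w * g / (a - w * (a - 1.0))      # equals e^{-alpha * v} - 1
    v = -np.log1p(b) / alpha
    return u, v

rng = np.random.default_rng(0)
u, v = sample_frank(5.0, 20000, rng)     # alpha > 0 gives positive dependence
```

Both margins stay uniform on [0, 1], so the dependence structure can be fitted separately from the per-factor marginal distributions.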
The central finding is that NMF compression, not vine truncation, is the structural privacy mechanism. Three membership inference attacks (GAN-Leaks, RF classifier on NMF pseudobulk features, NMF-DOMIAS) were run across vine truncation levels T=0 to T=10; attack AUROC remained flat at random chance (0.489–0.507) regardless of vine complexity. NMF maps 634k cells to 75 shared biological factors, averaging out individual-level variation by construction.
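For readers unfamiliar with the metric, the sketch below shows one standard way to compute a donor-level attack AUROC from membership scores, using the rank (Mann-Whitney) formulation. The function name `auroc` and the toy scores are hypothetical; the submission's own evaluation code may differ.

```python
import numpy as np

def auroc(scores, is_member):
    """AUROC as P(member score > non-member score) + 0.5 * P(tie)."""
    scores = np.asarray(scores, dtype=float)
    is_member = np.asarray(is_member, dtype=bool)
    pos, neg = scores[is_member], scores[~is_member]
    diff = pos[:, None] - neg[None, :]   # all member vs non-member pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)

perfect = auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])   # attack separates donors
chance = auroc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])    # attack is uninformative
```

A value near 0.5, as reported for all truncation levels, means the attacker's scores rank members and non-members no better than a coin flip.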
We additionally conducted a GDPR Article 4 singling-out attack using the predicate singling-out (PSO) framework and a bottleneck escape assay across k∈{20,30,50,75,100}. Uniqueness correlation ρ between real and synthetic donor profiles crosses zero at k=75, identifying the minimum NMF rank sufficient to destroy individual donor distinctiveness.
Final metrics: SCC=0.924, LISI=0.708, RF ROC-AUC=0.829, MIA AUROC=0.505. Code is available at https://github.com/AndrewJWicks/Vine-NMF-Single-Cell-Generator
method: scMAMA-MIA (2026-05-06)
Authors: Steven Golob, Patrick McKeever, Sikha Pentyala, Martine De Cock
Affiliation: University of Washington Tacoma
Description: Team PPML-Huskies — CAMDA 2026 Track II
Blue Team. Three generators, all using Scanpy HVG selection before fitting.
1. scDesign2+DP (ε=100): scDesign2 fits per-cell-type Gaussian copulas with ZINB/NB/Poisson marginals, augmented with a Gaussian DP mechanism on the copula covariance parameters (donor-level ε=100).
2. scVI: A VAE with a ZINB observation model; synthetic cells are sampled from the learned latent distribution.
3. ZINBWave: A zero-inflated negative binomial factor model (Risso et al. 2018) that handles excess zeros via a low-rank latent factor structure.
Red Team. We submit scMAMA-MIA, a Mahalanobis-based membership inference attack adapted from the MAMA-MIA framework (Golob et al., in submission).
A shadow scDesign2 copula is fit on the synthetic dataset (black-box). For each target cell, expression is mapped to [0,1] via marginal CDFs and a Mahalanobis distance is computed under the shadow copula. When auxiliary reference data is available (BB+aux mode), a second copula is fit on that data and the per-cell score becomes log(d_aux) − log(d_synth), contrasting proximity to the synthetic distribution with proximity to the reference distribution. Without auxiliary data (BB−aux mode), the raw inverse Mahalanobis distance is used directly. Both modes are further augmented by a log-likelihood ratio term over secondary genes. Cell scores are z-scored, passed through a sigmoid, and averaged per donor to yield a donor-level membership probability.
The attack exploits the fact that scDesign2's copula directly encodes training-set gene covariance, which places training cells anomalously close to the synthetic copula relative to non-members.
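The scoring steps described above can be sketched in Python under simplifying assumptions. This is not the scMAMA-MIA code: it substitutes a Gaussian copula with empirical marginal CDFs for scDesign2's fitted ZINB/NB/Poisson marginals, adds a small ridge term for invertibility, and omits the log-likelihood ratio term over secondary genes. All function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_inv_norm = np.vectorize(NormalDist().inv_cdf)  # standard normal quantile

def to_normal_scores(X, ref):
    """Map each column to (0, 1) via the empirical CDF of `ref`, then to z-scores."""
    n = ref.shape[0]
    U = np.empty(X.shape, dtype=float)
    for j in range(X.shape[1]):
        ranks = np.searchsorted(ref[:, j], X[:, j], side="right")
        U[:, j] = (ranks + 0.5) / (n + 1.0)     # strictly inside (0, 1)
    return _inv_norm(U)

def fit_gaussian_copula(X):
    """Shadow copula: sorted columns for the CDFs plus inverse correlation."""
    ref = np.sort(X, axis=0)
    R = np.corrcoef(to_normal_scores(X, ref), rowvar=False)
    R = R + 1e-3 * np.eye(R.shape[0])           # assumed ridge, not from the source
    return ref, np.linalg.inv(R)

def mahalanobis(X, copula):
    ref, R_inv = copula
    Z = to_normal_scores(X, ref)
    return np.sqrt(np.einsum("ij,jk,ik->i", Z, R_inv, Z))

def donor_scores(X, donors, synth_copula, aux_copula=None, eps=1e-9):
    """Per-donor membership probability from per-cell distances."""
    d_synth = mahalanobis(X, synth_copula) + eps
    if aux_copula is not None:                  # BB+aux: log(d_aux) - log(d_synth)
        s = np.log(mahalanobis(X, aux_copula) + eps) - np.log(d_synth)
    else:                                       # BB-aux: inverse distance
        s = 1.0 / d_synth
    z = (s - s.mean()) / (s.std() + eps)        # z-score across target cells
    p = 1.0 / (1.0 + np.exp(-z))                # sigmoid
    return {d: float(p[donors == d].mean()) for d in np.unique(donors)}

# Toy stand-ins for synthetic, auxiliary, and target expression matrices.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
synth = rng.multivariate_normal([0, 0], cov, size=300)
aux = rng.multivariate_normal([0, 0], np.eye(2), size=300)
targets = rng.multivariate_normal([0, 0], cov, size=40)
donors = np.repeat(np.array(["d1", "d2", "d3", "d4"]), 10)

no_aux = donor_scores(targets, donors, fit_gaussian_copula(synth))
with_aux = donor_scores(targets, donors, fit_gaussian_copula(synth),
                        fit_gaussian_copula(aux))
```

In both modes a higher score means a cell sits closer to the synthetic copula, so donors whose cells score high on average are flagged as likely training members.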
Reference: Golob et al. (2026). "Privacy Vulnerabilities in Synthetic Single-Cell RNA-Sequence Data." Under review. https://www.biorxiv.org/content/10.64898/2026.01.22.701160v1
Sun, Tianyi, Dongyuan Song, Wei Vivian Li, and Jingyi Jessica Li. "scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured." Genome Biology 22, no. 1 (2021): 163.
Lopez, Romain, Jeffrey Regier, Michael B. Cole, Michael I. Jordan, and Nir Yosef. "Deep generative modeling for single-cell transcriptomics." Nature Methods 15, no. 12 (2018): 1053-1058.
Risso, Davide, Fanny Perraudeau, Svetlana Gribkova, Sandrine Dudoit, and Jean-Philippe Vert. "A general and flexible method for signal extraction from single-cell RNA-seq data." Nature Communications 9, no. 1 (2018): 284.
method: scLDM with Hybrid Poisson Fill: Privacy-Preserving Synthetic OneK1K via Latent Diffusion (2026-05-06)
Authors: Buse Giledereli, Başar Temiz, Amirreza Sattarzadeh Khanehbargh
Affiliation: Boğaziçi University
Description: We adapt scLDM (Palla et al. 2025, arXiv:2511.02986), a two-stage latent-diffusion generator (transformer VAE + Diffusion Transformer with classifier-free guidance), to the ELSA OneK1K Track II benchmark. scLDM is trained on the top 1,000 highly variable genes for 50 + 50 epochs (VAE then DiT) under cell_type conditioning, with per-cell-type log-library-size factors estimated on the train set to recover realistic Negative Binomial count magnitudes.