method: Baseline (Poisson) 2025-04-30

Authors: EMBL

Affiliation: EMBL

Description: Implements synthetic scRNA-seq generation based on the Poisson distribution.
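A minimal sketch of what Poisson-based generation could look like is given below. The description does not say how the Poisson rates are estimated, so fitting one mean per gene within each cell type is an assumption made here for illustration; the function name `sample_poisson_baseline` and the toy data are hypothetical and are not taken from the baseline code.

```python
import numpy as np

def sample_poisson_baseline(counts, labels, n_per_type=None, seed=0):
    """Draw synthetic cells from per-cell-type, per-gene Poisson means.

    Assumption: rates are the empirical gene means within each cell type.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    synth, synth_labels = [], []
    for cell_type in np.unique(labels):
        mask = labels == cell_type
        mu = counts[mask].mean(axis=0)  # per-gene mean for this cell type
        n = int(mask.sum()) if n_per_type is None else n_per_type
        synth.append(rng.poisson(mu, size=(n, counts.shape[1])))
        synth_labels.extend([cell_type] * n)
    return np.vstack(synth), np.array(synth_labels)

# Toy data: 40 cells x 3 genes, two cell types of 20 cells each.
toy_rng = np.random.default_rng(1)
real = toy_rng.poisson(lam=[[1.0, 5.0, 0.2]], size=(40, 3))
labels = np.array(["A"] * 20 + ["B"] * 20)
X_syn, y_syn = sample_poisson_baseline(real, labels)
```

By default the sketch returns as many synthetic cells per type as there are real ones; `n_per_type` overrides that.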

Authors: Andrew Wicks

Affiliation: DKFZ

Description: We start by using Non-negative Matrix Factorization (NMF) to learn a compact, interpretable embedding of the high-dimensional single-cell count matrix. By decomposing the non-negative counts V into a basis H (metagene signatures) and coefficients W (cell-specific loadings), so that V ≈ WH, we capture key biological structure (cell-type signals, batch effects, and variability) in a low-dimensional latent space that is amenable to clustering and synthetic sampling.
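The factorization described above can be sketched with scikit-learn's `NMF`. The number of components, the initialization, and the toy matrix are assumptions chosen for illustration, not settings reported by the authors.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy cells x genes count matrix standing in for real scRNA-seq counts.
rng = np.random.default_rng(0)
V = rng.poisson(2.0, size=(60, 30)).astype(float)

# k = 5 components is an arbitrary illustrative choice.
model = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)  # cell-specific loadings, shape (cells, k)
H = model.components_       # metagene signatures, shape (k, genes)

# Relative reconstruction error of V ≈ W @ H.
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Each row of `W` is the low-dimensional embedding of one cell, which is what the later clustering step operates on.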

H-Perturbation: add Gaussian noise to the basis H under budget ε_nmf, yielding a DP-protected factorization.
DP Clustering: apply KMeans to embeddings W, optionally perturbing centroids with Laplace noise under budget ε_kmeans.
Summary Sanitization: compute per-cluster means, variances, and zero-rates on the original counts, then add Laplace noise under budget ε_sum when sampling-DP is enabled.
Sampling: draw synthetic counts either from Poisson(μ) or from a zero-inflated negative binomial parameterized by the (noisy) summaries.
Label Assignment: train a Random Forest on real W and true labels, then assign cell-type labels to each synthetic profile by sampling from the model’s predicted probabilities.
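The clustering, summary sanitization, and sampling steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `sensitivity` value is a placeholder (a real DP guarantee needs a sensitivity derived from bounded data), the ZINB parameters come from a crude moment match that treats the noisy mean and variance as those of the negative binomial component, and all function names are hypothetical. H-perturbation, centroid perturbation, and label assignment are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def laplace_noise(values, epsilon, sensitivity, rng):
    """Laplace mechanism: add noise with scale sensitivity / epsilon."""
    return values + rng.laplace(0.0, sensitivity / epsilon, size=np.shape(values))

def cluster_summaries(counts, cluster_ids, k):
    """Per-cluster mean, variance, and zero-rate for every gene."""
    out = []
    for c in range(k):
        X = counts[cluster_ids == c]
        out.append((X.mean(axis=0), X.var(axis=0), (X == 0).mean(axis=0)))
    return out

def sample_cluster(mu, var, zero_rate, n, method, rng):
    """Draw n synthetic cells from Poisson(mu) or a moment-matched ZINB."""
    mu = np.clip(mu, 1e-8, None)  # noisy summaries can leave the valid range
    if method == "poisson":
        return rng.poisson(mu, size=(n, mu.size))
    # Negative binomial: var = mu + mu^2 / theta  =>  theta = mu^2 / (var - mu)
    theta = mu**2 / np.clip(var - mu, 1e-8, None)
    p = theta / (theta + mu)
    nb_zero = p**theta  # P(X = 0) under the negative binomial component
    # Extra zero mass needed to reach the (noisy) observed zero-rate.
    pi = np.clip((np.clip(zero_rate, 0, 1) - nb_zero) / (1 - nb_zero + 1e-12), 0, 1)
    draws = rng.negative_binomial(theta, p, size=(n, mu.size))
    keep = rng.random((n, mu.size)) >= pi  # False marks an inflated zero
    return draws * keep

rng = np.random.default_rng(0)
counts = rng.poisson(3.0, size=(100, 8))  # toy cells x genes
W = rng.random((100, 4))                  # stand-in for NMF loadings
k = 3
ids = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(W)

synthetic = []
for mu, var, zr in cluster_summaries(counts, ids, k):
    mu, var, zr = (laplace_noise(s, epsilon=1.0, sensitivity=0.1, rng=rng)
                   for s in (mu, var, zr))
    synthetic.append(sample_cluster(mu, var, zr, n=20, method="zinb", rng=rng))
synthetic = np.vstack(synthetic)
```

Clipping after the noise step matters because Laplace noise can push means below zero and zero-rates outside [0, 1]; clipping is post-processing and does not consume additional privacy budget.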

All DP modes (all, nmf, kmeans, sampling, none) and the sampling method choice (zinb vs. poisson) are controlled entirely via CLI flags and the config file. The final synthetic datasets are saved as compressed sparse AnnData files whose filenames encode both the experiment name and the DP setting.
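How such a command-line interface might be wired up is sketched below. Only the mode names (all, nmf, kmeans, sampling, none) and the sampler names (zinb, poisson) come from the description; the flag names `--experiment`, `--dp-mode`, and `--sampling`, the reading of `all` as "every stage enabled", and the filename pattern are hypothetical.

```python
import argparse

DP_MODES = ("all", "nmf", "kmeans", "sampling", "none")
SAMPLERS = ("zinb", "poisson")

def build_parser():
    # Flag names are hypothetical; only the choices come from the description.
    parser = argparse.ArgumentParser(description="Illustrative CLI sketch")
    parser.add_argument("--experiment", default="exp")
    parser.add_argument("--dp-mode", choices=DP_MODES, default="none")
    parser.add_argument("--sampling", choices=SAMPLERS, default="poisson")
    return parser

def active_stages(dp_mode):
    """Map a DP mode to the set of pipeline stages that receive noise."""
    if dp_mode == "all":
        return {"nmf", "kmeans", "sampling"}
    if dp_mode == "none":
        return set()
    return {dp_mode}

def output_name(experiment, dp_mode):
    """Hypothetical filename pattern encoding experiment name and DP setting."""
    return f"{experiment}_dp-{dp_mode}.h5ad"

args = build_parser().parse_args(
    ["--experiment", "demo", "--dp-mode", "kmeans", "--sampling", "zinb"]
)
stages = active_stages(args.dp_mode)
filename = output_name(args.experiment, args.dp_mode)
```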

method: scDesign2 Poisson Ensemble 2025-05-13

Authors: Patrick McKeever, Daniil Filienko, Steven Golob, Shane Menzies, Sikha Pentyala, Luca Foschini, Jineta Banerjee, Martine De Cock

Affiliation: University of Washington Tacoma, Sage Bionetworks

Description: To generate single-cell data, we used a hybrid approach combining scDesign2 and the Poisson generation code of the baseline model. scDesign2 was trained on all highly variable genes (1118 in total) detected by scanpy, while independent Poisson models generated the remaining genes. This approach substantially reduced the training time and memory requirements of scDesign2 while retaining similar adjusted Rand index (ARI) scores. The split is reasonable because the copula-based generation method of scDesign2 preserves correlations between highly variable genes, while non-highly-variable genes can be accurately approximated with individual distributions.
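The hybrid idea can be sketched in Python under several stated assumptions. scDesign2 is an R package, so the code below substitutes a simple Gaussian copula with Poisson marginals for it, and it picks "highly variable" genes by a variance-to-mean ranking instead of calling scanpy. All function names and the toy data are hypothetical; the sketch only shows how a correlated model for a gene subset and independent Poisson models for the rest can be combined into one matrix.

```python
import numpy as np
from scipy import stats

def split_genes(counts, n_top):
    """Stand-in for scanpy HVG detection: rank genes by variance / mean."""
    mean = counts.mean(axis=0) + 1e-8
    dispersion = counts.var(axis=0) / mean
    hv = np.argsort(dispersion)[::-1][:n_top]
    rest = np.setdiff1d(np.arange(counts.shape[1]), hv)
    return hv, rest

def copula_poisson_sample(X, n, rng):
    """Gaussian copula with Poisson marginals (simplified stand-in for scDesign2)."""
    mu = X.mean(axis=0) + 1e-8
    # Rank-based normal scores of the observed counts.
    ranks = stats.rankdata(X, axis=0) / (X.shape[0] + 1)
    Z = stats.norm.ppf(ranks)
    corr = np.nan_to_num(np.corrcoef(Z, rowvar=False))
    np.fill_diagonal(corr, 1.0)
    corr += 1e-6 * np.eye(corr.shape[0])  # small ridge for numerical stability
    z = rng.multivariate_normal(np.zeros(mu.size), corr, size=n)
    u = np.clip(stats.norm.cdf(z), 1e-9, 1 - 1e-9)
    return stats.poisson.ppf(u, mu).astype(int)

def hybrid_sample(counts, n, n_top, seed=0):
    """Copula model on the top genes, independent Poisson on the remainder."""
    rng = np.random.default_rng(seed)
    hv, rest = split_genes(counts, n_top)
    out = np.zeros((n, counts.shape[1]), dtype=int)
    out[:, hv] = copula_poisson_sample(counts[:, hv], n, rng)
    out[:, rest] = rng.poisson(counts[:, rest].mean(axis=0), size=(n, rest.size))
    return out, hv

# Toy data: a shared per-cell factor induces correlation between genes.
toy_rng = np.random.default_rng(0)
rates = toy_rng.gamma(2.0, 2.0, size=(200, 1)) * np.ones((1, 12))
counts = toy_rng.poisson(rates)
syn, hv_idx = hybrid_sample(counts, n=50, n_top=4)
```

The cost saving comes from the correlation matrix: it grows quadratically with the number of genes in the copula, so restricting it to the highly variable subset keeps it small while the remaining genes need only one mean each.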

The novel contribution of our work will be an investigation of the privacy-preserving properties of scDesign2. Our extended abstract, to be submitted Wednesday, will provide more detail.

Full code, including models, is available here: https://github.com/Patrick-McKeever/camda_hpc/tree/main