- Task 4 - Track II: Single-cell RNAseq - Method: Differentially-Private Non-Negative Matrix Factorization
- Method info
method: Differentially-Private Non-Negative Matrix Factorization
Date: 2025-05-11
Authors: Andrew Wicks
Affiliation: DKFZ
Description: We first use Non‐negative Matrix Factorization (NMF) to learn a compact, interpretable embedding of the high-dimensional single-cell count matrix. Decomposing the non-negative counts V ≈ WH into a basis H (metagene signatures) and coefficients W (cell-specific loadings) captures key biological structure (cell-type signals, batch effects, and variability) in a low-dimensional latent space that is amenable to clustering and synthetic sampling.
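As a minimal sketch of the factorization step, the snippet below computes V ≈ WH with Lee-Seung multiplicative updates in NumPy. The solver, the function name `nmf`, the rank `k`, and the iteration count are illustrative assumptions; the description does not state which NMF implementation or settings the submission uses.

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0, eps=1e-10):
    """Factorize non-negative V (cells x genes) as V ~ W @ H.

    W: (cells x k) cell-specific loadings
    H: (k x genes) metagene signatures
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = V.shape
    W = rng.random((n_cells, k)) + eps
    H = rng.random((k, n_genes)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the Frobenius loss; every factor is
        # non-negative, so W and H stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy count matrix standing in for real single-cell data (assumption).
rng = np.random.default_rng(1)
V = rng.poisson(2.0, size=(50, 30)).astype(float)
W, H = nmf(V, k=5)
print(W.shape, H.shape)
```

Each row of W is the low-dimensional embedding of one cell, which is what the clustering and label-assignment steps operate on.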
H-Perturbation: add Gaussian noise to the basis H under budget ε_nmf, yielding a DP-protected factorization.
DP Clustering: apply KMeans to embeddings W, optionally perturbing centroids with Laplace noise under budget ε_kmeans.
Summary Sanitization: compute per-cluster means, variances, and zero-rates on the original counts, then add Laplace noise under budget ε_sum when sampling-DP is enabled.
Sampling: draw synthetic counts either from Poisson(μ) or from a zero-inflated negative binomial parameterized by the (noisy) summaries.
Label Assignment: train a Random Forest on real W and true labels, then assign cell-type labels to each synthetic profile by sampling from the model’s predicted probabilities.
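The noise and sampling steps above can be sketched as follows. This is an illustration under stated assumptions, not the submission's code: the description gives no sensitivities, δ, or budget split, so `sensitivity=1.0`, `delta=1e-5`, and the even three-way split of the summary budget are placeholders. The mapping from summaries to ZINB parameters (moment-matched dispersion, zero-rate used directly as the inflation probability) is also a simplification. Clustering and label assignment are omitted, and cluster labels are taken as given.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(H, epsilon, delta=1e-5, sensitivity=1.0):
    """Classic Gaussian mechanism (valid for 0 < epsilon < 1 in its standard
    analysis); delta and sensitivity are assumed values."""
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    # Clipping at zero is post-processing and does not weaken the guarantee.
    return np.clip(H + rng.normal(0.0, sigma, size=H.shape), 0.0, None)

def laplace_noise(x, epsilon, sensitivity=1.0):
    """Laplace mechanism with scale = sensitivity / epsilon."""
    return x + rng.laplace(0.0, sensitivity / epsilon, size=np.shape(x))

def cluster_summaries(counts, labels, epsilon=None):
    """Per-cluster gene means, variances, zero-rates; noisy if epsilon is set."""
    out = {}
    for c in np.unique(labels):
        X = counts[labels == c]
        mu, var, zero = X.mean(axis=0), X.var(axis=0), (X == 0).mean(axis=0)
        if epsilon is not None:
            # Assumption: budget split evenly across the three statistics.
            mu, var, zero = (laplace_noise(s, epsilon / 3.0)
                             for s in (mu, var, zero))
        # Project noisy values back into valid parameter ranges.
        out[c] = (np.clip(mu, 1e-8, None),
                  np.clip(var, 0.0, None),
                  np.clip(zero, 0.0, 1.0))
    return out

def sample_cells(mu, var, zero, n, method="zinb"):
    """Draw n synthetic cells from Poisson(mu) or a zero-inflated NB."""
    if method == "poisson":
        return rng.poisson(mu, size=(n, mu.size))
    # Moment matching: var = mu + mu^2 / r  =>  r = mu^2 / (var - mu).
    r = mu**2 / np.clip(var - mu, 1e-8, None)
    p = r / (r + mu)  # NumPy's NB(n=r, p) then has mean mu
    nb = rng.negative_binomial(r, p, size=(n, mu.size))
    keep = rng.random((n, mu.size)) >= zero  # zero out with probability `zero`
    return nb * keep

# Toy data: two clusters of 20 cells, 6 genes (assumption).
counts = rng.poisson(3.0, size=(40, 6))
labels = np.repeat([0, 1], 20)
stats = cluster_summaries(counts, labels, epsilon=1.0)
synthetic = np.vstack([sample_cells(*stats[c], n=20) for c in stats])
H_private = gaussian_noise(rng.random((3, 6)), epsilon=0.5)
print(synthetic.shape, H_private.shape)
```

In practice the sensitivities would have to be derived from bounds on the data (for example, clipped counts) for the stated budgets to be meaningful; the constants here only make the sketch runnable.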
All DP modes (all, nmf, kmeans, sampling, none) and the sampling method choice (zinb vs. poisson) are controlled entirely via CLI flags and the config file. The final synthetic datasets are saved as compressed sparse AnnData files whose filenames encode both the experiment name and the DP setting.
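To illustrate how the five DP modes could map onto the three noisy stages, here is a small sketch using `argparse`. The flag names (`--dp-mode`, `--sampling-method`, `--experiment-name`), the reading that each named mode enables only its own stage while `all` enables all three, and the filename pattern are hypothetical; the actual CLI and naming scheme may differ.

```python
import argparse

DP_MODES = ("all", "nmf", "kmeans", "sampling", "none")

def dp_stages(mode):
    """Return which stages add noise for a given DP mode (assumed mapping)."""
    return {
        "nmf": mode in ("all", "nmf"),
        "kmeans": mode in ("all", "kmeans"),
        "sampling": mode in ("all", "sampling"),
    }

def build_parser():
    # Hypothetical flag names, chosen only for this illustration.
    parser = argparse.ArgumentParser(description="DP-NMF synthetic data sketch")
    parser.add_argument("--dp-mode", choices=DP_MODES, default="none")
    parser.add_argument("--sampling-method", choices=("zinb", "poisson"),
                        default="zinb")
    parser.add_argument("--experiment-name", default="demo")
    return parser

def output_filename(experiment_name, dp_mode):
    # Hypothetical pattern; .h5ad is the usual AnnData file extension.
    return f"{experiment_name}_dp-{dp_mode}.h5ad"

args = build_parser().parse_args(
    ["--dp-mode", "kmeans", "--sampling-method", "poisson"]
)
print(dp_stages(args.dp_mode))
print(output_filename(args.experiment_name, args.dp_mode))
```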