Bayesian Denoising¶
This page summarizes the theoretical foundation for Bayesian denoising of single-cell transcriptional profiles in SCRIBE. For the full derivation with all algebraic steps, see the accompanying paper.
Motivation¶
The generative model describes how observed UMI counts arise from true transcript counts via stochastic capture. Once the model is fitted, a natural question arises: given an observed cell profile, what was the most likely "true" transcriptional profile before capture loss and dropout?
This section shows that the denoising problem has a clean, closed-form Bayesian solution that falls directly out of the conjugacy structure underlying the model.
Problem Setup¶
Recall the two-stage generative process for a single gene \(g\) in cell \(c\). The true (unobserved) transcript count follows a negative binomial:
and the observed UMI count arises from binomial sub-sampling:
where \(\nu_c \in (0, 1)\) is the cell-specific capture probability. The denoising problem is to compute \(\pi(m_g \mid u_g, r_g, p, \nu_c)\).
Key Result: Denoised Count Distribution¶
The derivation exploits the Poisson-Gamma representation of the negative binomial. Writing \(m_g = u_g + d_g\) (captured + uncaptured transcripts) and applying the Poisson thinning property gives conditional independence of \(u_g\) and \(d_g\) given the latent Gamma rate \(\lambda_g\). A standard Bayesian update of the Gamma prior with the Poisson likelihood yields the posterior:
Marginalizing over \(\lambda_g\), the number of uncaptured transcripts follows a negative binomial, giving the denoised count distribution:
The denoised count equals the observed count \(u_g\) plus a random correction \(d_g\) drawn from a negative binomial whose shape depends on both the prior dispersion \(r_g\) and the observation \(u_g\).
Point Estimates¶
Posterior mean (shrinkage estimator)¶
The posterior mean of the denoised count has a compact closed form:
This is a shrinkage estimator: the numerator combines the observed count \(u_g\) with a prior correction \(r_g(1 - \nu_c)(1 - p)\) that pulls the estimate toward the prior mean, while the denominator scales by the effective capture probability.
Limiting behavior¶
The posterior mean exhibits sensible behavior in several important limits:
| Limit | Result | Interpretation |
|---|---|---|
| Perfect capture (\(\nu_c = 1\)) | \(\langle m_g \rangle = u_g\) | No denoising needed |
| Zero observation (\(u_g = 0\)) | \(\langle m_g \rangle = \frac{r_g(1-\nu_c)(1-p)}{\nu_c + p(1-\nu_c)}\) | Strictly positive -- prior drives the estimate |
| Weak prior (\(r_g \to 0\)) | \(\langle m_g \rangle \approx \frac{u_g}{\nu_c + p(1-\nu_c)}\) | Naive rescaling by capture probability |
| Strong signal (\(u_g \gg r_g\)) | Same as weak prior | Data dominate |
Posterior variance¶
The posterior variance quantifies denoising uncertainty:
This variance decreases as \(\nu_c \to 1\) (better capture means less uncertainty) and increases with \(u_g\) (larger counts imply a wider range of plausible true counts).
Extension to Zero-Inflated Models¶
For ZINB models, the observed distribution includes a point mass at zero via a technical dropout gate \(g\):
Non-zero observations (\(u_g > 0\))¶
The gate could not have produced a positive count, so the denoising is identical to the pure NB case above.
Zero observations (\(u_g = 0\))¶
When \(u_g = 0\), the zero may have come from either pathway. By Bayes' rule, the posterior probability that the gate was responsible is:
The denoised count distribution is a mixture of two components:
- Gate pathway (weight \(w\)): the cell was expressing the gene normally but technical dropout prevented observation. The denoised count is drawn from the biological prior \(\text{NB}(r_g, p)\).
- NB pathway (weight \(1 - w\)): the zero reflects genuine low expression combined with capture loss. The standard denoising formula applies with \(u_g = 0\).
The gate pathway mean is always larger than the NB pathway mean when \(\nu_c < 1\), reflecting the intuition that dropout-induced zeros correspond to higher expected true expression.
Models without variable capture probability¶
For ZINB or NBDM models without cell-specific capture, set \(\nu_c = 1\). Then \(d_g = 0\) almost surely, so \(m_g = u_g\) for all non-zero observations. The zero-inflation denoising still applies for \(u_g = 0\):
Fully Bayesian Denoising¶
The formulas above condition on point estimates of \((r_g, p, \nu_c, g)\). In a fully Bayesian treatment, uncertainty is propagated by repeating the denoising for each posterior sample:
This produces a full distribution over denoised counts that reflects both the stochasticity of the capture process and the uncertainty in the model parameters.
Cross-Gene Correlations¶
A key insight is that all cross-gene correlation in the denoised counts arises from parameter uncertainty. The law of total covariance gives:
When parameters are fixed at their MAP estimates, denoised counts for different genes are independent. Correlations appear only when accounting for the joint posterior.
Correlated denoising algorithm¶
In practice, cross-gene correlations are captured automatically by the following procedure:
- Draw one joint parameter sample from the posterior
- Denoise each gene independently using the per-gene formulas with the drawn parameters
- Assemble the denoised count matrix
The shared parameter draw in step 1 propagates cross-gene correlations automatically. The quality of these correlations depends on the inference method:
| Inference method | Cross-gene correlations |
|---|---|
| MCMC | Exact (each draw is a full joint sample) |
| SVI with joint low-rank guide | Captured via low-rank variational family |
| SVI with factorized guide | Only through shared global parameters (\(p\), \(\nu_c\)) |
| MAP point estimates | None (genes are independent) |
Preserving correlations in practice
When exporting denoised datasets, set preserve_correlations=True to
ensure all denoised matrices use individual posterior draws rather than
the posterior mean. This preserves cross-gene correlations for downstream
analysis.
Using Denoising in SCRIBE¶
The denoising theory is implemented in the results objects returned by
scribe.fit(). Both SVI and MCMC results provide the same high-level API.
The method parameter¶
The method parameter controls how the denoised posterior is summarized into
a point estimate for each cell-gene entry:
| Method | Description |
|---|---|
"mean" |
Closed-form posterior mean -- the shrinkage estimator derived above. Deterministic (no RNG needed for non-ZINB models). |
"mode" |
Posterior mode (MAP denoised count). The integer floor of the continuous mode formula. Always less than or equal to the mean. |
"sample" |
One stochastic draw from the full denoised posterior distribution. Preserves the shape of the distribution, including its skewness and discreteness. |
For zero-inflated models, zeros require special treatment because they arise
from a mixture of two pathways (gate vs NB). The method parameter accepts
a tuple (general_method, zi_zero_method) for independent control:
general_method-- applied to all non-zero positions and to all positions in non-ZINB modelszi_zero_method-- applied exclusively to zero positions in ZINB models, where the gate/NB mixture posterior must be resolved
A single string "mean" is shorthand for ("mean", "mean").
Recommended: method=("sample", "sample")
For most downstream analyses, sampling is the preferred method. It
generates realistic count matrices that faithfully represent the full
denoised posterior, including the discrete, skewed nature of transcript
counts. For ZINB zeros, "sample" uses the gate weight \(w\) as a
Bernoulli probability: dropout zeros are replaced with draws from the
biological prior \(\text{NB}(r_g, p)\), while genuine NB zeros are kept
at zero. This produces the most faithful representation of what the true
transcriptome likely looked like.
By contrast, "mean" produces smooth, non-integer values that
over-regularize the distribution, and "mode" tends to underestimate
counts (especially for low-expression genes where the mode is zero).
MAP-based denoising (fast, no cross-gene correlations)¶
Uses the MAP point estimates of the parameters. Fast but loses cross-gene correlations:
import jax
rng_key = jax.random.PRNGKey(0)
# SVI results
denoised = results.denoise_counts_map(
counts=adata.X,
method=("sample", "sample"),
rng_key=rng_key,
)
# MCMC results
denoised = results.denoise_counts(
counts=adata.X,
method=("sample", "sample"),
rng_key=rng_key,
)
Full-posterior denoising (recommended, with cross-gene correlations)¶
Uses multiple posterior draws, automatically propagating cross-gene correlations through the shared parameter samples:
# SVI results -- uses multiple posterior draws
denoised = results.denoise_counts_posterior(
counts=adata.X,
method=("sample", "sample"),
rng_key=rng_key,
)
Biological posterior predictive checks¶
Biological PPCs strip technical noise (capture probability, zero-inflation gate) and sample from the base \(\text{NB}(r, p)\), giving a denoised view of the data:
Exporting denoised counts as AnnData¶
# Single denoised dataset
adata_denoised = results.get_denoised_anndata(
counts=adata.X,
rng_key=rng_key,
method=("sample", "sample"),
)
# Multiple datasets preserving cross-gene correlations
adatas = results.get_denoised_anndata(
counts=adata.X,
rng_key=rng_key,
n_datasets=5,
preserve_correlations=True,
method=("sample", "sample"),
path="denoised.h5ad",
)
For the full API details, see the Results Class guide and the API Reference for SVI or API Reference for MCMC.
Next steps
- See the Dirichlet-Multinomial Model for the NB generative model and capture-probability derivation that underpin the denoising result.
- See Beta Negative Binomial for how denoising extends to power-law tail distributions via a one-dimensional integral over the latent Beta mixing variable.
- See the Model Selection guide for practical details on the zero-inflated models.
- See the Results Class guide for the full denoising
API, including
denoise_counts,denoise_counts_posterior, andget_denoised_anndata.