Model Selection¶
SCRIBE provides a family of probabilistic models for scRNA-seq data, all built
on the foundational Negative Binomial (NB) distribution dictated by the
biophysics of transcription and mRNA capture. You choose a likelihood by
setting variable_capture and zero_inflation on scribe.fit(). The
default model is NBVCP (variable capture on). Setting
variable_capture=False gives NBDM; adding zero_inflation=True gives
ZINBVCP (or ZINB when combined with variable_capture=False). The same four variants are still
available as model="nbdm" / "nbvcp" / "zinb" / "zinbvcp" if you prefer a
single string. Optional extensions (BNB overdispersion, mixture components) use
other scribe.fit() arguments.
Default: variable capture (NBVCP)¶
In typical scRNA-seq data, per-cell total UMI counts vary widely across the
experiment. SCRIBE defaults to NBVCP (model="nbvcp"), which absorbs
library-size variation into a cell-specific technical capture channel.
Empirically, we have not yet encountered a dataset that does not benefit
from this extension.
# Default: variable capture is on
results = scribe.fit(adata)
# Add a low-rank guide for gene-gene correlations
results = scribe.fit(adata, guide_rank=64)
The guide_rank parameter adds a low-rank component to the variational
posterior, giving SCRIBE a parameter-efficient way to capture gene-gene
correlations that a mean-field guide would miss. A rank of 64 is a good
initial value (you can always increase the rank or switch to a normalizing flow
guide if you want a more expressive posterior); see Variational Guide
Families for details.
When NBDM is reasonable (variable_capture=False, same as model="nbdm"):
total UMIs per cell are very homogeneous (e.g. max/min total UMI within
roughly a factor of two, after basic QC), so a shared effective capture is a
good approximation.
Zero inflation is often optional
Explicit zero inflation (zero_inflation=True; ZINB / ZINBVCP,
or model="zinb" / "zinbvcp") is not always necessary. Apparent "excess
zeros" frequently arise because low capture cells produce many zeros
across genes. Fitting variable capture first (variable_capture=True;
NBVCP) often explains those zeros without a separate dropout layer. Add
zero inflation only when diagnostics and model comparison show a clear
gain after a good VCP fit.
The model family at a glance¶
graph TD
NB["Negative Binomial<br/><b>model='nbdm'</b> (variable_capture=False)<br/><i>base: r_g, p</i>"]
ZINB["Zero-Inflated NB<br/><b>model='zinb'</b> (zero_inflation=True)<br/><i>+ gate_g</i>"]
NBVCP["NB + Variable Capture<br/><b>model='nbvcp'</b> (default)<br/><i>+ nu_c</i>"]
ZINBVCP["ZINB + Variable Capture<br/><b>model='zinbvcp'</b> (variable_capture=True, zero_inflation=True)<br/><i>+ gate_g, nu_c</i>"]
NB -->|"+ zero inflation"| ZINB
NB -->|"+ variable capture"| NBVCP
ZINB -->|"+ variable capture"| ZINBVCP
NBVCP -->|"+ zero inflation"| ZINBVCP
NB -.->|"+ BNB overdispersion"| BNB["Any model +<br/><b>overdispersion='bnb'</b><br/><i>+ kappa_g</i>"]
NB -.->|"+ mixture"| MIX["Any model +<br/><b>n_components=K</b><br/><i>K sets of gene params</i>"]
Decision guide¶
graph TD
Start["Assess total UMI per cell"] --> Qtot{"Max total UMI / min total UMI<br/>roughly within 2x?"}
Qtot -->|Yes, very tight| Base["NBDM may suffice<br/>model='nbdm' (variable_capture=False)"]
Qtot -->|No, wider spread| VCP0["Use default NBVCP<br/>model='nbvcp' (default)"]
Base --> Qzi{"Still poor fit / excess zeros<br/>after NBDM + PPC?"}
Qzi -->|Try VCP first| VCP0
Qzi -->|Yes, after VCP| ZI["Consider ZINB or ZINBVCP<br/>(zero_inflation=True, ± VCP)"]
VCP0 --> Qzi2{"Excess zeros remain<br/>after good VCP fit?"}
Qzi2 -->|No| Qtail{"Heavy tails remain<br/>in PPC after VCP?"}
Qzi2 -->|Yes| ZINBV["ZINBVCP<br/>model='zinbvcp' (variable_capture=True, zero_inflation=True)"]
ZINBV --> Qtail
Qtail -->|Yes| BNB["Add BNB overdispersion<br/>overdispersion='bnb'"]
Qtail -->|No| Qmix{"Multiple cell<br/>populations?"}
BNB --> Qmix
Qmix -->|Yes| Mix["Add mixture<br/>n_components=K"]
Qmix -->|No| Done["Tune with model comparison"]
Mix --> Done
Use model comparison and goodness-of-fit diagnostics to justify adding ZI, BNB, or extra mixture components.
Variable capture probability (NBVCP)¶
When cells differ in mRNA capture efficiency, the same underlying expression can produce very different total UMIs. The VCP extension introduces a cell-specific capture probability \(\nu^{(c)}\) that modifies the base success probability:
so \(u_g^{(c)} \mid r_g, \hat{p}^{(c)}\) is distributed as \(\text{NB}(r_g, \hat{p}^{(c)})\). Genes are modelled independently given \(\nu^{(c)}\) (no Dirichlet-Multinomial factorization in this likelihood).
For many cells, amortized capture inference scales better:
See also: Theory: Anchoring priors (capture identifiability and priors).
Base: Negative Binomial (NBDM)¶
When total UMIs per cell are homogeneous, a single effective capture shared across cells is often adequate. The NB likelihood for UMIs is
with gene-specific \(r_g\). When \(\hat{p}\) is shared across genes, the joint distribution factorizes into a Negative Binomial for totals and a Dirichlet-Multinomial for compositions---principled normalization without ad-hoc library-size scaling.
See also: Theory: Dirichlet-Multinomial.
Zero inflation (ZINB)¶
Adds a per-gene gate \(\pi_g\) for technical dropout:
Prefer variable_capture=True first when library sizes vary; add
zero_inflation=True (ZINB or ZINBVCP, or model="zinb" / "zinbvcp") only
when the data still need an explicit dropout layer after a strong VCP fit.
Both: ZINBVCP¶
Combines zero inflation and variable capture:
Highest flexibility and cost; use when both mechanisms are supported by diagnostics.
# ZINBVCP; equivalent to model="zinbvcp"
results = scribe.fit(adata, variable_capture=True, zero_inflation=True)
BNB overdispersion¶
Beta Negative Binomial adds heavy tails via a Beta randomization of the NB
success probability and an extra \(\kappa_g\). Requires unconstrained=True.
Fit variable capture first
What looks like heavy tails in the raw data can often be explained by variable capture efficiency: cells with high capture produce counts in the upper tail while low-capture cells pile up near zero, mimicking a heavier-tailed distribution. Fit an NBVCP model first, then check the posterior predictive distribution. Add BNB only when genuine per-gene excess dispersion remains after accounting for capture.
results = scribe.fit(
adata,
overdispersion="bnb",
unconstrained=True,
# Any likelihood: default NBVCP, or e.g. variable_capture=False,
# zero_inflation=True, or both (same as model="nbdm" / "nbvcp" / …).
)
See also: Theory: Beta Negative Binomial.
Mixture components¶
Any of the above supports n_components=K for subpopulations. Gene-specific
parameters (\(r_g\), and \(\pi_g\) if applicable) are usually
component-specific; global \(p\) and cell-specific \(\nu^{(c)}\) stay shared as
in the base construction.
results = scribe.fit(
adata,
variable_capture=True,
n_components=3,
n_steps=150_000,
)
assignments = results.cell_type_assignments(counts=adata.X)
results = scribe.fit(
adata,
zero_inflation=True,
n_components=3,
mixture_params="mean", # only expression-level param varies by component
)
See also: Results class (mixture assignments and components).
Log-Normal model family (PLN / NBLN / LNM / LNMVCP)¶
Beyond the Negative Binomial family above, SCRIBE offers a second model family rooted in the biophysics of gene regulatory networks. When genes interact, the steady-state distribution of log-abundances is a multivariate Gaussian, whose covariance encodes regulatory structure. Several observation models connect this latent Gaussian to observed counts:
graph TD
GRN["Multivariate Gaussian<br/>on log-abundances<br/>(GRN steady state)"]
PLN["Poisson Log-Normal<br/><b>model='pln'</b><br/><i>Poisson on rate</i>"]
NBLN["NB Log-Normal<br/><b>model='nbln'</b><br/><i>NB on rate, per-gene r_g</i>"]
LNM["Logistic-Normal Multinomial<br/><b>model='lnm'</b><br/><i>softmax + Multinomial</i>"]
LNMVCP["LNM + Variable Capture<br/><b>model='lnmvcp'</b><br/><i>+ per-cell NB totals</i>"]
GRN -->|"exp(x_g) -> Poisson"| PLN
GRN -->|"exp(x_g) -> NB(r_g)"| NBLN
GRN -->|"softmax -> compositions"| LNM
LNM -->|"+ capture eta"| LNMVCP
When to use the log-normal family¶
| Criterion | NB family (default) | Log-normal family |
|---|---|---|
| Gene-gene correlations | Captured only in the variational posterior (not generative) | Explicit in the generative model via low-rank \(\Sigma\) |
| Compositional analysis | Dirichlet (pairwise-negative correlations only) | Logistic-normal (arbitrary correlations) |
| Biophysical grounding | Independent-gene exact (NB from bursty promoter) | Interacting-gene approximate (LNA / Lyapunov) |
| Recommended inference | SVI (default) | Laplace (inference_method="laplace") |
Poisson Log-Normal (model="pln")¶
Models each gene's count as Poisson from the exponentiated log-abundance. Total count distribution emerges naturally (no separate NB for totals). Capture enters as an additive log-rate offset.
results = scribe.fit(
adata,
model="pln",
inference_method="laplace",
guide_rank=64,
n_steps=50_000,
)
Best for: Gene-level denoising with full correlation structure, heavy-tailed total counts, log-concave posterior guarantees.
See also: Theory: Poisson Log-Normal
NB Log-Normal (model="nbln")¶
PLN's heavier-tailed sibling: same multivariate-Gaussian prior on log-rates, but replaces the Poisson observation channel with a Negative Binomial that gives each gene an explicit dispersion parameter \(r_g\). Restores bursty-transcription overdispersion at the per-gene level while preserving PLN's log-concave posterior and cross-gene covariance structure. Recommended pipeline is an SVI-cascade + freeze + loadings shrinkage fit:
import numpy as np
# Step 1: NBVCP-SVI cascade source
svi_results = scribe.fit(
adata, model="nbvcp", parameterization="mean_odds",
priors={"capture_efficiency": (np.log(100_000), 0.5)},
inference_method="svi", n_steps=50_000,
)
# Step 2: NBLN-Laplace with cascade freeze + loadings shrinkage
laplace_results = scribe.fit(
adata, model="nbln", inference_method="laplace",
informative_priors_from=svi_results, # Phase-1 soft cascade
informative_priors_freeze=("r", "eta"), # Phase-2 freeze (default)
priors={
"capture_efficiency": (np.log(100_000), 0.5),
"loadings": { # Phase-3 shrinkage
"type": "horseshoe_columnwise", "tau_scale": 1.0,
},
},
latent_dim=16,
n_steps=20_000,
)
# Inspect effective rank + correlation structure
print(laplace_results.w_prior_diagnostics["effective_rank"])
W_perp = laplace_results.get_W_compositional()
Best for: Bursty scRNA-seq data, cross-gene regulatory correlation recovery, cascade-frozen fits with adaptive rank selection.
Key concepts:
- Per-cell rigid-translation gauge: NBLN has a \(C\)-dimensional gauge (one per cell) that the Phase-2 freeze on \(\eta\) pins structurally.
- Loadings shrinkage: adaptive
rank selection via
priors={"loadings": {...}}— lets you keeplatent_dimgenerous. get_W_compositional(): gauge-invariant projection \(\underline{\underline{W}}_\perp\). Use this (not raw \(\underline{\underline{W}}\)) for cross-gene correlation analysis.get_gauge_diagnostics(): quantifies how much of raw \(\underline{\underline{W}}\) is gauge slop. Without loadings shrinkage, clean fits showgauge_contamination_ratio < 0.05. With loadings shrinkage active, ratios of 0.5–0.8 are routine and benign — the prior shrinksW_⟂much faster thanW_∥, so the ratio climbs by design; inspect absolute norms instead. See Loadings Shrinkage.
See also: Theory: NB Log-Normal and Theory: Loadings Shrinkage.
Logistic-Normal Multinomial (model="lnm")¶
Factorizes counts into NB totals and Multinomial compositions drawn from a logistic-normal distribution on the simplex.
results = scribe.fit(
adata,
model="lnm",
inference_method="laplace",
guide_rank=64,
n_steps=50_000,
)
Best for: Compositional analysis where explicit normalization is desired and total-composition independence is acceptable.
See also: Theory: Logistic-Normal Multinomial
LNM with Variable Capture (model="lnmvcp")¶
Adds a per-cell capture latent \(\eta^{(c)}\) that modifies the total count distribution while leaving the composition block unchanged. The composition and capture have a block-diagonal Hessian, so they decouple cleanly during Newton iteration.
results = scribe.fit(
adata,
model="lnmvcp",
inference_method="laplace",
guide_rank=64,
n_steps=50_000,
)
Best for: Compositional analysis with heterogeneous library sizes; avoids encoder collapse on the capture latent.
Two-state promoter (Poisson-Beta)¶
For genes that the NB family cannot fit — typically bursty / bimodal genes with simultaneous excess zeros AND a heavy right tail, or a literal bimodal count histogram — the two-state promoter likelihood replaces the NB with a Poisson-Beta compound:
with \(p_g^{(c)}\) independent per (gene, cell). The NB is recovered as a limiting case at large \(k^-\), so the two-state model nests inside the NB family rather than competing with it. Use it when posterior predictive checks of the NB family show a systematic mismatch you cannot fix with NB parameter changes (excess zeros next to a heavy right tail, or a literal bimodal histogram).
# Bursty genes with variable capture
results = scribe.fit(
adata,
model="twostatevcp",
parameterization="natural",
unconstrained=True,
inference_method="svi",
)
The TwoState family ships with four parameterizations of the gene-level
shape. All four sample mu and two additional per-gene parameters; all four
are mean-preserving by construction; mean_fano and moment_delta
additionally preserve the Fano factor:
| Parameterization | Sampled extras | When to choose |
|---|---|---|
two_state_natural |
burst_size, k_off |
Biophysical interpretation; NUTS |
two_state_ratio |
burst_size, switching_ratio (\(k^-/k^+\)) |
Mean-field SVI across widely varying \(\mu\) |
two_state_mean_fano |
excess_fano (\(F\)), concentration (\(\kappa\)) |
When PPC bands are systematically wider than the observed variance |
two_state_moment_delta |
excess_fano, inv_concentration (\(\delta = 1/(\kappa+1)\)) |
When \(\kappa\) posterior tracks its prior under mean_fano |
See Two-state promoter theory for the full math and a decision guide.
Mixture models (n_components=K) are fully supported for all four TwoState
parameterizations, with the same API as the NB family. Biology-informed capture
priors and the biological PPC sampler are also supported — both rely on the
closure-under-binomial-thinning property that makes the capture factor
likelihood-agnostic.
Current limitations: VAE inference, multi-dataset indexing, BNB overdispersion, and the Poisson-Gamma denoiser are not yet wired for TwoState. Build-time validation rejects these combinations with a clear error.
Comparison table¶
| Model | Zero Inflated | Variable Capture | BNB | Mixture | Best For |
|---|---|---|---|---|---|
"nbdm" (variable_capture=False) |
-- | -- | opt. | opt. | Tight total-UMI distribution (~within 2x) |
"nbvcp" (default) |
-- | Yes | opt. | opt. | Typical data; heterogeneous library sizes |
"zinb" (zero_inflation=True) |
Yes | -- | opt. | opt. | Excess zeros after VCP ruled out / no VCP |
"zinbvcp" (variable_capture=True, zero_inflation=True) |
Yes | Yes | opt. | opt. | Strong evidence for both ZI and VCP |
"pln" |
-- | log-offset | -- | -- | Gene-level correlation recovery, heavy tails |
"nbln" |
-- | log-offset | -- | -- | Bursty cross-gene correlations, cascade-frozen fits |
"lnm" |
-- | -- | -- | -- | Compositional analysis, arbitrary correlations |
"lnmvcp" |
-- | per-cell η | -- | -- | Compositional + heterogeneous library sizes |
"twostate" |
-- | -- | -- | opt. | Bursty / bimodal genes the NB cannot fit (no capture) |
"twostatevcp" |
-- | per-cell ν | -- | opt. | Bursty / bimodal genes with variable library sizes |
"opt." = add overdispersion="bnb" or n_components=K.
Parameterizations¶
NB family (nbdm / zinb / nbvcp / zinbvcp)¶
Each NB-family model can be parameterized in three ways (how the NB
parameters are represented internally). SCRIBE names them canonical,
mean probs, and mean odds; the parameterization= string uses the
codes below (aliases in parentheses).
| Name | parameterization= |
Samples | Derives | Best For |
|---|---|---|---|---|
| Canonical | "canonical" (alias "standard") |
\(p, r\) | -- | Direct interpretation |
| Mean probs | "mean_prob" (alias "linked") |
\(p, \mu\) | \(r = \mu(1-p)/p\) | Couples mean and success probability |
| Mean odds | "mean_odds" (alias "odds_ratio") |
\(\phi, \mu\) | \(p = 1/(1+\phi)\), \(r = \mu\phi\) | Stable when \(p\) is near 1 |
results = scribe.fit(
adata,
variable_capture=True,
parameterization="mean_prob",
)
# equivalent: parameterization="linked"
TwoState family (twostate / twostatevcp)¶
The TwoState family has its own four parameterizations of the gene-level
shape. All four sample mu and two additional per-gene parameters; all
four are mean-preserving by construction. Each successive variant fixes a
distinct geometric pathology of mean-field variational inference:
| Name | parameterization= |
Aliases | Samples |
|---|---|---|---|
| Natural | "two_state_natural" |
natural |
\(\mu, b, k^-\) |
| Ratio | "two_state_ratio" |
ratio |
\(\mu, b, s = k^-/k^+\) |
| Mean-Fano | "two_state_mean_fano" |
mean_fano, fano |
\(\mu, F, \kappa\) |
| Moment-delta | "two_state_moment_delta" |
moment_delta, delta |
\(\mu, F, \delta = 1/(\kappa+1) \in (0, 1)\) |
The natural variant is the physics-natural choice and is recommended for NUTS or for biophysical interpretation. For mean-field SVI across many genes with widely varying expression, the ratio variant decouples the regime axis from gene magnitude. When posterior-predictive variance is the visible failure mode, mean_fano samples the Fano factor directly, which bounds the PPC width by construction. Moment-delta additionally maps the unbounded NB-limit ridge to a bounded shape coordinate \(\delta \in (0, 1)\) so the variational guide doesn't waste mass on arbitrarily-large concentration values. See Two-state promoter theory for the math.
For a complete mapping of every parameter name to its symbol, domain, and equation context, see the Parameter Reference.
Constrained (default) vs unconstrained (Normal + transforms; required for hierarchical priors and BNB):
Hierarchical priors¶
With unconstrained=True, you can use hierarchical priors on gene-specific
parameters (\(\mu\), \(p\), gate, overdispersion):
results = scribe.fit(
adata,
unconstrained=True,
expression_prior="horseshoe",
prob_prior="gaussian",
)
See also: Theory: Hierarchical priors.
Performance considerations¶
Computational cost¶
All models are \(O(N \times G)\) per step. VCP adds cell-level structure; mixtures scale with \(K\).
Typical SVI step counts¶
| Model | Canonical / mean probs | Mean odds | Unconstrained |
|---|---|---|---|
| NBDM, ZINB | 50k--100k | 25k--50k | 100k--200k |
| NBVCP, ZINBVCP | 100k--150k | 50k--100k | 150k--300k |
| Mixture | 150k--300k | 100k--200k | 300k--500k |
Validate choices with model comparison and the Theory section for full derivations.