Model Selection¶
SCRIBE provides a family of probabilistic models for scRNA-seq data, all built
on the foundational Negative Binomial (NB) distribution dictated by the
biophysics of transcription and mRNA capture. You choose a likelihood by
setting variable_capture and zero_inflation on scribe.fit(). The
default model is NBVCP (variable capture on). Setting
variable_capture=False gives NBDM; adding zero_inflation=True gives
ZINBVCP (or ZINB when combined with variable_capture=False). The same four variants are still
available as model="nbdm" / "nbvcp" / "zinb" / "zinbvcp" if you prefer a
single string. Optional extensions (BNB overdispersion, mixture components) use
other scribe.fit() arguments.
Default: variable capture (NBVCP)¶
In typical scRNA-seq data, per-cell total UMI counts vary widely across the
experiment. SCRIBE defaults to NBVCP (model="nbvcp"), which absorbs
library-size variation into a cell-specific technical capture channel.
Empirically, we have not yet encountered a dataset that does not benefit
from this extension.
# Default: variable capture is on
results = scribe.fit(adata)
# Add a low-rank guide for gene-gene correlations
results = scribe.fit(adata, guide_rank=64)
The guide_rank parameter adds a low-rank component to the variational
posterior, giving SCRIBE a parameter-efficient way to capture gene-gene
correlations that a mean-field guide would miss. A rank of 64 is a good
initial value (you can always increase the rank or switch to a normalizing flow
guide if you want a more expressive posterior); see Variational Guide
Families for details.
When NBDM is reasonable (variable_capture=False, same as model="nbdm"):
total UMIs per cell are very homogeneous (e.g. max/min total UMI within
roughly a factor of two, after basic QC), so a shared effective capture is a
good approximation.
Zero inflation is often optional
Explicit zero inflation (zero_inflation=True; ZINB / ZINBVCP,
or model="zinb" / "zinbvcp") is not always necessary. Apparent "excess
zeros" frequently arise because low capture cells produce many zeros
across genes. Fitting variable capture first (variable_capture=True;
NBVCP) often explains those zeros without a separate dropout layer. Add
zero inflation only when diagnostics and model comparison show a clear
gain after a good VCP fit.
The model family at a glance¶
graph TD
NB["Negative Binomial<br/><b>model='nbdm'</b> (variable_capture=False)<br/><i>base: r_g, p</i>"]
ZINB["Zero-Inflated NB<br/><b>model='zinb'</b> (zero_inflation=True)<br/><i>+ gate_g</i>"]
NBVCP["NB + Variable Capture<br/><b>model='nbvcp'</b> (default)<br/><i>+ nu_c</i>"]
ZINBVCP["ZINB + Variable Capture<br/><b>model='zinbvcp'</b> (variable_capture=True, zero_inflation=True)<br/><i>+ gate_g, nu_c</i>"]
NB -->|"+ zero inflation"| ZINB
NB -->|"+ variable capture"| NBVCP
ZINB -->|"+ variable capture"| ZINBVCP
NBVCP -->|"+ zero inflation"| ZINBVCP
NB -.->|"+ BNB overdispersion"| BNB["Any model +<br/><b>overdispersion='bnb'</b><br/><i>+ kappa_g</i>"]
NB -.->|"+ mixture"| MIX["Any model +<br/><b>n_components=K</b><br/><i>K sets of gene params</i>"]
Decision guide¶
graph TD
Start["Assess total UMI per cell"] --> Qtot{"Max total UMI / min total UMI<br/>roughly within 2x?"}
Qtot -->|Yes, very tight| Base["NBDM may suffice<br/>model='nbdm' (variable_capture=False)"]
Qtot -->|No, wider spread| VCP0["Use default NBVCP<br/>model='nbvcp' (default)"]
Base --> Qzi{"Still poor fit / excess zeros<br/>after NBDM + PPC?"}
Qzi -->|Try VCP first| VCP0
Qzi -->|Yes, after VCP| ZI["Consider ZINB or ZINBVCP<br/>(zero_inflation=True, ± VCP)"]
VCP0 --> Qzi2{"Excess zeros remain<br/>after good VCP fit?"}
Qzi2 -->|No| Qtail{"Heavy tails remain<br/>in PPC after VCP?"}
Qzi2 -->|Yes| ZINBV["ZINBVCP<br/>model='zinbvcp' (variable_capture=True, zero_inflation=True)"]
ZINBV --> Qtail
Qtail -->|Yes| BNB["Add BNB overdispersion<br/>overdispersion='bnb'"]
Qtail -->|No| Qmix{"Multiple cell<br/>populations?"}
BNB --> Qmix
Qmix -->|Yes| Mix["Add mixture<br/>n_components=K"]
Qmix -->|No| Done["Tune with model comparison"]
Mix --> Done
Use model comparison and goodness-of-fit diagnostics to justify adding ZI, BNB, or extra mixture components.
Variable capture probability (NBVCP)¶
When cells differ in mRNA capture efficiency, the same underlying expression can produce very different total UMIs. The VCP extension introduces a cell-specific capture probability \(\nu^{(c)}\) that modifies the base success probability:
so \(u_g^{(c)} \mid r_g, \hat{p}^{(c)}\) is distributed as \(\text{NB}(r_g, \hat{p}^{(c)})\). Genes are modelled independently given \(\nu^{(c)}\) (no Dirichlet-Multinomial factorization in this likelihood).
For many cells, amortized capture inference scales better:
See also: Theory: Anchoring priors (capture identifiability and priors).
Base: Negative Binomial (NBDM)¶
When total UMIs per cell are homogeneous, a single effective capture shared across cells is often adequate. The NB likelihood for UMIs is
with gene-specific \(r_g\). When \(\hat{p}\) is shared across genes, the joint distribution factorizes into a Negative Binomial for totals and a Dirichlet-Multinomial for compositions---principled normalization without ad-hoc library-size scaling.
See also: Theory: Dirichlet-Multinomial.
Zero inflation (ZINB)¶
Adds a per-gene gate \(\pi_g\) for technical dropout:
Prefer variable_capture=True first when library sizes vary; add
zero_inflation=True (ZINB or ZINBVCP, or model="zinb" / "zinbvcp") only
when the data still need an explicit dropout layer after a strong VCP fit.
Both: ZINBVCP¶
Combines zero inflation and variable capture:
Highest flexibility and cost; use when both mechanisms are supported by diagnostics.
# ZINBVCP; equivalent to model="zinbvcp"
results = scribe.fit(adata, variable_capture=True, zero_inflation=True)
BNB overdispersion¶
Beta Negative Binomial adds heavy tails via a Beta randomization of the NB
success probability and an extra \(\kappa_g\). Requires unconstrained=True.
Fit variable capture first
What looks like heavy tails in the raw data can often be explained by variable capture efficiency: cells with high capture produce counts in the upper tail while low-capture cells pile up near zero, mimicking a heavier-tailed distribution. Fit an NBVCP model first, then check the posterior predictive distribution. Add BNB only when genuine per-gene excess dispersion remains after accounting for capture.
results = scribe.fit(
adata,
overdispersion="bnb",
unconstrained=True,
# Any likelihood: default NBVCP, or e.g. variable_capture=False,
# zero_inflation=True, or both (same as model="nbdm" / "nbvcp" / …).
)
See also: Theory: Beta Negative Binomial.
Mixture components¶
Any of the above supports n_components=K for subpopulations. Gene-specific
parameters (\(r_g\), and \(\pi_g\) if applicable) are usually
component-specific; global \(p\) and cell-specific \(\nu^{(c)}\) stay shared as
in the base construction.
results = scribe.fit(
adata,
variable_capture=True,
n_components=3,
n_steps=150_000,
)
assignments = results.cell_type_assignments(counts=adata.X)
results = scribe.fit(
adata,
zero_inflation=True,
n_components=3,
mixture_params="mean", # only expression-level param varies by component
)
See also: Results class (mixture assignments and components).
Comparison table¶
| Model | Zero Inflated | Variable Capture | BNB | Mixture | Best For |
|---|---|---|---|---|---|
"nbdm" (variable_capture=False) |
-- | -- | opt. | opt. | Tight total-UMI distribution (~within 2x) |
"nbvcp" (default) |
-- | Yes | opt. | opt. | Typical data; heterogeneous library sizes |
"zinb" (zero_inflation=True) |
Yes | -- | opt. | opt. | Excess zeros after VCP ruled out / no VCP |
"zinbvcp" (variable_capture=True, zero_inflation=True) |
Yes | Yes | opt. | opt. | Strong evidence for both ZI and VCP |
"opt." = add overdispersion="bnb" or n_components=K.
Parameterizations¶
Each model can be parameterized in three ways (how the NB parameters are
represented internally). SCRIBE names them canonical, mean probs, and
mean odds; the parameterization= string uses the codes below (aliases in
parentheses).
| Name | parameterization= |
Samples | Derives | Best For |
|---|---|---|---|---|
| Canonical | "canonical" (alias "standard") |
\(p, r\) | -- | Direct interpretation |
| Mean probs | "mean_prob" (alias "linked") |
\(p, \mu\) | \(r = \mu(1-p)/p\) | Couples mean and success probability |
| Mean odds | "mean_odds" (alias "odds_ratio") |
\(\phi, \mu\) | \(p = 1/(1+\phi)\), \(r = \mu\phi\) | Stable when \(p\) is near 1 |
results = scribe.fit(
adata,
variable_capture=True,
parameterization="mean_prob",
)
# equivalent: parameterization="linked"
For a complete mapping of every parameter name to its symbol, domain, and equation context, see the Parameter Reference.
Constrained (default) vs unconstrained (Normal + transforms; required for hierarchical priors and BNB):
Hierarchical priors¶
With unconstrained=True, you can use hierarchical priors on gene-specific
parameters (\(\mu\), \(p\), gate, overdispersion):
results = scribe.fit(
adata,
unconstrained=True,
expression_prior="horseshoe",
prob_prior="gaussian",
)
See also: Theory: Hierarchical priors.
Performance considerations¶
Computational cost¶
All models are \(O(N \times G)\) per step. VCP adds cell-level structure; mixtures scale with \(K\).
Typical SVI step counts¶
| Model | Canonical / mean probs | Mean odds | Unconstrained |
|---|---|---|---|
| NBDM, ZINB | 50k--100k | 25k--50k | 100k--200k |
| NBVCP, ZINBVCP | 100k--150k | 50k--100k | 150k--300k |
| Mixture | 150k--300k | 100k--200k | 300k--500k |
Validate choices with model comparison and the Theory section for full derivations.