Bayesian Model Comparison¶
SCRIBE offers several model families---NBDM, hierarchical, ZINB, VCP, mixtures---and a key practical question is: which model should I use for my data? Rather than relying on heuristics or fixed recipes, SCRIBE provides a principled Bayesian framework that ranks models by their out-of-sample predictive accuracy, quantifies the uncertainty in that ranking, and can even construct optimal model ensembles.
This page develops the theory behind WAIC, PSIS-LOO, pairwise comparison, and model stacking. For API usage and code examples, see the Model Comparison guide.
The central quantity: expected log predictive density¶
All model comparison in SCRIBE revolves around the expected log predictive density (elpd), which measures how well a fitted model predicts new, unseen data drawn from the true data-generating process \(p_t\):

$$
\text{elpd} = \sum_{i=1}^{n} \int p_t(\tilde{y}_i)\, \log p(\tilde{y}_i \mid y)\, d\tilde{y}_i,
$$

where the posterior predictive distribution averages the likelihood over the posterior:

$$
p(\tilde{y}_i \mid y) = \int p(\tilde{y}_i \mid \theta)\, p(\theta \mid y)\, d\theta.
$$
Since \(p_t\) is unknown, we estimate elpd from the observed data.
Leave-one-out cross-validation¶
The gold standard is leave-one-out cross-validation (LOO-CV):

$$
\text{elpd}_{\text{loo}} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i}),
$$

where \(y_{-i}\) is the dataset with observation \(i\) removed. Each term is the log predictive density for the held-out observation:

$$
p(y_i \mid y_{-i}) = \int p(y_i \mid \theta)\, p(\theta \mid y_{-i})\, d\theta.
$$
Exact LOO-CV requires \(n\) separate model fits---prohibitive for single-cell datasets with thousands of cells. SCRIBE therefore uses two efficient approximations that compute elpd from a single posterior inference.
WAIC: analytical LOO approximation¶
The Widely Applicable Information Criterion (WAIC; Watanabe, 2010) is a fast, fully analytical approximation to LOO-CV that requires only the posterior samples already available after fitting.
Log pointwise predictive density¶
The log pointwise predictive density (lppd) sums, over observations, the log of each data point's predictive probability averaged over the posterior:

$$
\text{lppd} = \sum_{i=1}^{n} \log \left( \frac{1}{S} \sum_{s=1}^{S} p(y_i \mid \theta^{(s)}) \right),
$$

where \(\{\theta^{(s)}\}_{s=1}^{S}\) are the posterior samples. In practice, the inner sum is evaluated with log-sum-exp for numerical stability.
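The lppd computation can be sketched in a few lines of NumPy/SciPy (a minimal illustration, not SCRIBE's internal code; the log-likelihood matrix here is synthetic):

```python
import numpy as np
from scipy.special import logsumexp

# Synthetic (S x n) matrix of pointwise log-likelihoods log p(y_i | theta^(s))
rng = np.random.default_rng(0)
log_lik = rng.normal(loc=-3.0, scale=0.5, size=(1000, 50))  # S=1000 draws, n=50 obs

# lppd_i = log( (1/S) * sum_s exp(log_lik[s, i]) ), evaluated stably
S = log_lik.shape[0]
lppd_i = logsumexp(log_lik, axis=0) - np.log(S)
lppd = lppd_i.sum()
```

The `logsumexp` form avoids underflow when individual log-likelihoods are very negative, which is routine for high-dimensional count data.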
Effective number of parameters¶
WAIC penalizes model complexity through an overfitting correction \(p_{\text{waic}}\) that estimates the effective number of parameters. The preferred version (Gelman et al., 2014) uses the variance of the log-likelihood across posterior draws:

$$
p_{\text{waic}} = \sum_{i=1}^{n} \widehat{\text{Var}}_{s}\!\left( \log p(y_i \mid \theta^{(s)}) \right).
$$
Each term measures how much the posterior disagrees about observation \(i\): high variance means the model is "spending" parameters on that data point.
The WAIC criterion¶
$$
\text{WAIC} = -2 \left( \text{lppd} - p_{\text{waic}} \right).
$$

Lower WAIC is better. The \(-2\) scaling places WAIC on the same deviance scale as AIC for easy comparison.
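The full WAIC computation reduces to a few array operations on the pointwise log-likelihood matrix (a self-contained sketch with synthetic data; SCRIBE's `waic` is JAX-accelerated, but the arithmetic is identical):

```python
import numpy as np
from scipy.special import logsumexp

# Synthetic (S x n) pointwise log-likelihood matrix
rng = np.random.default_rng(1)
log_lik = rng.normal(loc=-3.0, scale=0.5, size=(2000, 100))

S = log_lik.shape[0]
lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(S))   # in-sample predictive fit
p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))        # effective number of parameters
waic = -2.0 * (lppd - p_waic)                           # deviance scale, lower is better
```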
When WAIC can mislead
WAIC is asymptotically equivalent to LOO-CV under mild regularity conditions. However, it can be unreliable when the posterior has heavy tails or is dominated by a few influential observations---precisely the situation flagged by PSIS-LOO's built-in diagnostic.
PSIS-LOO: the recommended criterion¶
Pareto-Smoothed Importance Sampling LOO (PSIS-LOO; Vehtari et al., 2017) is more robust than WAIC and comes with a per-observation reliability diagnostic.
Importance sampling for LOO¶
The key insight is that the LOO posterior \(p(\theta \mid y_{-i})\) can be obtained from the full-data posterior via importance weighting:

$$
p(\theta \mid y_{-i}) \propto \frac{p(\theta \mid y)}{p(y_i \mid \theta)}.
$$

This gives raw importance weights \(w_i^{(s)} = 1 / p(y_i \mid \theta^{(s)})\) that reweight the full posterior to approximate the LOO posterior, without refitting the model.
The heavy-tail problem¶
Some observations are so influential that the raw weights are extremely variable, making the IS estimator unreliable. PSIS stabilizes this by fitting a generalized Pareto distribution (GPD) to the upper tail of the weight distribution and replacing the noisy empirical tail with smooth GPD quantiles.
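The smoothing idea can be illustrated with SciPy's generalized Pareto distribution (an illustrative sketch only, not the exact PSIS algorithm of Vehtari et al., which uses a specialized GPD estimator; the weights and tail-size rule here are stand-ins):

```python
import numpy as np
from scipy import stats

# Heavy-tailed stand-in for a set of raw importance weights
rng = np.random.default_rng(2)
raw_weights = rng.pareto(a=3.0, size=4000) + 1.0

# Take the top-M weights as the tail (roughly min(20% of S, 3*sqrt(S)))
m = int(np.ceil(min(0.2 * raw_weights.size, 3 * np.sqrt(raw_weights.size))))
sorted_w = np.sort(raw_weights)
tail, threshold = sorted_w[-m:], sorted_w[-m - 1]

# Fit a GPD to the exceedances; the shape parameter plays the role of k-hat
k_hat, _, scale = stats.genpareto.fit(tail - threshold, floc=0.0)

# Replace the noisy empirical tail with smooth quantiles of the fitted GPD
probs = (np.arange(1, m + 1) - 0.5) / m
smoothed_tail = threshold + stats.genpareto.ppf(probs, k_hat, loc=0.0, scale=scale)
```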
The Pareto k diagnostic¶
The GPD shape parameter \(\hat{k}\) directly measures how heavy the tail is for each observation:
| \(\hat{k}\) range | Interpretation |
|---|---|
| < 0.5 | Excellent --- reliable estimate |
| 0.5 -- 0.7 | Acceptable --- usable but worth monitoring |
| >= 0.7 | Problematic --- IS approximation unreliable |
In single-cell data, high \(\hat{k}\) values typically correspond to cells with unusually high or low total UMI counts, or rare cell types that are highly influential for the posterior.
PSIS-LOO elpd¶
After Pareto smoothing the weights, the LOO elpd is:

$$
\text{elpd}_{\text{psis-loo}} = \sum_{i=1}^{n} \log \left( \frac{\sum_{s=1}^{S} \tilde{w}_i^{(s)}\, p(y_i \mid \theta^{(s)})}{\sum_{s=1}^{S} \tilde{w}_i^{(s)}} \right),
$$

where \(\tilde{w}_i^{(s)}\) are the smoothed weights, with associated criterion \(\text{LOO-IC} = -2\,\text{elpd}_{\text{psis-loo}}\).
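The estimator itself is a weighted log-mean over posterior draws. The sketch below uses simple weight truncation in place of full Pareto smoothing (a deliberate simplification; SCRIBE's `compute_psis_loo` performs the complete PSIS procedure with per-observation \(\hat{k}\) diagnostics):

```python
import numpy as np
from scipy.special import logsumexp

# Synthetic (S x n) pointwise log-likelihood matrix
rng = np.random.default_rng(3)
log_lik = rng.normal(loc=-3.0, scale=0.5, size=(2000, 100))

S = log_lik.shape[0]
log_w = -log_lik                                  # log raw weights 1 / p(y_i | theta^(s))
log_w -= logsumexp(log_w, axis=0)                 # normalize per observation

# Truncate each column's weights at sqrt(S) times the mean weight
# (truncated IS; PSIS would replace the tail with GPD quantiles instead)
log_cap = -0.5 * np.log(S)
log_w = np.minimum(log_w, log_cap)

# elpd_i = log( sum_s w_i^(s) p(y_i | theta^(s)) / sum_s w_i^(s) )
elpd_i = logsumexp(log_w + log_lik, axis=0) - logsumexp(log_w, axis=0)
elpd_loo = elpd_i.sum()
```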
Default recommendation
Use PSIS-LOO by default. It achieves lower bias than WAIC for realistic posterior geometries and provides per-observation diagnostics. Reserve WAIC for quick exploratory comparisons or for very large datasets where PSIS-LOO is too slow.
Pairwise comparison with uncertainty¶
Given two models \(M_A\) and \(M_B\) evaluated on the same data, define the pointwise elpd difference:

$$
d_i = l_{i,A} - l_{i,B},
$$

where \(l_{i,\cdot}\) is the pointwise LOO log predictive density. The total difference and its standard error are:

$$
\Delta\text{elpd}_{AB} = \sum_{i=1}^{n} d_i,
\qquad
\widehat{\text{SE}} = \sqrt{n\, \widehat{\text{Var}}(d_i)}.
$$

The ratio \(z_{AB} = \Delta\text{elpd}_{AB}\, /\, \widehat{\text{SE}}\) gives a scale-free signal-to-noise measure:
| \(\lvert z_{AB} \rvert\) | Interpretation |
|---|---|
| Much greater than 1 | Strong evidence for one model |
| Close to 1 | Difference is not practically meaningful |
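The pairwise comparison is a short computation once the two pointwise LOO elpd vectors are in hand (variable names here are illustrative, not SCRIBE's API; the vectors are synthetic):

```python
import numpy as np

# One pointwise LOO elpd entry per cell, for each model
rng = np.random.default_rng(4)
n = 500
elpd_i_A = rng.normal(-3.0, 0.5, size=n)
elpd_i_B = elpd_i_A - rng.normal(0.05, 0.2, size=n)   # model A slightly better

d = elpd_i_A - elpd_i_B          # pointwise differences d_i
delta_elpd = d.sum()             # total difference
se = np.sqrt(n * d.var(ddof=1))  # standard error of the sum
z = delta_elpd / se              # signal-to-noise ratio
```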
Not a p-value
The z-score is a descriptive signal-to-noise ratio, not a frequentist test statistic. It does not have a calibrated Type I error interpretation.
Gene-level comparison¶
Beyond the cell-level ranking, SCRIBE computes per-gene elpd differences by evaluating the gene-level log-likelihood:

$$
\ell_g^{(s)} = \sum_{c=1}^{C} \log p(u_{gc} \mid \theta^{(s)}),
$$

which yields an \(S \times G\) matrix with genes, rather than cells, playing the role of observations.
Applying WAIC or PSIS-LOO to these gene-level log-likelihoods reveals which genes benefit most from one model's added flexibility. For example, when comparing NBDM to the hierarchical model, genes with highly variable total counts across cells typically show the largest improvement.
Model stacking¶
Instead of choosing one winner, model stacking (Yao et al., 2018) constructs an optimal predictive ensemble. The stacking weights \(\underline{w}^* \in \Delta^{K-1}\) maximize the LOO log-score of the mixture:

$$
\underline{w}^* = \arg\max_{\underline{w} \in \Delta^{K-1}} \sum_{i=1}^{n} \log \sum_{k=1}^{K} w_k\, p(y_i \mid y_{-i}, M_k).
$$
The objective is concave (log of a linear function), so the unique optimum is found efficiently by standard convex solvers. Stacking weights differ from Bayesian model averaging: they optimize predictive performance on held-out data, making them more robust when models are misspecified or structurally similar.
For comparison, SCRIBE also provides pseudo-BMA weights:

$$
w_k = \frac{\exp\!\left(\text{elpd}_k\right)}{\sum_{k'=1}^{K} \exp\!\left(\text{elpd}_{k'}\right)},
$$

which mimic the classical AIC weight formula.
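Both weight schemes can be sketched from the per-model pointwise LOO elpd values (a minimal illustration with synthetic inputs; SCRIBE's `compute_stacking_weights` and `pseudo_bma_weights` solve the same problems internally):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

# Synthetic (n x K) matrix of pointwise LOO elpd values l_{i,k}
rng = np.random.default_rng(5)
n, K = 400, 3
elpd_pointwise = rng.normal(-3.0, 0.5, size=(n, K))

def neg_log_score(alpha):
    # Softmax keeps w on the simplex, turning the constrained concave
    # maximization into an unconstrained smooth minimization
    w = softmax(alpha)
    # -sum_i log sum_k w_k * exp(l_{ik}), computed stably
    return -np.sum(logsumexp(elpd_pointwise + np.log(w), axis=1))

res = minimize(neg_log_score, x0=np.zeros(K), method="BFGS")
w_stack = softmax(res.x)                     # stacking weights

# Pseudo-BMA: weights proportional to exp(elpd_k), normalized
elpd_k = elpd_pointwise.sum(axis=0)
w_pbma = softmax(elpd_k - elpd_k.max())
```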
Application to SCRIBE's model families¶
Pointwise log-likelihood¶
For the standard NBDM model with shared success probability \(\hat{p}\), the per-cell log-likelihood under posterior sample \(s\) factorizes into a total-count term and a composition term:

$$
\ell_c^{(s)} = \log \text{NegBinom}\!\left( u_{T,c} \mid r_T^{(s)}, \hat{p}^{(s)} \right)
+ \log \text{DirMult}\!\left( \underline{u}_c \mid u_{T,c}, \underline{r}^{(s)} \right),
$$

where \(u_{T,c} = \sum_g u_{gc}\) is the total UMI count and \(r_T^{(s)} = \sum_g r_g^{(s)}\).
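The factorized computation for a single posterior sample can be sketched as follows (parameterization conventions and names here are illustrative, not SCRIBE's internals; check `log_likelihood` for the exact conventions):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import nbinom

# Synthetic parameters and counts for one posterior sample
rng = np.random.default_rng(6)
G, C = 20, 5
r = rng.gamma(2.0, 1.0, size=G)        # gene-specific dispersions r_g
p = 0.3                                 # shared success probability p-hat
u = rng.poisson(5.0, size=(G, C))       # G x C count matrix

r_T = r.sum()
u_T = u.sum(axis=0)                     # total UMI count per cell

def dirmult_logpmf(counts, alpha):
    # Dirichlet-multinomial log-pmf via log-gamma identities
    n, A = counts.sum(), alpha.sum()
    return (gammaln(n + 1) + gammaln(A) - gammaln(n + A)
            + np.sum(gammaln(counts + alpha) - gammaln(alpha) - gammaln(counts + 1)))

# Total-count term (NB on u_T,c) plus composition term (DM on the gene split)
ell = np.array([
    nbinom.logpmf(u_T[c], r_T, p) + dirmult_logpmf(u[:, c], r)
    for c in range(C)
])
```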
For the hierarchical model with gene-specific \(p_g\), the NB-DM factorization breaks down (see the Hierarchical p theory page), and the per-cell log-likelihood becomes a sum over genes:

$$
\ell_c^{(s)} = \sum_{g=1}^{G} \log \text{NegBinom}\!\left( u_{gc} \mid r_g^{(s)}, p_g^{(s)} \right).
$$
Both are computed automatically by SCRIBE's log_likelihood method and stored as an \(S \times C\) matrix for downstream comparison.
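Building the \(S \times C\) matrix for the gene-sum case is a single broadcasted operation (a sketch with synthetic, illustrative parameters, not SCRIBE's implementation):

```python
import numpy as np
from scipy.stats import nbinom

# Synthetic posterior draws and counts
rng = np.random.default_rng(7)
S_draws, G, C = 50, 20, 5
r = rng.gamma(2.0, 1.0, size=(S_draws, G))   # r_g^(s)
p = rng.beta(2.0, 5.0, size=(S_draws, G))    # gene-specific p_g^(s)
u = rng.poisson(5.0, size=(G, C))            # G x C count matrix

# Broadcast to (S, G, C), then sum per-gene NB log-pmfs over the gene axis
log_lik = nbinom.logpmf(u[None, :, :], r[:, :, None], p[:, :, None]).sum(axis=1)
```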
What PSIS-LOO detects¶
Comparing NBDM to the hierarchical model with PSIS-LOO answers: does allowing gene-specific \(p_g\) improve out-of-sample predictions at the cell level? A positive \(\Delta\text{elpd}\) with \(|z| > 2\) indicates meaningful improvement. Conversely, when the posterior of the hierarchical model concentrates \(\sigma_p\) near zero, both models predict equally well.
Summary of key quantities¶
| Quantity | Formula | What it measures |
|---|---|---|
| lppd | \(\sum_i \log \frac{1}{S}\sum_s \exp(\ell_i^{(s)})\) | In-sample predictive fit |
| \(p_{\text{waic}}\) | \(\sum_i \widehat{\text{Var}}_s(\ell_i^{(s)})\) | Effective number of parameters |
| WAIC | \(-2(\text{lppd} - p_{\text{waic}})\) | Penalized predictive fit (lower is better) |
| elpd (PSIS-LOO) | \(\sum_i \log \hat{p}_{\text{psis}}(y_i \mid y_{-i})\) | Out-of-sample predictive fit |
| \(\hat{k}_i\) | GPD shape on IS weights | Per-observation reliability |
| \(\Delta\text{elpd}_{AB}\) | \(\sum_i (l_{i,A} - l_{i,B})\) | Pairwise model difference |
| Stacking \(\underline{w}^*\) | Maximizes LOO mixture log-score | Optimal ensemble weights |
Using model comparison in SCRIBE¶
Quick example¶
```python
import scribe
from scribe.mc import compare_models

# adata: an AnnData object holding the raw count matrix

# Fit two models on the same data
results_nbdm = scribe.fit(adata)
results_hier = scribe.fit(adata, prob_prior="horseshoe")

# Compare
mc = compare_models(
    [results_nbdm, results_hier],
    counts=adata.X,
    model_names=["NBDM", "Hierarchical"],
    gene_names=adata.var_names.tolist(),
    compute_gene_liks=True,
)

# Ranked summary table
print(mc.summary())

# PSIS-LOO diagnostics (k-hat values)
print(mc.diagnostics())

# Per-gene comparison
gene_df = mc.gene_level_comparison("NBDM", "Hierarchical")
print(gene_df.head(20))

# Stacking weights for optimal ensemble
w = mc.stacking_weights()
print(w)
```
Choosing a criterion¶
| Scenario | Recommended |
|---|---|
| General use | PSIS-LOO |
| Quick exploratory comparison | WAIC |
| Very large dataset, PSIS-LOO too slow | WAIC |
| Need per-observation diagnostics | PSIS-LOO (\(\hat{k}\) values) |
| No single model dominates | Model stacking |
API quick reference¶
| Function | What it does |
|---|---|
| `compare_models` | High-level entry point: fits, ranks, and returns results |
| `waic` | JAX-accelerated WAIC from a log-likelihood matrix |
| `compute_psis_loo` | PSIS-LOO with Pareto fitting and \(\hat{k}\) diagnostics |
| `gene_level_comparison` | Per-gene elpd differences with SE and z-scores |
| `compute_stacking_weights` | Optimal ensemble weights via convex optimization |
| `pseudo_bma_weights` | AIC-style approximate model weights |
For full API details, see the Model Comparison guide and the API Reference.