Skip to content

Bayesian Model Comparison

SCRIBE offers several model families---NBDM, hierarchical, ZINB, VCP, mixtures---and a key practical question is: which model should I use for my data? Rather than relying on heuristics or fixed recipes, SCRIBE provides a principled Bayesian framework that ranks models by their out-of-sample predictive accuracy, quantifies the uncertainty in that ranking, and can even construct optimal model ensembles.

This page develops the theory behind WAIC, PSIS-LOO, pairwise comparison, and model stacking. For API usage and code examples, see the Model Comparison guide.


The central quantity: expected log predictive density

All model comparison in SCRIBE revolves around the expected log predictive density (elpd), which measures how well a fitted model predicts new, unseen data drawn from the true data-generating process \(p_t\):

\[ \text{elpd} = \int \log p(\tilde{y} \mid y)\, p_t(\tilde{y})\, d\tilde{y}, \]

where the posterior predictive distribution averages the likelihood over the posterior:

\[ p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta. \]

Since \(p_t\) is unknown, we estimate elpd from the observed data.

Leave-one-out cross-validation

The gold standard is leave-one-out cross-validation (LOO-CV):

\[ \text{elpd}_{\text{loo}} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i}), \]

where \(y_{-i}\) is the dataset with observation \(i\) removed. Each term is the log predictive density for the held-out observation:

\[ p(y_i \mid y_{-i}) = \int p(y_i \mid \theta)\, p(\theta \mid y_{-i})\, d\theta. \]

Exact LOO-CV requires \(n\) separate model fits---prohibitive for single-cell datasets with thousands of cells. SCRIBE therefore uses two efficient approximations that compute elpd from a single posterior inference.


WAIC: analytical LOO approximation

The Widely Applicable Information Criterion (WAIC; Watanabe, 2010) is a fast, fully analytical approximation to LOO-CV that requires only the posterior samples already available after fitting.

Log pointwise predictive density

The log pointwise predictive density (lppd) sums the predictive probability of each observed data point, averaged over the posterior:

\[ \text{lppd} = \sum_{i=1}^{n} \log\!\left( \frac{1}{S}\sum_{s=1}^{S} p(y_i \mid \theta^{(s)}) \right), \]

where \(\{\theta^{(s)}\}_{s=1}^{S}\) are the posterior samples. In practice, the inner sum is evaluated with log-sum-exp for numerical stability.

Effective number of parameters

WAIC penalizes model complexity through an overfitting correction \(p_{\text{waic}}\) that estimates the effective number of parameters. The preferred version (Gelman et al., 2014) uses the variance of the log-likelihood across posterior draws:

\[ p_{\text{waic}} = \sum_{i=1}^{n} \widehat{\text{Var}}_s\!\bigl(\log p(y_i \mid \theta^{(s)})\bigr). \]

Each term measures how much the posterior disagrees about observation \(i\): high variance means the model is "spending" parameters on that data point.

The WAIC criterion

\[ \text{elppd}_{\text{waic}} = \text{lppd} - p_{\text{waic}}, \qquad \text{WAIC} = -2\,\text{elppd}_{\text{waic}}. \]

Lower WAIC is better. The \(-2\) scaling places WAIC on the same deviance scale as AIC for easy comparison.

When WAIC can mislead

WAIC is asymptotically equivalent to LOO-CV under mild regularity conditions. However, it can be unreliable when the posterior has heavy tails or is dominated by a few influential observations---precisely the situation flagged by PSIS-LOO's built-in diagnostic.


Pareto-Smoothed Importance Sampling LOO (PSIS-LOO; Vehtari et al., 2017) is more robust than WAIC and comes with a per-observation reliability diagnostic.

Importance sampling for LOO

The key insight is that the LOO posterior \(p(\theta \mid y_{-i})\) can be obtained from the full-data posterior via importance weighting:

\[ p(\theta \mid y_{-i}) \propto \frac{p(\theta \mid y)}{p(y_i \mid \theta)}. \]

This gives raw importance weights \(w_i^{(s)} = 1 / p(y_i \mid \theta^{(s)})\) that reweight the full posterior to approximate the LOO posterior---without refitting.

The heavy-tail problem

Some observations are so influential that the raw weights are extremely variable, making the IS estimator unreliable. PSIS stabilizes this by fitting a generalized Pareto distribution (GPD) to the upper tail of the weight distribution and replacing the noisy empirical tail with smooth GPD quantiles.

The Pareto k diagnostic

The GPD shape parameter \(\hat{k}\) directly measures how heavy the tail is for each observation:

\(\hat{k}\) range Interpretation
< 0.5 Excellent --- reliable estimate
0.5 -- 0.7 Acceptable --- usable but worth monitoring
>= 0.7 Problematic --- IS approximation unreliable

In single-cell data, high \(\hat{k}\) values typically correspond to cells with unusually high or low total UMI counts, or rare cell types that are highly influential for the posterior.

PSIS-LOO elpd

After Pareto smoothing the weights, the LOO elpd is:

\[ \text{elpd}_{\text{psis-loo}} = \sum_{i=1}^{n} \log \hat{p}_{\text{psis}}(y_i \mid y_{-i}), \]

with associated criterion \(\text{LOO-IC} = -2\,\text{elpd}_{\text{psis-loo}}\).

Default recommendation

Use PSIS-LOO by default. It achieves lower bias than WAIC for realistic posterior geometries and provides per-observation diagnostics. Use WAIC only for quick exploratory comparisons or when data is very large and PSIS-LOO is too slow.


Pairwise comparison with uncertainty

Given two models \(M_A\) and \(M_B\) evaluated on the same data, define the pointwise elpd difference:

\[ d_i = l_{i,A} - l_{i,B}, \]

where \(l_{i,\cdot}\) is the pointwise LOO log predictive density. The total difference and its standard error are:

\[ \Delta\text{elpd}_{AB} = \sum_{i=1}^{n} d_i, \qquad \widehat{\text{SE}} = \sqrt{\sum_{i=1}^{n}(d_i - \bar{d})^2}. \]

The ratio \(z_{AB} = \Delta\text{elpd}_{AB}\, /\, \widehat{\text{SE}}\) gives a scale-free signal-to-noise measure:

| \(|z_{AB}|\) | Interpretation | |---------------|----------------| | Much greater than 1 | Strong evidence for one model | | Close to 1 | Difference is not practically meaningful |

Not a p-value

The z-score is a descriptive signal-to-noise ratio, not a frequentist test statistic. It does not have a calibrated Type I error interpretation.


Gene-level comparison

Beyond the cell-level ranking, SCRIBE computes per-gene elpd differences by evaluating the gene-level log-likelihood:

\[ \ell_g^{(s)} = \sum_{c=1}^{C} \log p(u_{gc} \mid \theta^{(s)}). \]

Applying WAIC or PSIS-LOO to these gene-level log-likelihoods reveals which genes benefit most from one model's added flexibility. For example, when comparing NBDM to the hierarchical model, genes with highly variable total counts across cells typically show the largest improvement.


Model stacking

Instead of choosing one winner, model stacking (Yao et al., 2018) constructs an optimal predictive ensemble. The stacking weights \(\underline{w}^* \in \Delta^{K-1}\) maximize the LOO log-score of the mixture:

\[ \underline{w}^* = \underset{\underline{w}\,\in\,\Delta^{K-1}}{\arg\max} \sum_{i=1}^{n}\log\sum_{k=1}^{K} w_k\,\hat{p}(y_i \mid y_{-i}, M_k). \]

The objective is concave (log of a linear function), so the unique optimum is found efficiently by standard convex solvers. Stacking weights differ from Bayesian model averaging: they optimize predictive performance on held-out data, making them more robust when models are misspecified or structurally similar.

For comparison, SCRIBE also provides pseudo-BMA weights:

\[ w_k^{\text{BMA}} \propto \exp\!\bigl(-\tfrac{1}{2}\,\text{WAIC}_k\bigr), \]

which mimic the classical AIC weight formula.


Application to SCRIBE's model families

Pointwise log-likelihood

For the standard NBDM model with shared success probability \(\hat{p}\), the per-cell log-likelihood under posterior sample \(s\) factorizes into a total-count term and a composition term:

\[ \ell_c^{(s)} = \log \text{NB}(u_{T,c} \mid r_T^{(s)}, \hat{p}^{(s)}) + \log \text{DM}(\underline{u}_c \mid u_{T,c}, \underline{r}^{(s)}), \]

where \(u_{T,c} = \sum_g u_{gc}\) is the total UMI count and \(r_T^{(s)} = \sum_g r_g^{(s)}\).

For the hierarchical model with gene-specific \(p_g\), the NB-DM factorization breaks down (see the Hierarchical p theory page), and the per-cell log-likelihood becomes a sum over genes:

\[ \ell_c^{(s)} = \sum_{g=1}^{G} \log \text{NB}(u_{gc} \mid r_g^{(s)}, p_g^{(s)}). \]

Both are computed automatically by SCRIBE's log_likelihood method and stored as an \(S \times C\) matrix for downstream comparison.

What PSIS-LOO detects

Comparing NBDM to the hierarchical model with PSIS-LOO answers: does allowing gene-specific \(p_g\) improve out-of-sample predictions at the cell level? A positive \(\Delta\text{elpd}\) with \(|z| > 2\) indicates meaningful improvement. Conversely, when the posterior of the hierarchical model concentrates \(\sigma_p\) near zero, both models predict equally well.


Summary of key quantities

Quantity Formula What it measures
lppd \(\sum_i \log \frac{1}{S}\sum_s \exp(\ell_i^{(s)})\) In-sample predictive fit
\(p_{\text{waic}}\) \(\sum_i \widehat{\text{Var}}_s(\ell_i^{(s)})\) Effective number of parameters
WAIC \(-2(\text{lppd} - p_{\text{waic}})\) Penalized predictive fit (lower is better)
elpd (PSIS-LOO) \(\sum_i \log \hat{p}_{\text{psis}}(y_i \mid y_{-i})\) Out-of-sample predictive fit
\(\hat{k}_i\) GPD shape on IS weights Per-observation reliability
\(\Delta\text{elpd}_{AB}\) \(\sum_i (l_{i,A} - l_{i,B})\) Pairwise model difference
Stacking \(\underline{w}^*\) Maximizes LOO mixture log-score Optimal ensemble weights

Using model comparison in SCRIBE

Quick example

from scribe.mc import compare_models

# Fit two models on the same data
results_nbdm = scribe.fit(adata)
results_hier = scribe.fit(adata, prob_prior="horseshoe")

# Compare
mc = compare_models(
    [results_nbdm, results_hier],
    counts=adata.X,
    model_names=["NBDM", "Hierarchical"],
    gene_names=adata.var_names.tolist(),
    compute_gene_liks=True,
)

# Ranked summary table
print(mc.summary())

# PSIS-LOO diagnostics (k-hat values)
print(mc.diagnostics())

# Per-gene comparison
gene_df = mc.gene_level_comparison("NBDM", "Hierarchical")
print(gene_df.head(20))

# Stacking weights for optimal ensemble
w = mc.stacking_weights()
print(w)

Choosing a criterion

Scenario Recommended
General use PSIS-LOO
Quick exploratory comparison WAIC
Very large dataset, PSIS-LOO too slow WAIC
Need per-observation diagnostics PSIS-LOO (\(\hat{k}\) values)
No single model dominates Model stacking

API quick reference

Function What it does
compare_models High-level entry point: fits, ranks, and returns results
waic JAX-accelerated WAIC from log-likelihood matrix
compute_psis_loo PSIS-LOO with Pareto fitting and \(\hat{k}\) diagnostics
gene_level_comparison Per-gene elpd differences with SE and z-scores
compute_stacking_weights Optimal ensemble weights via convex optimization
pseudo_bma_weights AIC-style approximate model weights

For full API details, see the Model Comparison guide and the API Reference.