
Welcome to SCRIBE

SCRIBE (Single-Cell RNA-seq Inference with Bayesian Estimation) is a comprehensive Python package for Bayesian analysis of single-cell RNA sequencing (scRNA-seq) data. Built on JAX and NumPyro, SCRIBE provides a unified framework for probabilistic modeling, variational inference, uncertainty quantification, differential expression, and model comparison in single-cell genomics.

Generative Model

Figure: Biophysical generative model underlying SCRIBE. Transcription and degradation set the steady-state mRNA content per gene, giving rise to a Negative Binomial distribution. Binomial capture sub-sampling yields the observed UMI counts.

SCRIBE is grounded in a biophysical generative model of scRNA-seq count data. Transcription (rate \(b\)) and degradation (rate \(\gamma\)) set the steady-state mRNA content per gene, giving rise to a Negative Binomial distribution over true molecular counts \(m_g\) with parameters \(r_g\) and \(p_g\). During library preparation each molecule is independently captured with cell-specific probability \(\nu^{(c)}\), so the observed UMI count \(u_g\) follows a Binomial sub-sampling of \(m_g\). Marginalizing over the latent counts yields a Negative Binomial likelihood for the observations with an effective success probability \(\hat{p}_g^{(c)}\) that absorbs the capture efficiency.
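The marginalization step can be checked numerically. The sketch below is not SCRIBE code: it writes the Negative Binomial in terms of its shape \(r\) and mean \(\mu\) (a convention assumed here so the check does not depend on how \(p\) is defined), sums the Binomial sub-sampling over the latent counts by brute force, and compares the result with a Negative Binomial of the same shape and mean \(\nu\mu\). The function names and parameter values are illustrative.

```python
from math import exp, lgamma, log


def nb_logpmf(k, r, mu):
    """Log-PMF of a Negative Binomial with shape r and mean mu."""
    return (
        lgamma(k + r) - lgamma(r) - lgamma(k + 1)
        + r * log(r / (r + mu))
        + k * log(mu / (r + mu))
    )


def binom_logpmf(k, n, q):
    """Log-PMF of Binomial(n, q) evaluated at k, for 0 < q < 1."""
    return (
        lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        + k * log(q)
        + (n - k) * log(1.0 - q)
    )


def marginal_pmf(u, r, mu, nu, m_max=2000):
    """P(u) after summing the latent count m out of NB(m) * Binomial(u | m, nu)."""
    return sum(
        exp(nb_logpmf(m, r, mu) + binom_logpmf(u, m, nu))
        for m in range(u, m_max)
    )


r, mu, nu = 2.0, 50.0, 0.1  # illustrative shape, mean, and capture probability

# Largest gap between the brute-force marginal and NB(shape=r, mean=nu * mu)
max_gap = max(
    abs(marginal_pmf(u, r, mu, nu) - exp(nb_logpmf(u, r, nu * mu)))
    for u in range(25)
)
print(f"max |marginal - NB(r, nu * mu)| = {max_gap:.2e}")
```

The gap should be negligible as long as `m_max` covers the tail of the latent distribution: capture leaves the shape \(r\) unchanged and scales the mean by \(\nu\), which is the change that the effective success probability absorbs.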

For the full mathematical derivation, see the Theory section.

Why SCRIBE?

  • Unified Framework: Single scribe.fit() interface for SVI, MCMC, and VAE inference methods
  • Compositional Models: Four constructive likelihoods -- from the base Negative Binomial up to zero-inflated models with variable capture probability
  • Compositional Differential Expression: Bayesian DE in log-ratio coordinates with proper uncertainty propagation and error control (lfsr, PEFP)
  • Model Comparison: WAIC, PSIS-LOO, stacking weights, and goodness-of-fit diagnostics for principled model selection
  • GPU Accelerated: JAX-based implementation with automatic GPU support
  • Flexible Architecture: Three parameterizations, constrained/unconstrained modes, hierarchical priors, horseshoe sparsity, and normalizing flows
  • Scalable: From small experiments to large-scale atlases with mini-batch support
  • Production-Ready CLI: scribe-infer provides reproducible, config-driven inference with SLURM integration and automatic covariate-split orchestration; scribe-visualize generates diagnostic plots for any completed run

Key Features

  • Three Inference Methods:
    • SVI for speed and scalability
    • MCMC (NUTS) for exact Bayesian inference
    • VAE for representation learning with normalizing flow priors
  • Constructive Likelihood System: Negative Binomial as the base, extended with zero inflation and/or variable capture probability
  • Multiple Parameterizations: canonical, mean_prob, and mean_odds (aliases: standard, linked, and odds_ratio, respectively), each with constrained or unconstrained priors
  • Advanced Guide Families: Mean-field, low-rank, joint low-rank, and amortized variational guides
  • Mixture Models: K-component mixtures for cell type discovery with annotation-guided priors
  • Hierarchical Priors: Gene-specific and dataset-level hierarchical structures with optional horseshoe sparsity
  • Bayesian Differential Expression: Parametric, empirical (Monte Carlo), and shrinkage (empirical Bayes) methods in CLR/ILR coordinates
  • Model Comparison: WAIC, PSIS-LOO, stacking, per-gene elpd, and goodness-of-fit via randomized quantile residuals
  • Seamless Integration: Works with AnnData and the scanpy ecosystem

Model Construction Space

SCRIBE models are built compositionally. The likelihood is constructed by layering extensions on top of a base Negative Binomial (NB) model, then configured with a parameterization, constraint mode, optional extensions, and an inference method:

```mermaid
graph TD
    subgraph likelihood ["1 - Likelihood Construction"]
        NB["Negative Binomial<br/><i>base model</i>"]
        ZINB["Zero-Inflated NB"]
        NBcapture["NB + variable capture"]
        ZINBcapture["ZINB + variable capture"]
        NB -->|"+ zero inflation"| ZINB
        NB -->|"+ variable capture"| NBcapture
        ZINB -->|"+ variable capture"| ZINBcapture
        NBcapture -->|"+ zero inflation"| ZINBcapture
    end

    subgraph parameterization ["2 - Parameterization"]
        canonical["canonical<br/><i>sample p, r directly</i>"]
        meanProbs["mean_prob<br/><i>sample p, mu; derive r</i>"]
        meanOdds["mean_odds<br/><i>sample phi, mu; derive p, r</i>"]
    end

    subgraph constraint ["3 - Constraint Mode"]
        constrained["constrained<br/><i>Beta, LogNormal, BetaPrime</i>"]
        unconstr["unconstrained<br/><i>Normal + transforms</i>"]
    end

    subgraph extensions ["4 - Optional Extensions"]
        mixture["Mixture<br/><i>K components</i>"]
        hierarchical["Hierarchical Priors<br/><i>gene-specific p, gate</i>"]
        multiDataset["Multi-Dataset<br/><i>per-dataset parameters</i>"]
        horseshoe["Horseshoe<br/><i>sparsity priors</i>"]
        annotationPrior["Annotation Priors<br/><i>soft cell-type labels</i>"]
        bioCap["Biology-Informed<br/><i>capture prior</i>"]
    end

    subgraph infer ["5 - Inference Method"]
        SVI_node["SVI<br/><i>fast, scalable</i>"]
        MCMC_node["MCMC<br/><i>exact posterior</i>"]
        VAE_node["VAE<br/><i>learned representations</i>"]
    end

    subgraph guide ["6 - Guide Family"]
        meanField["Mean-Field"]
        lowRank["Low-Rank"]
        jointLowRank["Joint Low-Rank"]
        amortized["Amortized"]
        flows["Normalizing Flows<br/><i>VAE prior</i>"]
    end

    likelihood --> parameterization
    parameterization --> constraint
    constraint --> extensions
    extensions --> infer
    SVI_node --> guide
    VAE_node --> guide
```

This compositional design gives 4 likelihoods x 3 parameterizations x 2 constraint modes = 24 base configurations as a starting point. On top of any of these you can layer mixture components, hierarchical priors, multi-dataset structure, and more.
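To make the size of the base configuration space concrete, the short sketch below enumerates it. The likelihood codes and parameterization names are the ones documented on this page; the constraint-mode strings are used only as labels here and are not a confirmed `scribe.fit()` argument.

```python
from itertools import product

likelihoods = ["nbdm", "nbvcp", "zinb", "zinbvcp"]
parameterizations = ["canonical", "mean_prob", "mean_odds"]
# Labels only: how the constraint mode is passed to SCRIBE is not shown here.
constraint_modes = ["constrained", "unconstrained"]

# Cartesian product of the three axes: 4 * 3 * 2 = 24 combinations
base_configs = list(product(likelihoods, parameterizations, constraint_modes))

for likelihood, parameterization, mode in base_configs[:3]:
    print(likelihood, parameterization, mode)
print(len(base_configs))
```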

Available Models

Likelihood Construction

SCRIBE's four likelihoods build on each other -- the base Negative Binomial model can be extended with zero inflation and/or variable capture probability:

| Likelihood | Code | Construction | Extra Parameters | Best For |
|---|---|---|---|---|
| Negative Binomial | "nbdm" | Base model | -- | Very tight total-UMI distribution |
| NB + variable capture | "nbvcp" | NB + capture probability | p_capture | Typical heterogeneous library sizes |
| Zero-Inflated NB | "zinb" | NB + zero inflation | gate | Excess zeros after variable capture (VCP) is ruled out |
| ZINB + variable capture | "zinbvcp" | ZINB + capture probability | gate, p_capture | Both zero inflation (ZI) and VCP supported by diagnostics |

Any of the above can be extended to mixture models with n_components=K for subpopulation analysis.
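As an illustration of what the zero-inflation extension adds, here is a generic zero-inflated Negative Binomial PMF. It is a textbook sketch rather than SCRIBE's implementation: `gate` is assumed to be the probability of a structural zero, and the Negative Binomial is written in shape/mean form for convenience.

```python
from math import exp, lgamma, log


def nb_pmf(k, r, mu):
    """Negative Binomial PMF with shape r and mean mu."""
    return exp(
        lgamma(k + r) - lgamma(r) - lgamma(k + 1)
        + r * log(r / (r + mu))
        + k * log(mu / (r + mu))
    )


def zinb_pmf(k, r, mu, gate):
    """Zero-inflated NB: a structural zero with probability gate, else an NB draw."""
    base = (1.0 - gate) * nb_pmf(k, r, mu)
    return gate + base if k == 0 else base


r, mu, gate = 2.0, 5.0, 0.3  # illustrative values

print(f"P(0) under NB:   {nb_pmf(0, r, mu):.4f}")
print(f"P(0) under ZINB: {zinb_pmf(0, r, mu, gate):.4f}")
```

The gate only moves probability mass onto zero; every non-zero count keeps its Negative Binomial shape, scaled by `1 - gate`.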

Parameterizations

Each likelihood can be parameterized in three ways:

| Name | parameterization= | Aliases | Core | Derived | When to Use |
|---|---|---|---|---|---|
| Canonical | canonical | standard | p, r | -- | Direct interpretation |
| Mean probs | mean_prob | linked | p, mu | r = mu(1-p)/p | Couples mean and p |
| Mean odds | mean_odds | odds_ratio | phi, mu | p = 1/(1+phi), r = mu*phi | Stable when p is near 1 |
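The derived-parameter formulas in the table can be written out directly. The helper names below are hypothetical, not part of the SCRIBE API; the sketch only checks that the two non-canonical parameterizations agree with each other. Both formulas imply a mean of r * p / (1 - p), which is read off from the table rather than taken from the Theory section.

```python
from math import isclose


def from_mean_prob(p, mu):
    """mean_prob: sample p and mu, derive r = mu * (1 - p) / p."""
    return p, mu * (1.0 - p) / p


def from_mean_odds(phi, mu):
    """mean_odds: sample phi and mu, derive p = 1 / (1 + phi) and r = mu * phi."""
    return 1.0 / (1.0 + phi), mu * phi


mu, phi = 12.0, 0.25  # illustrative values

p, r = from_mean_odds(phi, mu)
p_check, r_check = from_mean_prob(p, mu)

# Both routes should land on the same canonical (p, r) pair.
print(isclose(r, r_check))

# Mean implied by the table's formulas: r * p / (1 - p) recovers mu.
print(isclose(r * p / (1.0 - p), mu))
```

Since (1 - p) / p equals phi when p = 1 / (1 + phi), the mean_odds form avoids dividing by a quantity close to zero when working with p near 1, which is consistent with the "Stable when p is near 1" note.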

Constrained vs Unconstrained

| Mode | Prior Distributions | Use Case |
|---|---|---|
| Constrained | Beta, LogNormal, BetaPrime | Default; interpretable parameters |
| Unconstrained | Normal + sigmoid/exp transforms | Optimization-friendly; required for hierarchical priors |
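The unconstrained mode can be pictured as sampling on the real line and mapping back to the valid range. The sketch below uses only the standard library and illustrates the general idea, not SCRIBE's internals; the standard-normal priors are arbitrary choices for the example.

```python
import math
import random


def sigmoid(z):
    """Map a real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)

# Draw unconstrained values from Normal(0, 1), then transform.
draws = []
for _ in range(1000):
    z_p = rng.gauss(0.0, 1.0)
    z_r = rng.gauss(0.0, 1.0)
    p = sigmoid(z_p)   # probability-like parameter in (0, 1)
    r = math.exp(z_r)  # positive parameter in (0, inf)
    draws.append((p, r))

print(min(p for p, _ in draws), max(p for p, _ in draws))
print(min(r for _, r in draws))
```

Because the optimizer or sampler works on the unconstrained values, no update can push a parameter out of its valid range.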

Quick Start

```python
import scribe
import scanpy as sc

# Load your single-cell data
adata = sc.read_h5ad("your_data.h5ad")

# Default model includes variable capture; add low-rank guide for gene-gene correlations
results = scribe.fit(adata, guide_rank=64)

# Analyze results
posterior_samples = results.get_posterior_samples()
```

Customize with Simple Arguments

```python
# Zero-inflated model with more optimization steps
results = scribe.fit(
    adata,
    zero_inflation=True,
    n_steps=100_000,
    batch_size=512,
)

# Linked parameterization with low-rank guide
results = scribe.fit(
    adata,
    model="nbdm",
    parameterization="linked",
    guide_rank=15,
)

# Mixture model for cell type discovery
results = scribe.fit(
    adata,
    zero_inflation=True,
    n_components=3,
    n_steps=150_000,
)
```

Choose Your Inference Method

| Method | Engine | Precision | Use Case |
|---|---|---|---|
| SVI | Adam optimizer | float32 | Fast exploration, large datasets |
| MCMC | NUTS sampler | float64 | Exact posterior, gold standard |
| VAE | Encoder-decoder | float32 | Latent representations, embeddings |

```python
# Fast exploration with SVI (default)
svi_results = scribe.fit(adata, zero_inflation=True, n_steps=75_000)

# Exact inference with MCMC
mcmc_results = scribe.fit(
    adata,
    model="nbdm",
    inference_method="mcmc",
    n_samples=3000,
    n_chains=4,
)

# Representation learning with VAE
vae_results = scribe.fit(
    adata,
    model="nbdm",
    inference_method="vae",
    n_steps=50_000,
)
```

Differential Expression

SCRIBE provides a fully Bayesian differential expression framework that respects the compositional nature of scRNA-seq data. All comparisons are performed in log-ratio coordinates (CLR/ILR), propagating full posterior uncertainty.

| Method | Description | Use Case |
|---|---|---|
| Parametric | Analytic Gaussian in ALR space | Fast, requires low-rank logistic-normal fit |
| Empirical | Monte Carlo CLR differences | Assumption-free, from posterior samples |
| Shrinkage | Empirical Bayes scale-mixture prior | Improved per-gene inference, borrows strength across genes |

```python
import jax.numpy as jnp
import scribe
from scribe import compare

# Fit two conditions (default likelihood; 3-component mixture)
results_ctrl = scribe.fit(adata_ctrl, n_components=3)
results_treat = scribe.fit(adata_treat, n_components=3)

# Empirical DE between component 0 across conditions
de = compare(
    results_treat, results_ctrl,
    method="empirical",
    component_A=0, component_B=0,
)

# Gene-level results with practical significance threshold
gene_results = de.gene_level(tau=jnp.log(1.1))

# Call DE genes controlling false sign rate
is_de = de.call_genes(lfsr_threshold=0.05)
```
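To show what "Monte Carlo CLR differences" and the local false sign rate (lfsr) mean in practice, here is a toy calculation using only the standard library. It is a conceptual sketch, not SCRIBE's implementation: the Dirichlet draws stand in for posterior samples of gene compositions, and the three-gene concentration vectors are invented for the example.

```python
import random
from math import log


def clr(composition):
    """Centered log-ratio: log of each part minus the mean log over all parts."""
    logs = [log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def dirichlet(rng, alpha):
    """One Dirichlet draw via normalized Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
alpha_ctrl = [50.0, 30.0, 20.0]   # toy stand-in for the control posterior
alpha_treat = [30.0, 50.0, 20.0]  # toy stand-in for the treatment posterior
n_samples = 2000
tau = log(1.1)  # practical-significance threshold, as in the example above

# One CLR difference vector per pair of posterior draws
deltas = []
for _ in range(n_samples):
    treat = clr(dirichlet(rng, alpha_treat))
    ctrl = clr(dirichlet(rng, alpha_ctrl))
    deltas.append([t - c for t, c in zip(treat, ctrl)])

lfsr = []
prob_effect = []
for g in range(len(alpha_ctrl)):
    column = [d[g] for d in deltas]
    p_positive = sum(v > 0 for v in column) / n_samples
    lfsr.append(min(p_positive, 1.0 - p_positive))             # chance the sign is wrong
    prob_effect.append(sum(abs(v) > tau for v in column) / n_samples)

for g, (s, e) in enumerate(zip(lfsr, prob_effect)):
    print(f"gene {g}: lfsr={s:.3f}, P(|delta| > tau)={e:.3f}")
```

Working in CLR coordinates means each difference is measured relative to the geometric mean of the composition, so a change in one gene is not confounded by the total count.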

Full guide: Differential Expression

Model Comparison

Principled Bayesian model comparison with WAIC, PSIS-LOO, stacking weights, per-gene elpd differences, and goodness-of-fit diagnostics:

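```python
from scribe import compare_models

# Assumes results_nb and results_hierarchical come from earlier scribe.fit() calls,
# and that counts and gene_names hold the matching count matrix and gene labels.
mc = compare_models(
    [results_nb, results_hierarchical],
    counts=counts,
    model_names=["NB", "Hierarchical"],
    gene_names=gene_names,
)

# Ranked comparison table
print(mc.summary())

# Per-gene elpd differences
gene_df = mc.gene_level_comparison("NB", "Hierarchical")
```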
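For readers unfamiliar with WAIC, the quantity can be computed from a matrix of pointwise log-likelihoods evaluated at posterior draws. The sketch below follows the standard definition (log pointwise predictive density minus the summed posterior variance of the log-likelihood); it is a generic illustration with made-up numbers and says nothing about how `compare_models` computes it internally.

```python
from math import exp, log


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + log(sum(exp(v - m) for v in values) / len(values))


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def waic(log_lik):
    """log_lik[s][i]: log-likelihood of observation i under posterior draw s."""
    n_obs = len(log_lik[0])
    columns = [[row[i] for row in log_lik] for i in range(n_obs)]
    lppd = sum(log_mean_exp(col) for col in columns)      # fit term
    p_waic = sum(sample_variance(col) for col in columns)  # effective parameters
    elpd = lppd - p_waic
    return {"elpd_waic": elpd, "p_waic": p_waic, "waic": -2.0 * elpd}


# Toy matrix: 3 posterior draws x 2 observations (invented values)
toy = [[-1.0, -2.0], [-1.2, -1.8], [-0.9, -2.1]]
result = waic(toy)
print(result)
```

A higher `elpd_waic` (equivalently, a lower `waic`) indicates better expected out-of-sample predictive accuracy; the penalty term grows when the log-likelihood of an observation varies strongly across posterior draws.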

Full guide: Model Comparison

Getting Started

  • Installation -- Install SCRIBE and set up your environment (Installation guide)
  • Quick Overview -- Understand the probabilistic approach behind SCRIBE (Quick overview)
  • Quickstart -- Run your first inference in minutes (Quickstart tutorial)
  • Theory -- Mathematical foundations of the SCRIBE models (Theory)
  • Model Selection -- Choose the right model for your data (Model Selection)
  • User Guide -- Inference methods, DE, model comparison, and more (User guide)
  • scribe-infer CLI -- Reproducible, config-driven inference with SLURM integration (CLI guide)
  • scribe-visualize CLI -- Post-inference diagnostic plots with recursive and SLURM support (Visualization guide)
  • API Reference -- Full reference for all modules and classes (API reference)