Differential Expression¶

SCRIBE provides a fully Bayesian framework for differential expression (DE) analysis that works in compositional space, accounts for gene-gene correlations, and provides true posterior probabilities instead of frequentist p-values.

Overview¶

The DE framework is:

Compositional -- works in CLR (Centered Log-Ratio) or ILR (Isometric Log-Ratio) space, ensuring reference-invariant results
Correlation-aware -- leverages low-rank covariance structure to account for gene-gene correlations
Fully Bayesian -- provides exact posterior probabilities under the Gaussian assumption (parametric) or via Monte Carlo counting (empirical)
Computationally efficient -- all operations are \(O(kG)\) or \(O(k^2 G)\) where \(k \ll G\), avoiding \(O(G^3)\) dense matrix operations

Quick Start¶

from scribe.de import compare

# Fit models for two conditions
results_A = scribe.fit(adata_treatment, model="nbdm")
results_B = scribe.fit(adata_control, model="nbdm")

# Create comparison (empirical method, recommended)
de = compare(
    results_A, results_B,
    method="empirical",
    component_A=0, component_B=0,
)

# Gene-level analysis with practical significance threshold
results = de.gene_level(tau=jnp.log(1.1))

# Call DE genes (Bayesian decision)
is_de = de.call_genes(lfsr_threshold=0.05, prob_effect_threshold=0.95)
print(f"Found {is_de.sum()} DE genes")

# Summary table
print(de.summary(sort_by="lfsr", top_n=20))

Three DE Methods¶

The compare() factory returns different result types depending on method=:

Method	When to use	How it works
Parametric	Gaussian assumption holds	Analytic posteriors from fitted logistic-normal models
Empirical	General use (recommended)	Monte Carlo counting on Dirichlet-sampled compositions
Shrinkage	Most genes are not DE	Empirical Bayes shrinkage on top of empirical results

Parametric¶

Requires pre-fitted logistic-normal models. All statistics are computed analytically:

# Fit logistic-normal models first
model_A = results_A.fit_logistic_normal(rank=16)
model_B = results_B.fit_logistic_normal(rank=16)

de = compare(model_A, model_B, gene_names=adata.var_names.tolist())

Empirical (recommended)¶

Works directly with posterior samples via Monte Carlo counting -- no distributional assumptions required:

de = compare(
    results_A, results_B,
    method="empirical",
    component_A=0, component_B=0,
)

When results objects are passed, compare() automatically:

Extracts r samples from posterior_samples["r"]
Detects hierarchical models (gene-specific \(p\)) and extracts \(p\) samples
Infers gene names from results.var.index

Shrinkage¶

Improves on empirical DE by learning a genome-wide effect-size distribution and adaptively shrinking noisy estimates toward zero:

de = compare(
    results_A, results_B,
    method="shrinkage",
    component_A=0, component_B=0,
)

results = de.gene_level(tau=jnp.log(1.1))
print(f"Estimated null proportion: {de.null_proportion:.2%}")

Zero-copy upgrade

You can upgrade an existing empirical result to shrinkage without recomputing the expensive Dirichlet sampling:

de_emp = compare(results_A, results_B, method="empirical", ...)
de_shrink = de_emp.shrink()

Gene-Level Analysis¶

All DE result types share the same analysis API:

Method	Description
`de.gene_level(tau)`	Per-gene posterior summaries (lfsr and lfsr_tau)
`de.call_genes(tau, lfsr_threshold, prob_effect_threshold)`	Bayesian gene calling
`de.summary(tau, sort_by, top_n)`	Formatted results table
`de.to_dataframe(tau, target_pefp)`	Export to pandas DataFrame

Exporting results¶

# Basic export
df = de.to_dataframe(tau=0.5)

# With automatic PEFP-controlled DE calls
df = de.to_dataframe(tau=0.5, target_pefp=0.05)
de_genes = df[df["clr_is_de"]]

# Include biological metrics (empirical/shrinkage only)
df = de.to_dataframe(metrics="all", tau=0.5)

Bayesian Error Control¶

Unlike frequentist methods that use FDR, SCRIBE uses true posterior probabilities:

Metric	Description
lfsr	Local false sign rate -- posterior probability of having the wrong sign
lfsr_tau	Modified lfsr incorporating practical significance threshold \(\tau\)
PEFP	Posterior expected false discovery proportion -- Bayesian analogue of FDR

Controlling the false discovery rate¶

import jax.numpy as jnp

# Gene-level analysis with practical significance
results = de.gene_level(tau=jnp.log(1.1))

# Find the lfsr threshold that controls PEFP at 5%
threshold = de.find_threshold(target_pefp=0.05)
print(f"lfsr threshold for 5% PEFP: {threshold:.3f}")

# Call genes at that threshold
is_de = de.call_genes(lfsr_threshold=threshold)

# Verify the Bayesian FDR
pefp = de.compute_pefp(threshold=0.05)
print(f"Expected false discovery proportion: {pefp:.3f}")

Biological-Level DE¶

While CLR-based metrics operate in the compositional simplex, biological-level DE computes metrics directly on the denoised Negative Binomial distribution. This is especially valuable for lowly expressed genes where compositional artifacts dominate.

Metric	What it captures
Biological LFC	Mean expression shift: \(\log(\mu_A / \mu_B)\)
Log-variance ratio	Dispersion shift: \(\log(\text{var}_A / \text{var}_B)\)
Gamma Jeffreys divergence	Full distributional shift via symmetrized KL

# Biological-level DE (empirical/shrinkage only)
bio = de.biological_level(
    tau_lfc=jnp.log(1.5),
    tau_var=jnp.log(2.0),
    tau_kl=0.5,
)

bio["lfc_mean"]     # posterior mean biological LFC per gene
bio["lfc_lfsr"]     # local false sign rate for LFC
bio["lvr_mean"]     # posterior mean log-variance ratio
bio["kl_mean"]      # posterior mean Jeffreys divergence

Recommended workflow

Screen with CLR: use CLR-based lfsr as the primary filter
Validate with biological LFC: filter out compositional artifacts
Detect variance changes: inspect log-variance ratio for genes with small LFC but high distributional shift
Flag distributional shifts: use Jeffreys divergence as a catch-all

Gene Expression Filter¶

Low-expression genes can appear spuriously DE due to compositional artifacts. The gene_mask parameter aggregates filtered genes into a single "other" pseudo-gene before Dirichlet sampling:

from scribe.de import compare, compute_expression_mask

# Build a mask from MAP mean expression
mask = compute_expression_mask(
    results_A, results_B,
    component_A=0, component_B=0,
    min_mean_expression=1.0,
)

# Pass to compare()
de = compare(
    results_A, results_B,
    method="empirical",
    component_A=0, component_B=0,
    gene_mask=mask,
)

Interactive mask exploration¶

After the initial comparison, you can change the expression mask without re-running the expensive Dirichlet sampling:

# Explore a different threshold
de.set_expression_threshold(min_expression=3.0)
df2 = de.to_dataframe(tau=0.5)

# Apply a custom mask
de.set_gene_mask(my_custom_mask)

# Restore all genes
de.clear_mask()

Pathway Analysis¶

The empirical DE path supports pathway-level analysis via ILR balances:

Single pathway test¶

import jax.numpy as jnp

# Test whether a pathway shifts as a whole
pathway = jnp.array([10, 25, 42, 101, 200])
result = de.test_gene_set(pathway, tau=jnp.log(1.1))
print(f"Balance: {result['balance_mean']:.3f} +/- {result['balance_sd']:.3f}")
print(f"lfsr: {result['lfsr']:.4f}")

Within-pathway perturbation test¶

Detects coordinated rearrangement within a pathway even when the average balance is near zero:

perturb = de.test_pathway_perturbation(pathway, n_permutations=999)
print(f"Perturbation T: {perturb['t_obs']:.4f}, p={perturb['p_value']:.4f}")

Batch testing with PEFP control¶

gene_sets = [
    jnp.array([10, 25, 42, 101, 200]),
    jnp.array([5, 8, 15, 33]),
    jnp.array([60, 70, 80, 90, 100]),
]
batch = de.test_multiple_gene_sets(gene_sets, target_pefp=0.05)

Multi-Dataset Comparisons¶

For multi-dataset models, compare_datasets() is a convenience wrapper that slices per-dataset views and preserves within-posterior correlation:

from scribe.de import compare_datasets

de = compare_datasets(results, dataset_A=0, dataset_B=1)
de = compare_datasets(results, 0, 1, component=0, method="shrinkage")

Label-based component matching¶

When mixture models are fit with annotation_key, you can look up components by label (e.g., cell type name) instead of tracking indices manually:

from scribe.de import compare, match_components_by_label, get_shared_labels

# Find component indices for "Fibroblast" in both results
idx_A, idx_B = match_components_by_label(results_A, results_B, "Fibroblast")

de = compare(
    results_A, results_B,
    method="empirical",
    component_A=idx_A, component_B=idx_B,
)

# Discover labels shared across results
labels = get_shared_labels(results_A, results_B)

Class Hierarchy¶

ScribeDEResults (base)
├── ScribeParametricDEResults   -- analytic Gaussian
├── ScribeEmpiricalDEResults    -- Monte Carlo counting
└── ScribeShrinkageDEResults    -- empirical Bayes shrinkage (extends Empirical)

The compare() factory returns the appropriate subclass based on method=. All shared methods (call_genes, compute_pefp, find_threshold, summary) work identically on all three.

For the full API, see the API Reference.