Zero-Inflated Negative Binomial with Variable Capture Probability Model (ZINBVCP)

The Zero-Inflated Negative Binomial with Variable Capture Probability (ZINBVCP) model combines aspects of both the ZINB and NBVCP models to handle both technical dropouts and variable capture efficiencies in single-cell RNA sequencing data. This model is particularly useful when the data exhibits both excess zeros and significant variation in total UMI counts across cells.

The ZINBVCP model incorporates two key features:

Zero-inflation to model technical dropouts (from ZINB)
Cell-specific capture probabilities (from NBVCP)

Model Comparison with NBVCP and ZINB

The ZINBVCP model extends both the NBVCP and ZINB models by combining their key features. From the NBVCP model, it inherits the cell-specific capture probabilities \(\nu^{(c)}\) that modify the base success probability \(p\). From the ZINB model, it inherits the gene-specific dropout probabilities \(\pi_g\) that model technical zeros.

The effective success probability for each cell \(c\) is computed as:

\[\hat{p}^{(c)} = \frac{p \nu^{(c)}}{1 - p (1 - \nu^{(c)})} \tag{1}\]

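This is then combined with the dropout mechanism to give a zero-inflated distribution whose negative binomial component uses the cell-specific effective probability. Note that this component can itself produce zeros, in addition to those contributed by the dropout mechanism.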
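As a minimal sketch of Eq. (1), the snippet below computes the effective probability for a few capture probabilities. The function names are illustrative and not part of SCRIBE. The mean formula assumes the convention in which \(\text{NegativeBinomial}(r, p)\) has mean \(r p / (1 - p)\); under that convention, replacing \(p\) with \(\hat{p}^{(c)}\) scales the mean by exactly \(\nu^{(c)}\).

```python
def effective_probability(p: float, nu: float) -> float:
    """Eq. (1): fold the capture probability nu into the base probability p."""
    return p * nu / (1.0 - p * (1.0 - nu))


def nb_mean(r: float, p: float) -> float:
    """Mean of NegativeBinomial(r, p), ASSUMING the convention mean = r * p / (1 - p)."""
    return r * p / (1.0 - p)


p, r = 0.6, 2.0
for nu in (1.0, 0.5, 0.1):
    p_hat = effective_probability(p, nu)
    # Under the assumed convention the mean shrinks by exactly the factor nu.
    print(f"nu={nu:.1f}  p_hat={p_hat:.4f}  mean={nb_mean(r, p_hat):.4f}")
```

With \(\nu^{(c)} = 1\) the effective probability reduces to \(p\), so a perfectly captured cell recovers the plain negative binomial.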

Model Structure

The ZINBVCP model follows a hierarchical structure where:

Each gene has an associated dropout probability \(\pi_g\)
Each cell has an associated capture probability \(\nu^{(c)}\)
The base success probability \(p\) is modified by each cell’s capture probability
For genes that aren’t dropped out, counts follow a negative binomial with cell-specific effective probabilities

Formally, for a dataset with \(N\) cells and \(G\) genes, let \(u_{g}^{(c)}\) be the UMI count for gene \(g\) in cell \(c\). The generative process is:

Draw global success probability: \(p \sim \text{Beta}(\alpha_p, \beta_p)\)
For each gene \(g = 1,\ldots,G\):
- Draw dispersion parameter: \(r_g \sim \text{Gamma}(\alpha_r, \beta_r)\)
- Draw dropout probability: \(\pi_g \sim \text{Beta}(\alpha_{\pi}, \beta_{\pi})\)
For each cell \(c = 1,\ldots,N\):
- Draw capture probability: \(\nu^{(c)} \sim \text{Beta}(\alpha_{\nu}, \beta_{\nu})\)
- Compute effective probability: \(\hat{p}^{(c)} = \frac{p \nu^{(c)}}{1 - p (1 - \nu^{(c)})}\)
- For each gene \(g = 1,\ldots,G\):
  - Draw dropout indicator: \(z_g^{(c)} \sim \text{Bernoulli}(\pi_g)\)
  - If \(z_g^{(c)} = 1\): set \(u_g^{(c)} = 0\)
  - If \(z_g^{(c)} = 0\): draw \(u_g^{(c)} \sim \text{NegativeBinomial}(r_g, \hat{p}^{(c)})\)
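
The generative process above can be sketched as a small forward simulation using only the Python standard library. This is an illustration, not SCRIBE's implementation. The function names and default hyperparameter values are hypothetical, the Gamma prior is assumed to be parameterized by shape and rate, and the negative binomial is assumed to follow the convention with mean \(r p / (1 - p)\), sampled here as a Gamma-Poisson mixture.

```python
import random


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Count unit-rate exponential arrivals in [0, lam]; this count is Poisson(lam)."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < lam:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def sample_negative_binomial(r: float, p: float, rng: random.Random) -> int:
    """Gamma-Poisson mixture; ASSUMES the convention mean = r * p / (1 - p)."""
    lam = rng.gammavariate(r, p / (1.0 - p))  # shape r, scale p / (1 - p)
    return sample_poisson(lam, rng)


def simulate_zinbvcp(n_cells, n_genes, seed=0,
                     prior_p=(2.0, 2.0), prior_r=(2.0, 1.0),
                     prior_pi=(1.0, 4.0), prior_nu=(2.0, 2.0)):
    """Draw a cells-by-genes count matrix following the generative steps above."""
    rng = random.Random(seed)
    p = rng.betavariate(*prior_p)
    # Gamma(alpha_r, beta_r) is ASSUMED to use a rate beta_r, hence scale = 1 / beta_r.
    r = [rng.gammavariate(prior_r[0], 1.0 / prior_r[1]) for _ in range(n_genes)]
    pi = [rng.betavariate(*prior_pi) for _ in range(n_genes)]
    counts = []
    for _ in range(n_cells):
        nu = rng.betavariate(*prior_nu)
        p_hat = p * nu / (1.0 - p * (1.0 - nu))  # Eq. (1)
        row = []
        for g in range(n_genes):
            dropped = rng.random() < pi[g]  # z = 1 with probability pi_g
            row.append(0 if dropped else sample_negative_binomial(r[g], p_hat, rng))
        counts.append(row)
    return counts


counts = simulate_zinbvcp(n_cells=5, n_genes=4)
for row in counts:
    print(row)
```

Each row is one cell and each column one gene. Zeros in the output can come either from the dropout indicator or from the negative binomial draw itself.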

Model Derivation

The ZINBVCP model combines the derivations of the NBVCP and ZINB models. Starting with the standard negative binomial model for mRNA counts:

\[m_g^{(c)} \sim \text{NegativeBinomial}(r_g, p) \tag{2}\]

We then model both the capture process and technical dropouts:

\[u_g^{(c)} \mid m_g^{(c)}, z_g^{(c)} \sim z_g^{(c)} \delta_0 + (1-z_g^{(c)}) \text{Binomial}(m_g^{(c)}, \nu^{(c)}) \tag{3}\]

where \(z_g^{(c)} \sim \text{Bernoulli}(\pi_g)\). Marginalizing over the unobserved mRNA counts \(m_g^{(c)}\) and dropout indicators \(z_g^{(c)}\), we get:

\[u_g^{(c)} \sim \pi_g \delta_0 + (1-\pi_g)\text{NegativeBinomial}(r_g, \hat{p}^{(c)}) \tag{4}\]

where \(\hat{p}^{(c)}\) is the effective probability defined in Eq. (1) and \(\delta_0\) is a point mass at zero.
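
The marginalization step can be checked numerically. The sketch below sums Eq. (3) over the latent mRNA count and the dropout indicator, then compares the result with the closed form in Eq. (4). It assumes the negative binomial convention \(P(k) = \binom{k + r - 1}{k} (1 - p)^r p^k\), which is the convention under which Eq. (1) follows from binomial thinning. The function names and the truncation point of the sum are illustrative choices.

```python
import math


def nb_pmf(k: int, r: float, p: float) -> float:
    """NegativeBinomial(r, p) pmf, ASSUMING P(k) = C(k + r - 1, k) (1 - p)^r p^k."""
    log_coef = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coef + r * math.log1p(-p) + k * math.log(p))


def binom_pmf(k: int, n: int, q: float) -> float:
    """Binomial(n, q) pmf, evaluated in log space to avoid overflow for large n."""
    log_coef = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_coef + k * math.log(q) + (n - k) * math.log1p(-q))


def marginal_by_summation(u, r, p, nu, pi, m_max=500):
    """Eq. (3) summed over m (truncated at m_max) and over the dropout indicator."""
    thinned = sum(nb_pmf(m, r, p) * binom_pmf(u, m, nu) for m in range(u, m_max))
    return pi * (1.0 if u == 0 else 0.0) + (1.0 - pi) * thinned


def marginal_closed_form(u, r, p, nu, pi):
    """Eq. (4): zero-inflated negative binomial with the effective probability."""
    p_hat = p * nu / (1.0 - p * (1.0 - nu))  # Eq. (1)
    return pi * (1.0 if u == 0 else 0.0) + (1.0 - pi) * nb_pmf(u, r, p_hat)


r, p, nu, pi = 2.5, 0.7, 0.3, 0.2
for u in range(4):
    print(u, marginal_by_summation(u, r, p, nu, pi), marginal_closed_form(u, r, p, nu, pi))
```

The two columns should agree up to floating-point and truncation error, which is a quick sanity check that thinning a negative binomial by \(\nu^{(c)}\) yields another negative binomial with probability \(\hat{p}^{(c)}\).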

Prior Distributions

The model uses the following prior distributions:

For the base success probability \(p\):

\[p \sim \text{Beta}(\alpha_p, \beta_p) \tag{5}\]

For each gene’s dispersion parameter \(r_g\):

\[r_g \sim \text{Gamma}(\alpha_r, \beta_r) \tag{6}\]

For each gene’s dropout probability \(\pi_g\):

\[\pi_g \sim \text{Beta}(\alpha_{\pi}, \beta_{\pi}) \tag{7}\]

For each cell’s capture probability \(\nu^{(c)}\):

\[\nu^{(c)} \sim \text{Beta}(\alpha_{\nu}, \beta_{\nu}) \tag{8}\]

Variational Posterior Distribution

The model uses stochastic variational inference with a mean-field variational family. The variational distributions are:

For the base success probability \(p\):

\[q(p) = \text{Beta}(\hat{\alpha}_p, \hat{\beta}_p) \tag{9}\]

For each gene’s dispersion parameter \(r_g\):

\[q(r_g) = \text{Gamma}(\hat{\alpha}_{r,g}, \hat{\beta}_{r,g}) \tag{10}\]

For each gene’s dropout probability \(\pi_g\):

\[q(\pi_g) = \text{Beta}(\hat{\alpha}_{\pi,g}, \hat{\beta}_{\pi,g}) \tag{11}\]

For each cell’s capture probability \(\nu^{(c)}\):

\[q(\nu^{(c)}) = \text{Beta}(\hat{\alpha}_{\nu}^{(c)}, \hat{\beta}_{\nu}^{(c)}) \tag{12}\]

where hatted parameters are learnable variational parameters.
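
To make the size of the mean-field family concrete, the sketch below stores the variational parameters of Eqs. (9)-(12) in a hypothetical container (not a SCRIBE class) and computes the posterior means they imply. The Gamma mean assumes that \(\hat{\beta}_{r,g}\) is a rate parameter.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pair = Tuple[float, float]  # (alpha_hat, beta_hat)


@dataclass
class MeanFieldParams:
    """Hypothetical container for the variational parameters of Eqs. (9)-(12)."""
    p: Pair          # Beta parameters for the global p
    r: List[Pair]    # Gamma parameters, one pair per gene
    pi: List[Pair]   # Beta parameters, one pair per gene
    nu: List[Pair]   # Beta parameters, one pair per cell

    def n_parameters(self) -> int:
        # 2 for p, 2 per gene for r, 2 per gene for pi, 2 per cell for nu
        return 2 + 2 * len(self.r) + 2 * len(self.pi) + 2 * len(self.nu)

    def posterior_means(self) -> dict:
        def beta_mean(ab: Pair) -> float:
            return ab[0] / (ab[0] + ab[1])

        def gamma_mean(ab: Pair) -> float:
            return ab[0] / ab[1]  # ASSUMES beta_hat is a rate parameter

        return {
            "p": beta_mean(self.p),
            "r": [gamma_mean(ab) for ab in self.r],
            "pi": [beta_mean(ab) for ab in self.pi],
            "nu": [beta_mean(ab) for ab in self.nu],
        }


params = MeanFieldParams(
    p=(2.0, 6.0),
    r=[(4.0, 2.0)] * 3,   # 3 genes
    pi=[(1.0, 3.0)] * 3,
    nu=[(5.0, 5.0)] * 2,  # 2 cells
)
print(params.n_parameters())
print(params.posterior_means())
```

For \(G\) genes and \(N\) cells this gives \(2 + 4G + 2N\) variational parameters, so the per-cell capture parameters dominate for large datasets.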

Learning Algorithm

The training process follows similar steps to the NBVCP and ZINB models:

Initialize variational parameters:
- \(\hat{\alpha}_p = \alpha_p\), \(\hat{\beta}_p = \beta_p\)
- \(\hat{\alpha}_{r,g} = \alpha_r\), \(\hat{\beta}_{r,g} = \beta_r\) for all genes \(g\)
- \(\hat{\alpha}_{\pi,g} = \alpha_{\pi}\), \(\hat{\beta}_{\pi,g} = \beta_{\pi}\) for all genes \(g\)
- \(\hat{\alpha}_{\nu}^{(c)} = \alpha_{\nu}\), \(\hat{\beta}_{\nu}^{(c)} = \beta_{\nu}\) for all cells \(c\)
For each iteration:
- Sample a mini-batch of cells
- Compute ELBO gradients
- Update parameters (using the Adam optimizer by default)
Continue until the maximum number of iterations is reached
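
The mini-batch step can be illustrated without a full inference engine. The sketch below shows only the subsampling and the rescaling of the likelihood term by \(N / B\), which keeps the mini-batch estimate unbiased for the full-data log-likelihood. It does not compute ELBO gradients or perform Adam updates. The function names, toy counts, and fixed parameter values are hypothetical; in actual training the parameters would be sampled from the variational distributions. The negative binomial convention \(P(k) = \binom{k + r - 1}{k} (1 - p)^r p^k\) is assumed.

```python
import math
import random


def zinb_logpmf(u, r, p_hat, pi):
    """Log of Eq. (4) for one count; ASSUMES NB pmf C(k + r - 1, k) (1 - p)^r p^k."""
    log_nb = (math.lgamma(u + r) - math.lgamma(u + 1) - math.lgamma(r)
              + r * math.log1p(-p_hat) + u * math.log(p_hat))
    if u == 0:
        # A zero can come from dropout or from the negative binomial itself.
        return math.log(pi + (1.0 - pi) * math.exp(log_nb))
    return math.log1p(-pi) + log_nb


def minibatch_loglik(counts, batch, p, r, pi, nu):
    """Log-likelihood of the cells in `batch`, rescaled by N / |batch|."""
    total = 0.0
    for c in batch:
        p_hat = p * nu[c] / (1.0 - p * (1.0 - nu[c]))  # Eq. (1)
        total += sum(zinb_logpmf(u, r[g], p_hat, pi[g])
                     for g, u in enumerate(counts[c]))
    return total * len(counts) / len(batch)


# Toy data: 4 cells by 3 genes, with fixed illustrative parameter values.
counts = [[0, 3, 1], [2, 0, 0], [0, 0, 5], [1, 4, 0]]
p, r, pi = 0.6, [1.5, 2.0, 0.8], [0.1, 0.2, 0.3]
nu = [0.9, 0.5, 0.7, 0.3]

rng = random.Random(0)
for step in range(3):
    batch = rng.sample(range(len(counts)), k=2)  # cells drawn without replacement
    print(step, batch, minibatch_loglik(counts, batch, p, r, pi, nu))
```

Because each cell carries its own \(\nu^{(c)}\), a mini-batch only touches the capture parameters of the cells it contains, while the global and gene-specific parameters are involved in every step.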

Implementation Details

The model is implemented using NumPyro with key features including:

Cell-specific parameter handling for capture probabilities
Gene-specific parameter handling for dropout probabilities
Effective probability computation through deterministic transformations
Zero-inflated distributions using NumPyro’s ZeroInflatedDistribution
Mini-batch support for scalable inference
GPU acceleration through JAX

Model Assumptions

The ZINBVCP model makes several key assumptions:

Zeros can arise from two processes:
- Technical dropouts (modeled by zero-inflation)
- Biological absence of expression (modeled by negative binomial)
Variation in total UMI counts partially reflects technical capture differences
Each cell has its own capture efficiency that affects all genes equally
Each gene has its own dropout probability
Genes are independent given the cell-specific capture probability
The base success probability represents true biological variation
Capture probabilities modify observed counts but not underlying biology
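
The two sources of zeros listed above can be separated using Eq. (4). The sketch below splits the probability of observing a zero into the dropout part and the negative binomial part. The function name is illustrative, and the zero probability \((1 - \hat{p}^{(c)})^{r_g}\) assumes the negative binomial convention with mean \(r p / (1 - p)\).

```python
def zero_probability(r: float, p: float, nu: float, pi: float) -> dict:
    """Split P(u = 0) under Eq. (4) into dropout and negative binomial parts."""
    p_hat = p * nu / (1.0 - p * (1.0 - nu))  # Eq. (1)
    nb_zero = (1.0 - p_hat) ** r             # ASSUMES NB pmf (1 - p)^r at k = 0
    return {
        "dropout": pi,
        "negative_binomial": (1.0 - pi) * nb_zero,
        "total": pi + (1.0 - pi) * nb_zero,
    }


for nu in (0.9, 0.5, 0.1):
    print(nu, zero_probability(r=2.0, p=0.6, nu=nu, pi=0.2))
```

Lowering the capture probability increases the negative binomial share of zeros while the dropout share stays fixed at \(\pi_g\), which is why the two mechanisms can be hard to distinguish from zero counts alone.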

Usage Considerations

The ZINBVCP model is particularly suitable when:

The data exhibits more zeros than a negative binomial alone predicts
Cells show high variability in total UMI counts
Both technical dropouts and capture efficiency variation are suspected
Standard library size normalization seems insufficient

It may be less suitable when:

The data is relatively clean with few technical artifacts
The zero-inflation or capture efficiency variation is minimal
The data contains multiple distinct cell populations (consider mixture models)

The model provides the most comprehensive treatment of technical artifacts among the non-mixture models in SCRIBE, accounting for both dropouts and capture efficiency variation. However, this flexibility comes at the cost of increased model complexity and computational demands.