.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/quickstart.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_quickstart.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_quickstart.py:


==================================
Quickstart with 10x Genomics Data
==================================

This tutorial will walk you through analyzing a real-world single-cell RNA-seq
dataset using SCRIBE. For this tutorial, we will use the Jurkat cells dataset
from 10x Genomics (available `here
<https://www.10xgenomics.com/datasets/jurkat-cells-1-standard-1-1-0>`_). To
ilustrate ``SCRIBE``'s flexibility, we will explore several models available,
starting from the basic Negative Binomial-Dirichlet Multinomial (NBDM) model
where all genes are negative binomially distributed with a shared success
probability :math:`p` and a gene-specific :math:`r_g` parameter. Later on, we
will account for the cell-to-cell variation in mRNA-to-UMI conversion efficiency
by explicitly modeling the mRNA capture efficiency. Finally, we will explore a
different parameterization of the model to help with convergence and
performance.

.. contents::
    :local:

.. GENERATED FROM PYTHON SOURCE LINES 25-29

Setup
-----

First, let's import the necessary libraries and set up our directories.

.. GENERATED FROM PYTHON SOURCE LINES 29-58

.. code-block:: Python


    import os
    from jax import random
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import scanpy as sc  # to load the data
    import scribe  # our main library
    import pickle  # to save the results

    # Define directories
    OUTPUT_DIR = "output"
    FIG_DIR = os.path.join(OUTPUT_DIR, "figures")

    # Create output directory
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    os.makedirs(FIG_DIR, exist_ok=True)

    # Simple cache directory. This directory is used to store the results of the
    # models for this tutorial to avoid re-running the models as we work through the
    # tutorial. YOU DO NOT NEED TO DO THIS IN YOUR OWN ANALYSIS.
    CACHE_DIR = "../../../CACHE"
    os.makedirs(CACHE_DIR, exist_ok=True)


    # Set plotting style and scanpy settings
    scribe.viz.matplotlib_style()
    sc.settings.verbosity = 1  # Reduce scanpy output verbosity for cleaner tutorial


.. GENERATED FROM PYTHON SOURCE LINES 59-71

Loading the Data
----------------

We will use the Jurkat cell dataset from 10x Genomics. This dataset consists
of ~3,200 human Jurkat cells, a T-lymphocyte cell line. For this tutorial,
we'll use the full transcriptome to demonstrate ``SCRIBE``'s ability to handle
large-scale single-cell datasets.

.. note::
  ``SCRIBE`` is designed to work with the **entire transcriptome** and
  can easily handle datasets with 20,000+ genes, as long as you have the right
  hardware. This is the recommended approach for comprehensive analysis.

.. GENERATED FROM PYTHON SOURCE LINES 71-96

.. code-block:: Python


    # Define path to the data
    data_path = os.path.abspath(
        os.path.join(
            os.path.dirname(scribe.__file__),
            # Here, put the path to the data.h5ad file in your computer.
            "../../data/10xGenomics/Jurkat_cells/data.h5ad",
        )
    )

    # Load the data using scanpy
    adata = sc.read_h5ad(data_path)

    # Extract raw counts for ``SCRIBE`` (``SCRIBE`` works with raw integer counts)
    counts = adata.X.toarray()

    # Define number of cells and genes
    n_cells = adata.n_obs
    n_genes = adata.n_vars

    print(f"Count matrix data type: {counts.dtype}")
    print(f"Total UMIs in dataset: {counts.sum():,.0f}")
    print(f"Mean UMIs per cell: {counts.sum(axis=1).mean():.1f}")
    print(f"Mean UMIs per gene: {counts.sum(axis=0).mean():.1f}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Count matrix data type: float32
    Total UMIs in dataset: 50,727,268
    Mean UMIs per cell: 15570.1
    Mean UMIs per gene: 1549.5


.. GENERATED FROM PYTHON SOURCE LINES 97-128

Model 1: Negative Binomial-Dirichlet Multinomial (NBDM)
-------------------------------------------------------

The core of ``SCRIBE`` is the Negative Binomial-Dirichlet Multinomial (NBDM)
model. This model is derived from first principles, where a two-state promoter
model can be shown to lead to a steady-state mRNA distribution that follows a
Negative Binomial (NB) distribution. The only extra-ingredient ``SCRIBE`` adds
is to assume that all genes share the same success probability parameter
:math:`p`. Biophysically, this is equivalent to assuming that all genes share
the same burst size.

A key insight from this model is that if all genes share the same success
probability parameter :math:`p` in their NB distributions, the joint
distribution of UMI counts for a cell can be factorized into two components:

1.  A Negative Binomial distribution for the total number of UMIs in the cell.
2.  A Dirichlet-Multinomial (DM) distribution for the relative proportions of
    gene counts.

This factorization allows ``SCRIBE`` to normalize gene expression levels as
the math reveals a natural scheme where the gene-specific :math:`r_g`
parameters can be used to compute the fraction of the transcriptome that each
gene occupies.

Let's fit the basic NBDM model to our data using Stochastic Variational
Inference (SVI). SVI is a type of variational inference that, although only
approximately correct, is very fast and scalable, perfect for exploring large
datasets quickly. For ``SCRIBE``, all we need to specify is the number of
steps to run the inference for (think of stochastic gradient descent for
optimization), the batch size (how many cells to process at once for each
optimization step), and a random seed for reproducibility.

.. GENERATED FROM PYTHON SOURCE LINES 128-157

.. code-block:: Python


    # Define inference parameters
    n_steps = 30_000
    batch_size = 512
    seed = 42

    # Simple caching for NBDM model. Again, you do not need to do this in your own
    # analysis if you don't want to; this is just not to re-run the model if we
    # already have the results.
    cache_file = os.path.join(
        CACHE_DIR, f"svi_quickstart_nbdm_standard_{n_genes}genes_{n_steps}steps.pkl"
    )
    if os.path.exists(cache_file):
        print(f"Loading NBDM results from {cache_file}")
        with open(cache_file, "rb") as f:
            results_nbdm = pickle.load(f)
    else:
        print("Running NBDM model...")
        results_nbdm = scribe.run_scribe(
            counts=counts,  # The count matrix
            n_steps=n_steps,  # The number of steps to run the inference for
            batch_size=batch_size,  # The batch size
            seed=seed,  # The random seed
        )
        # Save the results to a file.
        with open(cache_file, "wb") as f:
            pickle.dump(results_nbdm, f)
        print(f"Saved NBDM results to {cache_file}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Loading NBDM results from ../../../CACHE/svi_quickstart_nbdm_standard_32738genes_30000steps.pkl


.. GENERATED FROM PYTHON SOURCE LINES 158-172

Visualizing Results for NBDM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After fitting the model, we should always perform diagnostic checks to assess
the model fit. We'll look at the ELBO loss history first. What we are looking
for is that the loss is decreasing over time, and that it plateaus at some
point. Empirically, we have seen that anything between 25,000 and 50,000 steps
is usually enough to get a good fit.

.. note::
  Although SVI aims to maximize the ELBO, here we are
  looking for a decreasing loss function, this is because the actual loss
  function implemented in SVI is the negative ELBO, also known as the
  Variational Free Energy.

.. GENERATED FROM PYTHON SOURCE LINES 172-182

.. code-block:: Python


    # Plot ELBO loss history
    fig, ax = plt.subplots(figsize=(3.5, 3))
    ax.plot(results_nbdm.loss_history)
    ax.set_xlabel("step")
    ax.set_ylabel("ELBO Loss")
    ax.set_title("NBDM ELBO Loss")
    plt.tight_layout()
    plt.show()


.. image-sg:: /auto_examples/images/sphx_glr_quickstart_001.png
   :alt: NBDM ELBO Loss
   :srcset: /auto_examples/images/sphx_glr_quickstart_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 183-237

This looks good. The loss is decreasing over time, and it plateaus at some
point.

After confirming that the optimization is converging, one of the most
important steps in any Bayesian model (and one could argue in any statistical
model) is to perform posterior predictive checks (PPCs).

PPCs are a way to check if the model is able to reproduce the key features of
the observed data. The logic is straightforward:

1. **Generate Synthetic Data**: Use the fitted model to generate new datasets
   that should resemble the original data if the model is appropriate. This
   means that we take a sample of the posterior parameters and run them
   through the likelihood function to simulate a new dataset. We repeat this
   process multiple times to get a distribution of the predicted data.

2. **Compare Distributions**: Plot the distribution of observed counts
   alongside the distribution of model-predicted counts for the same genes.
   This is done by plotting the distribution of the observed counts alongside
   the distribution of the predicted counts for the same genes. Usually, we
   can plot quantiles of the predicted data to get a sense of the
   distribution.

3. **Assess Model Fit**: If the model is good, the observed data should fall
   within the credible intervals of the predicted data most of the time.

.. important::
   This is not a common practice in the field of single-cell RNA-seq analysis,
   but we argue it should be! A simple visual inspection of the quality of the
   fit is vital to understand how well the model is able to capture the data.

Strategic Gene Selection for Posterior Predictive Checks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before generating PPC samples, we need to select a subset of genes to analyze.
This is because generating PPC samples is a computationally and memory
intensive process. Usually, we would like to generate between 100 and 1000
samples; each sample being a full count matrix with the same number of cells
and genes as the original data. You can see how this can get out of hand very
quickly. Fortunately, ``SCRIBE``'s results objects are **indexable**, allowing
us to subset the results to only the genes we want to visualize.

Rather than randomly selecting genes for visualization, we'll use a strategic
approach that ensures we capture genes across different expression levels.
This is done by calculating the median expression for each gene across all
cells in the dataset, filtering out completely unexpressed genes (median = 0),
sorting genes by their median expression, and selecting evenly spaced genes
across the expression spectrum. However, feel free to use whatever method you
prefer.

.. note::
  We filter genes with median expression equal to 0 for visualization
  purposes. However, these genes were taken into account for the model fit.
  This is the power of the full Bayesian framework ``SCRIBE`` offers.

.. GENERATED FROM PYTHON SOURCE LINES 237-300

.. code-block:: Python


    def select_genes_for_visualization(counts, n_genes=25):
        """
        Select a representative subset of genes for visualization based on their
        expression levels.

        This function implements a stratified sampling approach that:
            1. Calculates median expression for each gene across all cells
            2. Filters out completely unexpressed genes (median = 0)
            3. Sorts genes by their median expression
            4. Selects evenly spaced genes across the expression spectrum

        Parameters
        ----------
        counts : array-like, shape (n_cells, n_genes)
            The count matrix where rows are cells and columns are genes.
        n_genes : int, default=25
            Number of genes to select for visualization.

        Returns
        -------
        selected_idx : array
            Indices of selected genes, sorted by expression level.
        median_expr : array
            Median expression values for all genes.
        """
        # Calculate median expression across cells for each gene
        # We use median instead of mean because it's more robust to outliers
        # and better represents the typical expression level
        median_expr = np.median(counts, axis=0)

        # Find genes that are expressed in at least some cells
        # Genes with median = 0 are likely unexpressed or very lowly expressed
        expressed_idx = np.where(median_expr > 0)[0]

        # Sort expressed genes by their median expression level
        sorted_idx = expressed_idx[np.argsort(median_expr[expressed_idx])]

        # Select evenly spaced genes across the expression spectrum
        # This ensures we sample from low, medium, and high expression ranges
        spaced_indices = np.linspace(0, len(sorted_idx) - 1, num=n_genes, dtype=int)
        selected_idx = sorted_idx[spaced_indices]

        return selected_idx, median_expr


    # Select genes using our strategic approach
    n_genes_to_plot = 25
    selected_idx, median_expr = select_genes_for_visualization(
        counts, n_genes=n_genes_to_plot
    )

    # Sort selected indices - this is crucial for proper indexing of results!
    # This ensures correspondence between subset results and original gene indices
    selected_idx = np.sort(selected_idx)

    print(f"Selected {len(selected_idx)} genes for PPC analysis")
    print(
        f"Expression range: {median_expr[selected_idx].min():.2f} - "
        f"{median_expr[selected_idx].max():.2f}"
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Selected 25 genes for PPC analysis
    Expression range: 0.50 - 167.00


.. GENERATED FROM PYTHON SOURCE LINES 301-304

We are now ready to generate the PPC samples. We will generate 500 samples,
i.e., 500 simulated datasets with the same number of cells :math:`\times` the
number of genes we selected.

.. GENERATED FROM PYTHON SOURCE LINES 304-317

.. code-block:: Python


    n_samples = (
        500  # Can use more samples now since we're only doing selected genes
    )

    # Subset results to selected genes before generating samples (major memory
    # savings!)
    results_nbdm_subset = results_nbdm[selected_idx]
    # Generate the PPC samples
    ppc_nbdm = results_nbdm_subset.get_ppc_samples(
        n_samples=n_samples, rng_key=random.PRNGKey(seed)
    )


.. GENERATED FROM PYTHON SOURCE LINES 318-321

Now, since we will generate a plot for each selected gene, we need to
calculate the optimal number of rows and columns for the plot grid. Let's
define a simple function to do this.

.. GENERATED FROM PYTHON SOURCE LINES 321-361

.. code-block:: Python


    def calculate_subplot_grid(n_plots):
        """
        Calculate optimal subplot grid dimensions for a given number of plots.

        Tries to create a square or near-square grid that accommodates all plots.

        Parameters
        ----------
        n_plots : int
            Number of subplots needed

        Returns
        -------
        nrows, ncols : int, int
            Number of rows and columns for the subplot grid
        """
        import math

        # For perfect squares, use square grid
        sqrt_n = int(math.sqrt(n_plots))
        if sqrt_n * sqrt_n == n_plots:
            return sqrt_n, sqrt_n

        # For non-perfect squares, find the closest rectangular grid
        # that minimizes empty subplots
        for cols in range(sqrt_n, n_plots + 1):
            rows = math.ceil(n_plots / cols)
            if (
                rows * cols >= n_plots and abs(rows - cols) <= 2
            ):  # Prefer near-square
                return rows, cols

        # Fallback: use ceiling of square root
        rows = math.ceil(math.sqrt(n_plots))
        cols = math.ceil(n_plots / rows)
        return rows, cols


.. GENERATED FROM PYTHON SOURCE LINES 362-376

Posterior Predictive Check Visualization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We are ready to plot the PPCs. We will plot the credible regions for each gene
as colored bands, and the observed data as a black line. We will also plot the
median expression for each gene. `SCRIBE` provides several useful functions to
generate these plots.

.. admonition:: Interpreting the Plots
  - The colored regions show credible intervals (95%, 68%, 50%) of the model's
  predictions across multiple posterior samples.
  -  The black line shows the actual observed data distribution.
  - Good fit: observed data falls mostly within the credible regions.
  - Poor fit: observed data consistently falls outside credible regions.

.. GENERATED FROM PYTHON SOURCE LINES 376-452

.. code-block:: Python


    # Plot PPCs for selected genes with dynamic grid layout
    n_genes_to_plot = len(selected_idx)
    nrows, ncols = calculate_subplot_grid(n_genes_to_plot)
    fig_width = max(12, ncols * 3)  # Minimum 12 inches, scale with columns
    fig_height = max(8, nrows * 2.5)  # Minimum 8 inches, scale with rows

    # Print the number of rows and columns for the plot grid
    print(f"Creating {nrows}x{ncols} grid for {n_genes_to_plot} genes")

    # Initialize the plot
    fig, axes = plt.subplots(nrows, ncols, figsize=(fig_width, fig_height))
    # Flatten the axes if we have more than one gene to plot
    axes = axes.flatten() if n_genes_to_plot > 1 else [axes]

    # Add a title to the plot
    fig.suptitle(
        f"NBDM Posterior Predictive Checks ({n_genes_to_plot} genes)",
        y=0.98,
        fontsize=16,
    )

    # Loop over the genes to plot
    for i in range(n_genes_to_plot):
        ax = axes[i]
        gene_idx = selected_idx[i]
        gene_median_expr = median_expr[gene_idx]

        print(
            f"Generating PPC for gene {gene_idx} "
            f"(median expression: {gene_median_expr:.2f})"
        )

        # Compute credible regions for this gene's predicted counts
        # Note: gene_idx is now the index within selected genes, not the original
        # dataset
        credible_regions = scribe.stats.compute_histogram_credible_regions(
            ppc_nbdm["predictive_samples"][
                :, :, i
            ],  # Use i (position in selected genes)
            credible_regions=[95, 68, 50],
        )

        # Plot the credible regions as colored bands
        scribe.viz.plot_histogram_credible_regions_stairs(
            ax, credible_regions, cmap="Blues", alpha=0.5
        )

        # Calculate and plot the observed data histogram
        bin_edges = credible_regions["bin_edges"]
        hist, _ = np.histogram(counts[:, gene_idx], bins=bin_edges, density=True)
        ax.stairs(
            hist, bin_edges, color="black", alpha=0.8, linewidth=2, label="Observed"
        )

        # Enhanced axis labels with expression information
        ax.set_xlabel(f"UMI counts")
        ax.set_ylabel("density")
        ax.set_title(
            f"{adata.var.index.values[gene_idx]}\n"
            f"(Median expr: {gene_median_expr:.1f})",
            fontsize=10,
        )
        ax.legend(fontsize=8)

        # Improve readability for low-count genes
        if gene_median_expr < 1.0:
            ax.set_xlim(0, max(10, np.percentile(counts[:, gene_idx], 95)))

    # Hide empty subplots if we have more subplot positions than genes
    for j in range(n_genes_to_plot, len(axes)):
        axes[j].axis("off")

    plt.tight_layout()
    plt.show()


.. image-sg:: /auto_examples/images/sphx_glr_quickstart_002.png
   :alt: NBDM Posterior Predictive Checks (25 genes), UBE2T (Median expr: 3.0), CENPA (Median expr: 1.0), SNRPG (Median expr: 7.0), UBE2E1 (Median expr: 1.0), MED10 (Median expr: 1.0), RNF145 (Median expr: 1.0), SNRPC (Median expr: 4.0), HSP90AB1 (Median expr: 24.0), DDX56 (Median expr: 1.0), BAG1 (Median expr: 2.0), DPP7 (Median expr: 1.0), HSPA14 (Median expr: 1.0), YIF1A (Median expr: 2.0), AIP (Median expr: 1.0), MRPL51 (Median expr: 11.0), TSFM (Median expr: 1.0), UBE2N (Median expr: 2.0), EBPL (Median expr: 3.0), CALM1 (Median expr: 5.0), RPS2 (Median expr: 167.0), CMC2 (Median expr: 1.0), CRLS1 (Median expr: 1.0), MRGBP (Median expr: 0.5), POLR2E (Median expr: 2.0), BAX (Median expr: 1.0)
   :srcset: /auto_examples/images/sphx_glr_quickstart_002.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Creating 5x5 grid for 25 genes
    Generating PPC for gene 2629 (median expression: 3.00)
    Generating PPC for gene 3488 (median expression: 1.00)
    Generating PPC for gene 3902 (median expression: 7.00)
    Generating PPC for gene 5592 (median expression: 1.00)
    Generating PPC for gene 8585 (median expression: 1.00)
    Generating PPC for gene 9886 (median expression: 1.00)
    Generating PPC for gene 10744 (median expression: 4.00)
    Generating PPC for gene 10916 (median expression: 24.00)
    Generating PPC for gene 12225 (median expression: 1.00)
    Generating PPC for gene 15825 (median expression: 2.00)
    Generating PPC for gene 16829 (median expression: 1.00)
    Generating PPC for gene 17017 (median expression: 1.00)
    Generating PPC for gene 19203 (median expression: 2.00)
    Generating PPC for gene 19259 (median expression: 1.00)
    Generating PPC for gene 20152 (median expression: 11.00)
    Generating PPC for gene 20954 (median expression: 1.00)
    Generating PPC for gene 21229 (median expression: 2.00)
    Generating PPC for gene 22057 (median expression: 3.00)
    Generating PPC for gene 23206 (median expression: 5.00)
    Generating PPC for gene 24673 (median expression: 167.00)
    Generating PPC for gene 25837 (median expression: 1.00)
    Generating PPC for gene 28675 (median expression: 1.00)
    Generating PPC for gene 29321 (median expression: 0.50)
    Generating PPC for gene 29443 (median expression: 2.00)
    Generating PPC for gene 30931 (median expression: 1.00)


.. GENERATED FROM PYTHON SOURCE LINES 453-466

We can see that the fit is already pretty good. Most of the observed counts
fall within the colored regions, meaning that the inferred parameters are able
to capture the observed data. However, we can improve the fit by using a more
sophisticated model.

Model 2: NBDM with Variable Capture Probability (NBVCP)
-------------------------------------------------------

The basic NBDM model assumes that the capture efficiency—the probability of an
mRNA molecule being captured and sequenced as a UMI—is constant for all cells.
However, this is often not true in practice due to technical variations. To
demonstrate this, let's plot the distribution of total UMI counts for each
cell.

.. GENERATED FROM PYTHON SOURCE LINES 467-477

.. code-block:: Python


    # Initialize figure
    fig, ax = plt.subplots(figsize=(3.5, 3))
    sns.histplot(counts.sum(axis=1), bins=100, ax=ax)
    ax.set_xlabel("total UMI counts")
    ax.set_ylabel("count")
    ax.set_title("Distribution of Total UMI Counts")
    plt.tight_layout()
    plt.show()


.. image-sg:: /auto_examples/images/sphx_glr_quickstart_003.png
   :alt: Distribution of Total UMI Counts
   :srcset: /auto_examples/images/sphx_glr_quickstart_003.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 478-511

These cells were already filtered for quality and still we have cells with
:math:`\approx` 5,000 UMIs vs cells with :math:`\approx` 40,000 UMIs. This is
unlikely to be due to biological differences, but rather to technical
differences in the capture efficiency.

``SCRIBE`` can account for this by using a Variable Capture Probability (VCP)
model. As mentioned earlier, in the derivation of the NBDM model, we assumed
all mRNA counts to be negative-binomially distributed. However, what we
observe in an experiment are not the mRNA counts, but the UMIs. The simplest
way to account for this conversion is to assume that each mRNA molecule in
cell :math:`c` has a probability :math:`\nu^{(c)}` of being captured and
sequenced as a UMI. We can show that after accounting for this effect, the
resulting UMI counts distribution is also negative-binomially distributed.
However, the parameter :math:`p` is modified in a non-linear was as

.. math:: \hat{p}^{(c)} = \frac{p}{\nu^{(c)} + p (1 - \nu^{(c)})}.

In this model, the capture efficiency :math:`\nu^{(c)}` is a cell-specific
latent variable that is inferred from the data. This allows the model to
distinguish between biological zeros (genes not expressed) and technical zeros
(genes expressed but not detected).

.. important::
  Note that this is very different from the conventional approach of assuming
  a zero-inflated negative binomial distribution (which ``SCRIBE`` also
  supports). Zero-inflation in that sense is a per-gene parameter, while our
  approach is a per-cell parameter. Although it might be possible that there
  is a per-gene capture efficiency across all cells, we find it much more
  likely that the tehcnical variation comes from processing each cell
  separately.

Let's fit the NBVCP model. The beauty of the API is that we can fit this
variant by simply passing the ``variable_capture=True`` flag.

.. GENERATED FROM PYTHON SOURCE LINES 512-535

.. code-block:: Python


    # Simple caching for NBVCP model
    cache_file = os.path.join(
        CACHE_DIR,
        f"svi_quickstart_nbvcp_standard_{n_genes}genes_{n_steps}steps.pkl",
    )
    if os.path.exists(cache_file):
        print(f"Loading NBVCP results from {cache_file}")
        with open(cache_file, "rb") as f:
            results_vcp = pickle.load(f)
    else:
        print("Running NBVCP model...")
        results_vcp = scribe.run_scribe(
            counts=counts,
            variable_capture=True,  # this is the only difference from the NBDM model
            n_steps=n_steps,
            batch_size=batch_size,
            seed=seed,
        )
        with open(cache_file, "wb") as f:
            pickle.dump(results_vcp, f)
        print(f"Saved NBVCP results to {cache_file}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Loading NBVCP results from ../../../CACHE/svi_quickstart_nbvcp_standard_32738genes_30000steps.pkl


.. GENERATED FROM PYTHON SOURCE LINES 536-540

Visualizing Results for NBVCP
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now, let's look at the diagnostics for the NBVCP model.

.. GENERATED FROM PYTHON SOURCE LINES 540-550

.. code-block:: Python


    # Plot ELBO loss history
    fig, ax = plt.subplots(figsize=(3.5, 3))
    ax.plot(results_vcp.loss_history)
    ax.set_xlabel("Step")
    ax.set_ylabel("ELBO Loss")
    ax.set_title("NBVCP ELBO Loss")
    plt.tight_layout()
    plt.show()


.. image-sg:: /auto_examples/images/sphx_glr_quickstart_004.png
   :alt: NBVCP ELBO Loss
   :srcset: /auto_examples/images/sphx_glr_quickstart_004.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 551-554

As before, we can visualize the posterior predictive checks for the NBVCP
model. The unified API makes it straightforward to do so by following the same
steps as before.

.. GENERATED FROM PYTHON SOURCE LINES 554-622

.. code-block:: Python


    # Generate PPC samples for the NBVCP model (using same gene subset)
    results_vcp_subset = results_vcp[selected_idx]
    ppc_vcp = results_vcp_subset.get_ppc_samples(
        n_samples=n_samples, rng_key=random.PRNGKey(seed)
    )

    # Plot PPCs using the same strategically selected genes for fair comparison
    fig, axes = plt.subplots(nrows, ncols, figsize=(fig_width, fig_height))
    axes = axes.flatten() if n_genes_to_plot > 1 else [axes]
    fig.suptitle(
        f"NBVCP Posterior Predictive Checks ({n_genes_to_plot} genes)\n"
        f"(Same Gene Selection for Comparison)",
        y=0.98,
        fontsize=16,
    )

    for i in range(n_genes_to_plot):
        ax = axes[i]
        gene_idx = selected_idx[i]
        gene_median_expr = median_expr[gene_idx]

        print(
            f"Generating NBVCP PPC for gene {gene_idx} "
            f"(median expression: {gene_median_expr:.2f})"
        )

        # Compute credible regions for this gene's predicted counts
        credible_regions = scribe.stats.compute_histogram_credible_regions(
            ppc_vcp["predictive_samples"][
                :, :, i
            ],  # Use i (position in selected genes)
            credible_regions=[95, 68, 50],
        )

        # Plot the credible regions using green colormap to distinguish from NBDM
        scribe.viz.plot_histogram_credible_regions_stairs(
            ax, credible_regions, cmap="Greens", alpha=0.5
        )

        # Calculate and plot the observed data histogram
        bin_edges = credible_regions["bin_edges"]
        hist, _ = np.histogram(counts[:, gene_idx], bins=bin_edges, density=True)
        ax.stairs(
            hist, bin_edges, color="black", alpha=0.8, linewidth=2, label="Observed"
        )

        # Enhanced axis labels with expression information
        ax.set_xlabel(f"UMI counts")
        ax.set_ylabel("density")
        ax.set_title(
            f"{adata.var.index.values[gene_idx]}\n"
            f"(Median expr: {gene_median_expr:.1f})",
            fontsize=10,
        )
        ax.legend(fontsize=8)

        # Improve readability for low-count genes
        if gene_median_expr < 1.0:
            ax.set_xlim(0, max(10, np.percentile(counts[:, gene_idx], 95)))

    # Hide empty subplots if we have more subplot positions than genes
    for j in range(n_genes_to_plot, len(axes)):
        axes[j].axis("off")

    plt.tight_layout()
    plt.show()


.. image-sg:: /auto_examples/images/sphx_glr_quickstart_005.png
   :alt: NBVCP Posterior Predictive Checks (25 genes) (Same Gene Selection for Comparison), UBE2T (Median expr: 3.0), CENPA (Median expr: 1.0), SNRPG (Median expr: 7.0), UBE2E1 (Median expr: 1.0), MED10 (Median expr: 1.0), RNF145 (Median expr: 1.0), SNRPC (Median expr: 4.0), HSP90AB1 (Median expr: 24.0), DDX56 (Median expr: 1.0), BAG1 (Median expr: 2.0), DPP7 (Median expr: 1.0), HSPA14 (Median expr: 1.0), YIF1A (Median expr: 2.0), AIP (Median expr: 1.0), MRPL51 (Median expr: 11.0), TSFM (Median expr: 1.0), UBE2N (Median expr: 2.0), EBPL (Median expr: 3.0), CALM1 (Median expr: 5.0), RPS2 (Median expr: 167.0), CMC2 (Median expr: 1.0), CRLS1 (Median expr: 1.0), MRGBP (Median expr: 0.5), POLR2E (Median expr: 2.0), BAX (Median expr: 1.0)
   :srcset: /auto_examples/images/sphx_glr_quickstart_005.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Generating NBVCP PPC for gene 2629 (median expression: 3.00)
    Generating NBVCP PPC for gene 3488 (median expression: 1.00)
    Generating NBVCP PPC for gene 3902 (median expression: 7.00)
    Generating NBVCP PPC for gene 5592 (median expression: 1.00)
    Generating NBVCP PPC for gene 8585 (median expression: 1.00)
    Generating NBVCP PPC for gene 9886 (median expression: 1.00)
    Generating NBVCP PPC for gene 10744 (median expression: 4.00)
    Generating NBVCP PPC for gene 10916 (median expression: 24.00)
    Generating NBVCP PPC for gene 12225 (median expression: 1.00)
    Generating NBVCP PPC for gene 15825 (median expression: 2.00)
    Generating NBVCP PPC for gene 16829 (median expression: 1.00)
    Generating NBVCP PPC for gene 17017 (median expression: 1.00)
    Generating NBVCP PPC for gene 19203 (median expression: 2.00)
    Generating NBVCP PPC for gene 19259 (median expression: 1.00)
    Generating NBVCP PPC for gene 20152 (median expression: 11.00)
    Generating NBVCP PPC for gene 20954 (median expression: 1.00)
    Generating NBVCP PPC for gene 21229 (median expression: 2.00)
    Generating NBVCP PPC for gene 22057 (median expression: 3.00)
    Generating NBVCP PPC for gene 23206 (median expression: 5.00)
    Generating NBVCP PPC for gene 24673 (median expression: 167.00)
    Generating NBVCP PPC for gene 25837 (median expression: 1.00)
    Generating NBVCP PPC for gene 28675 (median expression: 1.00)
    Generating NBVCP PPC for gene 29321 (median expression: 0.50)
    Generating NBVCP PPC for gene 29443 (median expression: 2.00)
    Generating NBVCP PPC for gene 30931 (median expression: 1.00)


.. GENERATED FROM PYTHON SOURCE LINES 623-668

For this particular dataset, the difference between the NBDM and NBVCP models
is not very pronounced. However, in general, we recommend using the NBVCP
model when there is evidence of technical variation in the capture efficiency.

Moreover, even though the data falls within the colored regions, there also
seems to be "unnecessary" uncertainty. Take for example the HSP90AB1 gene
(second row, third column). Even though the data falls within the colored
regions, these regions seem to be unnecessarily wide. The reason for this is
subtle but important: When performing SVI, we use what is called a
"mean-field" approximaiton. This means that we fit each parameter
independently, ignoring all possible correlations between parameters. This
allows us to use the power of modern gradient-based optimizers to find the
optimal parameters. However, this also bring an issue when working with the
negative binomial distribution, where the :math:`r` and :math:`p` parameters
trade-off with each other---meaning that increasing one parameter while
decreasing the other results in a distribution with a very similar shape.

To address this issue, ``SCRIBE`` comes with two other parameterizations of
the same base NBDM model:

- the ``linked`` parameterization, in which we fit the same :math:`p`
  parameter, but instead of fitting the :math:`r` parameter, we fit the
  distribution mean :math:`\mu` parameter. We then can recover the :math:`r`
  parameter as :math:`r = \mu * (1 - p) / p`.
- the ``odds-ratio`` parameterization, in which we fit the so-called
  odds-ratio :math:`\phi = (1 - p) / p` parameter, and the :math:`\mu`
  parameter. We then can recover the :math:`r` parameter as :math:`r = \mu *
  \phi`.

Changing between parameterizations is as simple as passing the appropriate
``parameterization`` flag to the ``run_scribe`` function. Let's fit the NBVCP
model with the odds-ratio parameterization.

Model 3: Odds-Ratio Parameterization
------------------------------------

This reparameterization can lead to more stable and efficient inference,
especially when the success probability `p` is very close to 0 or 1. Moreover,
because of the mean-field approximation done for stochastic variational
inference, this parameterization captures the known correlation between the
`r` and `p` parameters by fitting the mean `mu` instead of the dispersion `r`.
This results in much better posterior predictive checks.

Let's fit the NBVCP model with the odds-ratio parameterization. Again, we can
do this by simply passing the `parameterization="odds_ratio"` flag.

.. GENERATED FROM PYTHON SOURCE LINES 669-693

.. code-block:: Python


    # Simple caching for Odds-Ratio NBVCP model
    cache_file = os.path.join(
        CACHE_DIR,
        f"svi_quickstart_nbvcp_odds-ratio_{n_genes}genes_{n_steps}steps.pkl",
    )
    if os.path.exists(cache_file):
        print(f"Loading Odds-Ratio NBVCP results from {cache_file}")
        with open(cache_file, "rb") as f:
            results_or = pickle.load(f)
    else:
        print("Running Odds-Ratio NBVCP model...")
        results_or = scribe.run_scribe(
            counts=counts,
            variable_capture=True,
            parameterization="odds_ratio",
            n_steps=n_steps,
            batch_size=batch_size,
            seed=seed,
        )
        with open(cache_file, "wb") as f:
            pickle.dump(results_or, f)
        print(f"Saved Odds-Ratio NBVCP results to {cache_file}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Loading Odds-Ratio NBVCP results from ../../../CACHE/svi_quickstart_nbvcp_odds-ratio_32738genes_30000steps.pkl


.. GENERATED FROM PYTHON SOURCE LINES 694-698

Visualizing Results for Odds-Ratio Model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Finally, let's check the diagnostics for the odds-ratio model.

.. GENERATED FROM PYTHON SOURCE LINES 698-708

.. code-block:: Python


    # Plot ELBO loss history
    fig, ax = plt.subplots(figsize=(3.5, 3))
    ax.plot(results_or.loss_history)
    ax.set_xlabel("Step")
    ax.set_ylabel("ELBO Loss")
    ax.set_title("Odds-Ratio NBVCP ELBO Loss")
    plt.tight_layout()
    plt.show()


.. image-sg:: /auto_examples/images/sphx_glr_quickstart_006.png
   :alt: Odds-Ratio NBVCP ELBO Loss
   :srcset: /auto_examples/images/sphx_glr_quickstart_006.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 709-710

And the PPCs.

.. GENERATED FROM PYTHON SOURCE LINES 710-779

.. code-block:: Python


    # Generate PPC samples for the odds-ratio model (using same gene subset)
    results_or_subset = results_or[selected_idx]
    ppc_or = results_or_subset.get_ppc_samples(
        n_samples=n_samples, rng_key=random.PRNGKey(seed)
    )

    # Plot PPCs using the same strategically selected genes for consistent comparison
    fig, axes = plt.subplots(nrows, ncols, figsize=(fig_width, fig_height))
    axes = axes.flatten() if n_genes_to_plot > 1 else [axes]
    fig.suptitle(
        f"Odds-Ratio NBVCP Posterior Predictive Checks ({n_genes_to_plot} genes)\n"
        f"(Consistent Gene Selection)",
        y=0.98,
        fontsize=16,
    )

    for i in range(n_genes_to_plot):
        ax = axes[i]
        gene_idx = selected_idx[i]
        gene_median_expr = median_expr[gene_idx]

        print(
            f"Generating Odds-Ratio PPC for gene {gene_idx} "
            f"(median expression: {gene_median_expr:.2f})"
        )

        # Compute credible regions for this gene's predicted counts
        credible_regions = scribe.stats.compute_histogram_credible_regions(
            ppc_or["predictive_samples"][
                :, :, i
            ],  # Use i (position in selected genes)
            credible_regions=[95, 68, 50],
        )

        # Plot the credible regions using orange colormap to distinguish from other
        # models
        scribe.viz.plot_histogram_credible_regions_stairs(
            ax, credible_regions, cmap="Oranges", alpha=0.5
        )

        # Calculate and plot the observed data histogram
        bin_edges = credible_regions["bin_edges"]
        hist, _ = np.histogram(counts[:, gene_idx], bins=bin_edges, density=True)
        ax.stairs(
            hist, bin_edges, color="black", alpha=0.8, linewidth=2, label="Observed"
        )

        # Enhanced axis labels with expression information
        ax.set_xlabel(f"UMI counts")
        ax.set_ylabel("density")
        ax.set_title(
            f"{adata.var.index.values[gene_idx]}\n"
            f"(Median expr: {gene_median_expr:.1f})",
            fontsize=10,
        )
        ax.legend(fontsize=8)

        # Improve readability for low-count genes
        if gene_median_expr < 1.0:
            ax.set_xlim(0, max(10, np.percentile(counts[:, gene_idx], 95)))

    # Hide empty subplots if we have more subplot positions than genes
    for j in range(n_genes_to_plot, len(axes)):
        axes[j].axis("off")

    plt.tight_layout()
    plt.show()


.. image-sg:: /auto_examples/images/sphx_glr_quickstart_007.png
   :alt: Odds-Ratio NBVCP Posterior Predictive Checks (25 genes) (Consistent Gene Selection), UBE2T (Median expr: 3.0), CENPA (Median expr: 1.0), SNRPG (Median expr: 7.0), UBE2E1 (Median expr: 1.0), MED10 (Median expr: 1.0), RNF145 (Median expr: 1.0), SNRPC (Median expr: 4.0), HSP90AB1 (Median expr: 24.0), DDX56 (Median expr: 1.0), BAG1 (Median expr: 2.0), DPP7 (Median expr: 1.0), HSPA14 (Median expr: 1.0), YIF1A (Median expr: 2.0), AIP (Median expr: 1.0), MRPL51 (Median expr: 11.0), TSFM (Median expr: 1.0), UBE2N (Median expr: 2.0), EBPL (Median expr: 3.0), CALM1 (Median expr: 5.0), RPS2 (Median expr: 167.0), CMC2 (Median expr: 1.0), CRLS1 (Median expr: 1.0), MRGBP (Median expr: 0.5), POLR2E (Median expr: 2.0), BAX (Median expr: 1.0)
   :srcset: /auto_examples/images/sphx_glr_quickstart_007.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Generating Odds-Ratio PPC for gene 2629 (median expression: 3.00)
    Generating Odds-Ratio PPC for gene 3488 (median expression: 1.00)
    Generating Odds-Ratio PPC for gene 3902 (median expression: 7.00)
    Generating Odds-Ratio PPC for gene 5592 (median expression: 1.00)
    Generating Odds-Ratio PPC for gene 8585 (median expression: 1.00)
    Generating Odds-Ratio PPC for gene 9886 (median expression: 1.00)
    Generating Odds-Ratio PPC for gene 10744 (median expression: 4.00)
    Generating Odds-Ratio PPC for gene 10916 (median expression: 24.00)
    Generating Odds-Ratio PPC for gene 12225 (median expression: 1.00)
    Generating Odds-Ratio PPC for gene 15825 (median expression: 2.00)
    Generating Odds-Ratio PPC for gene 16829 (median expression: 1.00)
    Generating Odds-Ratio PPC for gene 17017 (median expression: 1.00)
    Generating Odds-Ratio PPC for gene 19203 (median expression: 2.00)
    Generating Odds-Ratio PPC for gene 19259 (median expression: 1.00)
    Generating Odds-Ratio PPC for gene 20152 (median expression: 11.00)
    Generating Odds-Ratio PPC for gene 20954 (median expression: 1.00)
    Generating Odds-Ratio PPC for gene 21229 (median expression: 2.00)
    Generating Odds-Ratio PPC for gene 22057 (median expression: 3.00)
    Generating Odds-Ratio PPC for gene 23206 (median expression: 5.00)
    Generating Odds-Ratio PPC for gene 24673 (median expression: 167.00)
    Generating Odds-Ratio PPC for gene 25837 (median expression: 1.00)
    Generating Odds-Ratio PPC for gene 28675 (median expression: 1.00)
    Generating Odds-Ratio PPC for gene 29321 (median expression: 0.50)
    Generating Odds-Ratio PPC for gene 29443 (median expression: 2.00)
    Generating Odds-Ratio PPC for gene 30931 (median expression: 1.00)


.. GENERATED FROM PYTHON SOURCE LINES 780-800

As expected, the posterior predictive checks for this parameterization look
much tighter compared to the previous two models. Looking at our HSP90AB1 gene
(second row, third column), we can see that the posterior predictive checks
are much tighter compared to the previous two models.

Capturing the correlation between the :math:`r` and :math:`p` parameters by
fitting the mean and the odds-ratio parameter allows the simulated datasets to
better match the observed data.

Conclusion
----------

In this tutorial, we demonstrated how to use ``SCRIBE`` to fit the NBDM,
NBVCP, and odds-ratio models to a single-cell RNA-seq dataset. We also
visualized the results of the models and compared the posterior predictive
checks. We saw that the odds-ratio parameterization leads to much tighter
posterior predictive checks compared to the other two models. ``SCRIBE`` is a
versatile tool that has more models and parameterizations to explore. We
invite you to explore the rest of the documentation to learn more about how
can ``SCRIBE`` help you analyze your own data.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 43.271 seconds)


.. _sphx_glr_download_auto_examples_quickstart.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: quickstart.ipynb <quickstart.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: quickstart.py <quickstart.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: quickstart.zip <quickstart.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_