5050 jurkat 293t

import%20marimo%0A%0A__generated_with%20%3D%20%220.23.1%22%0Aapp%20%3D%20marimo.App()%0A%0A%0A%40app.cell%0Adef%20_()%3A%0A%20%20%20%20import%20marimo%20as%20mo%0A%0A%20%20%20%20return%20(mo%2C)%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%20Mixture%20model%20%2B%20differential%20expression%20with%20%60scribe%60%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20In%20this%20tutorial%20we%20will%20use%20%60scribe%60%20to%20do%20two%20things%20that%20are%20routine%20in%20a%20single-cell%20analysis%20pipeline%3A%0A%0A%20%20%20%201.%20**Unsupervised%20cell-type%20discovery**%20%E2%80%94%20fit%20a%20two-component%20mixture%20model%20to%20an%20unlabeled%20dataset%20and%20let%20it%20assign%20each%20cell%20to%20a%20cluster.%0A%20%20%20%202.%20**Differential%20expression%20between%20the%20discovered%20cell%20types**%2C%20following%20the%20compositional%20framework%20described%20in%20the%20%60scribe%60%20paper.%0A%0A%20%20%20%20If%20you%20are%20coming%20from%20a%20standard%20Scanpy%2FSeurat%20workflow%2C%20the%20mental%20model%20to%20carry%20in%20is%3A%0A%0A%20%20%20%20-%20Instead%20of%20%60scanpy.pp.normalize_total%60%20%E2%86%92%20%60scanpy.pp.log1p%60%20%E2%86%92%20%60scanpy.tl.leiden%60%20%E2%86%92%20%60scanpy.tl.rank_genes_groups%60%2C%20we%20fit%20**one%20generative%20model**%20of%20the%20raw%20UMI%20counts%20that%20simultaneously%20handles%20normalization%2C%20clustering%2C%20and%20uncertainty%20quantification.%0A%20%20%20%20-%20Instead%20of%20a%20single%20point%20estimate%20per%20gene%20per%20cluster%2C%20we%20get%20a%20**full%20posterior%20distribution**%20%E2%80%94%20so%20every%20downstream%20quantity%20(mean%20expression%2C%20fold%20change%2C%20differential-expression%20call)%20comes%20with%20a%20natural%20measure%20of%20uncertainty.%0A%20%20%20%20-%20Instead%20of%20p-values%2C%20we%20use%20**local%20false%20sign%20rates%20(lfsr)**%20%E2%80%94%20a%20Bayesian%20quantity%20with%20a%20direct%20probabilistic%20interpretation%3A%20%22the%20posterior%20probability%20that%20I%20have%20the%20direction%20of%20the%20effect%20wrong.%22%0A%0A%20%20%20%20For%20our%20running%20example%20we%20use%20a%20public%2010x%20Genomics%20dataset%20(%5Blink%5D(https%3A%2F%2Fwww.10xgenomics.com%2Fdatasets%2F50-percent-50-percent-jurkat-293-t-cell-mixture-1-standard-1-1-0))%20consisting%20of%20a%20**50%2F50%20mixture%20of%20Jurkat%20and%20HEK%20293T%20cells**.%20The%20true%20mixture%20proportions%20are%20known%2C%20which%20makes%20it%20an%20ideal%20pedagogical%20dataset%3A%20we%20can%20check%20whether%20the%20model%20recovers%20the%20expected%2050%2F50%20split%20and%20whether%20the%20differentially%20expressed%20genes%20look%20like%20the%20textbook%20markers%20of%20each%20cell%20type.%0A%0A%20%20%20%20Key%20facts%20about%20the%20dataset%3A%0A%0A%20%20%20%20-%20~3%2C400%20cells%0A%20%20%20%20-%20Sequenced%20on%20Illumina%20HiSeq%202500%20Rapid%20Run%20V2%2C%20~33%2C000%20reads%20per%20cell%0A%20%20%20%20-%2098%20bp%20read%201%20(transcript)%2C%208%20bp%20I5%20sample%20barcode%2C%2014%20bp%20I7%20GemCode%20barcode%2C%2010%20bp%20read%202%20(UMI)%0A%0A%20%20%20%20Let's%20begin%20by%20importing%20the%20necessary%20packages%20for%20our%20analysis.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_()%3A%0A%20%20%20%20%23%20Import%20basic%20packages%0A%20%20%20%20from%20pathlib%20import%20Path%0A%20%20%20%20import%20pickle%0A%0A%20%20%20%20%23%20Import%20our%20main%20package%0A%20%20%20%20import%20scribe%0A%0A%20%20%20%20%23%20Import%20package%20to%20release%20GPU%20memory%0A%20%20%20%20from%20jax%20import%20clear_caches%0A%20%20%20%20import%20gc%0A%0A%20%20%20%20%23%20Import%20useful%20tools%0A%20%20%20%20import%20numpy%20as%20np%0A%20%20%20%20import%20scanpy%20as%20sc%0A%20%20%20%20import%20jax.numpy%20as%20jnp%0A%0A%20%20%20%20%23%20Import%20plotting%20packages%0A%20%20%20%20import%20matplotlib.pyplot%20as%20plt%0A%20%20%20%20import%20seaborn%20as%20sns%0A%0A%20%20%20%20%23%20Set%20our%20plotting%20style%20(totally%20optional)%0A%20%20%20%20scribe.viz.matplotlib_style()%0A%20%20%20%20return%20Path%2C%20clear_caches%2C%20gc%2C%20jnp%2C%20np%2C%20pickle%2C%20plt%2C%20scribe%2C%20sns%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20Next%2C%20let's%20load%20the%20data%20into%20memory%20and%20perform%20a%20simple%20exploratory%20analysis.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(Path%2C%20scribe)%3A%0A%20%20%20%20%23%20Define%20data%20directory%0A%20%20%20%20data_dir%20%3D%20Path(%0A%20%20%20%20%20%20%20%20%22%2Fpath%2Fto%2F5050_jurkat-293t%2F%22%0A%20%20%20%20)%0A%0A%20%20%20%20%23%20Load%20the%20data%0A%20%20%20%20adata%20%3D%20scribe.data_loader.load_and_preprocess_anndata(%0A%20%20%20%20%20%20%20%20data_dir%2C%20return_jax%3DFalse%0A%20%20%20%20)%0A%0A%20%20%20%20adata%0A%20%20%20%20return%20adata%2C%20data_dir%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20The%20first%20plot%20we%20always%20look%20at%20is%20the%20distribution%20of%20total%20UMI%20counts%20per%20cell.%20Beyond%20the%20usual%20quality-control%20role%2C%20this%20plot%20is%20also%20a%20modelling%20decision%3A%20if%20the%20library%20size%20varies%20a%20lot%20from%20cell%20to%20cell%2C%20then%20every%20cell%20is%20effectively%20sequenced%20at%20its%20own%20depth%20and%20we%20need%20the%20model%20to%20know%20that%20explicitly.%20In%20%60scribe%60%20that%20translates%20to%20enabling%20a%20**per-cell%20capture%20probability**%20(the%20%60variable_capture%3DTrue%60%20flag%20below)%2C%20which%20is%20the%20principled%20analogue%20of%20%22dividing%20by%20total%20counts%22%20in%20traditional%20pipelines%20%E2%80%94%20only%20here%20it%20is%20a%20parameter%20of%20the%20generative%20model%20with%20its%20own%20posterior%2C%20not%20a%20preprocessing%20step.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(adata%2C%20plt%2C%20sns)%3A%0A%20%20%20%20%23%20Initialize%20a%20figure%0A%20%20%20%20_fig%2C%20_ax%20%3D%20plt.subplots(figsize%3D(5%2C%204))%0A%20%20%20%20%23%20Plot%20the%20histogram%20of%20total%20UMI%20counts%0A%20%20%20%20sns.ecdfplot(adata.X.sum(axis%3D1)%2C%20ax%3D_ax)%0A%20%20%20%20%23%20Label%20the%20axes%0A%20%20%20%20_ax.set_xlabel(%22total%20UMI%20count%20per%20cell%22)%0A%20%20%20%20_ax.set_ylabel(%22ECDF%22)%0A%20%20%20%20%23%20Set%20y-axis%20limits%0A%20%20%20%20_ax.set_ylim(-0.01%2C%201.01)%0A%20%20%20%20%23%20Turn%20off%20legend%0A%20%20%20%20_ax.legend_.remove()%0A%20%20%20%20%23%20Show%20the%20plot%0A%20%20%20%20plt.show()%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20With%20nearly%20an%20order%20of%20magnitude%20difference%20between%20the%20smallest%20and%20largest%20library%20sizes%20in%20this%20dataset%2C%20there%20is%20no%20doubt%20that%20we%20will%20need%20to%20model%20variable%20capture%20probabilities%20on%20a%20per-cell%20basis.%0A%0A%20%20%20%20Given%20that%20we%20have%20no%20annotations%20for%20which%20cell%20is%20which%2C%20this%20is%20a%20perfect%20opportunity%20to%20show%20%60scribe%60's%20cell%20annotation%20capabilities%20via%20mixture%20models.%20However%2C%20we%20must%20emphasize%20that%20prior%20knowledge%20from%20biologists%20is%20incredibly%20valuable.%20In%20our%20experience%2C%20%60scribe%60%20is%20able%20to%20resolve%20cell%20types%20at%20a%20very%20coarse-grained%20level.%20Finer%20details%20require%20domain%20expertise.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%20Fitting%20a%20mixture%20model%20with%20variable%20capture%20probability%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20We%20now%20fit%20our%20first%20%60scribe%60%20model.%20Before%20looking%20at%20the%20code%2C%20let's%20unpack%20what%20each%20argument%20does.%0A%0A%20%20%20%20**What%20model%20are%20we%20fitting%3F**%20At%20its%20core%2C%20%60scribe%60%20models%20UMI%20counts%20with%20a%20**Negative%20Binomial%20distribution**%20%E2%80%94%20the%20same%20distribution%20that%20motivates%20DESeq2%20and%20edgeR%2C%20and%20one%20that%20can%20be%20derived%20from%20first%20principles%20from%20a%20two-state%20promoter%20model%20of%20transcription.%20Each%20gene%20has%20a%20parameter%20%24r_g%24%20that%20captures%20its%20**transcriptional%20burstiness**%20(genes%20that%20fire%20in%20big%20bursts%20have%20larger%20%24r_g%24)%2C%20and%20each%20cell%20has%20a%20parameter%20%24%5Chat%7Bp%7D_c%24%20that%20absorbs%20both%20intrinsic%20transcription%20and%20technical%20capture%20efficiency.%0A%0A%20%20%20%20**Why%20%60variable_capture%3DTrue%60%3F**%20Because%20the%20cells%20differ%20substantially%20in%20library%20size%2C%20we%20let%20every%20cell%20have%20its%20own%20capture%20probability%20%24%5Cnu_c%20%5Cin%20(0%2C%201)%24.%20Internally%20this%20produces%20a%20cell-specific%20effective%20success%20probability%0A%20%20%20%20%24%24%5Chat%7Bp%7D_c%20%3D%20%5Cfrac%7Bp%7D%7B%5Cnu_c%20%2B%20p(1%20-%20%5Cnu_c)%7D%2C%24%24%0A%20%20%20%20which%20is%20%60scribe%60's%20principled%20replacement%20for%20ad-hoc%20size-factor%20normalization.%0A%0A%20%20%20%20**Why%20%60parameterization%3D%22mean_odds%22%60%3F**%20Instead%20of%20fitting%20the%20raw%20Negative%20Binomial%20parameters%20%24(r_g%2C%20p)%24%2C%20we%20fit%20the%20**gene%20mean%20expression**%20%24%5Cmu_g%24%20together%20with%20an%20**odds%20ratio**%20%24%5Cphi%20%3D%20p%20%2F%20(1%20-%20p)%24.%20The%20two%20views%20are%20algebraically%20equivalent%20%E2%80%94%20%24r_g%20%3D%20%5Cmu_g%20%5Cphi%24%20and%20%24p%20%3D%201%2F(1%2B%5Cphi)%24%20%E2%80%94%20but%20they%20have%20very%20different%20inference%20behaviour.%20In%20raw%20%24(r_g%2C%20p)%24%20coordinates%2C%20many%20pairs%20produce%20the%20same%20mean%2C%20which%20creates%20a%20long%20banana-shaped%20ridge%20in%20the%20posterior%20that%20is%20painful%20to%20sample.%20The%20mean%E2%80%93odds%20view%20breaks%20that%20degeneracy%3A%20%24%5Cmu_g%24%20is%20tightly%20pinned%20by%20the%20data%20(it%20*is*%20the%20average)%2C%20and%20the%20odds%20ratio%20is%20left%20to%20pick%20up%20the%20residual%20variance.%0A%0A%20%20%20%20**Why%20%60n_components%3D2%60%3F**%20This%20turns%20the%20model%20into%20a%20**mixture%20model%20with%20two%20clusters**.%20We%20do%20*not*%20tell%20the%20model%20which%20cell%20is%20Jurkat%20and%20which%20is%20293T%20%E2%80%94%20instead%2C%20each%20component%20learns%20its%20own%20gene-expression%20profile%2C%20and%20every%20cell%20receives%20a%20posterior%20probability%20of%20belonging%20to%20each%20cluster.%20This%20is%20a%20%22soft%22%20clustering%3A%20a%20cell%20can%20be%2095%25%2F5%25%20or%2060%25%2F40%25%20between%20the%20two%20components%2C%20and%20that%20uncertainty%20is%20preserved%20in%20every%20downstream%20computation.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20The%20%60early_stopping%60%20argument%20does%20exactly%20what%20it%20says%20on%20the%20tin%3A%20it%20watches%20the%20training%20objective%20and%20stops%20once%20the%20loss%20has%20plateaued%20for%20%60patience%60%20steps%20(%601000%60%20here).%20The%20%60checkpoint_dir%60%20path%20also%20tells%20%60scribe%60%20where%20to%20drop%20periodic%20checkpoints%20so%20that%20long%20runs%20can%20be%20resumed%20if%20they%20get%20interrupted.%20You%20can%20leave%20early%20stopping%20off%20for%20short%20experiments%3B%20it%20is%20most%20useful%20when%20you%20do%20not%20want%20to%20hand-tune%20the%20number%20of%20optimization%20steps.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(adata%2C%20data_dir%2C%20pickle%2C%20scribe)%3A%0A%20%20%20%20%23%20Define%20output%20directory%0A%20%20%20%20out_dir%20%3D%20data_dir%20%2F%20%22scribe_results%22%0A%20%20%20%20out_dir.mkdir(parents%3DTrue%2C%20exist_ok%3DTrue)%0A%0A%20%20%20%20%23%20Define%20parameterization%0A%20%20%20%20_parameterization%20%3D%20%22mean_odds%22%0A%0A%20%20%20%20%23%20Define%20output%20file%20path%0A%20%20%20%20_out_path%20%3D%20out_dir%20%2F%20f%22scribe_results_nbvcp_%7B_parameterization%7D.pkl%22%0A%0A%20%20%20%20if%20_out_path.exists()%3A%0A%20%20%20%20%20%20%20%20%23%20Load%20model%20from%20pkl%20file%0A%20%20%20%20%20%20%20%20with%20open(_out_path%2C%20%22rb%22)%20as%20f%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20results_nbvcp%20%3D%20pickle.load(f)%0A%20%20%20%20else%3A%0A%20%20%20%20%20%20%20%20%23%20Fit%20basic%20model%20to%20data%20with%20variable%20capture%20probability%20fit%20per%20cell%0A%20%20%20%20%20%20%20%20results_nbvcp%20%3D%20scribe.fit(%0A%20%20%20%20%20%20%20%20%20%20%20%20adata%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20variable_capture%3DTrue%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20parameterization%3D_parameterization%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20n_components%3D2%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20early_stopping%3D%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22enabled%22%3A%20True%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22patience%22%3A%201000%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22checkpoint_dir%22%3A%20str(_out_path.with_suffix(%22%22))%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%7D%2C%0A%20%20%20%20%20%20%20%20)%0A%20%20%20%20%20%20%20%20%23%20Save%20the%20fitted%20model%0A%20%20%20%20%20%20%20%20with%20open(_out_path%2C%20%22wb%22)%20as%20f%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20pickle.dump(results_nbvcp%2C%20f)%0A%0A%20%20%20%20results_nbvcp%0A%20%20%20%20return%20(results_nbvcp%2C)%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20To%20check%20whether%20training%20successfully%20converged%2C%20we%20examine%20the%20**%5BELBO%5D(https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FEvidence_lower_bound)%20loss**%20curve.%20The%20ELBO%20is%20the%20quantity%20that%20variational%20inference%20is%20optimizing%20under%20the%20hood%20%E2%80%94%20roughly%2C%20the%20negative%20log-evidence%20plus%20a%20gap%20that%20measures%20how%20well%20the%20approximate%20posterior%20matches%20the%20true%20one.%20For%20a%20sanity%20check%20we%20do%20not%20need%20to%20read%20it%20quantitatively%3A%20a%20clean%20run%20produces%20a%20curve%20that%20**drops%20quickly%20at%20first%20and%20then%20flattens%20out**%2C%20and%20that%20is%20exactly%20what%20we%20want%20to%20see.%20Spikes%2C%20slow%20drifts%20upward%2C%20or%20oscillations%20that%20never%20settle%20are%20warnings%20worth%20investigating.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(results_nbvcp%2C%20scribe)%3A%0A%20%20%20%20%23%20Plot%20ELBO%20loss%0A%20%20%20%20scribe.viz.plot_loss(results_nbvcp%2C%20figsize%3D(7%2C%203))%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20The%20training%20loss%20has%20converged%2C%20but%20that%20alone%20does%20not%20prove%20the%20model%20describes%20the%20*counts*%20well.%20**Posterior%20predictive%20checks%20(PPCs)**%20address%20that%20directly%3A%20we%20simulate%20new%20UMI%20data%20from%20the%20fitted%20model%20and%20compare%20its%20distribution%20to%20the%20histogram%20of%20what%20was%20actually%20observed.%20If%20the%20model%E2%80%99s%20assumptions%20and%20fit%20are%20reasonable%2C%20synthetic%20replicates%20should%20track%20the%20real%20data%3B%20systematic%20shifts%20or%20heavy%20tails%20in%20the%20observations%20that%20the%20model%20never%20reproduces%20flag%20misspecification%E2%80%94regardless%20of%20how%20tidy%20downstream%20plots%20look.%0A%0A%20%20%20%20For%20mixture%20models%2C%20%60scribe.viz.plot_mixture_ppc_overview%60%20focuses%20on%20genes%20that%20are%20both%20**expressed%20across%20a%20wide%20dynamic%20range**%20and%20**strongly%20separated%20between%20components**.%20Genes%20are%20grouped%20into%20bins%20by%20overall%20median%20expression%20(so%20you%20see%20low-%2C%20mid-%2C%20and%20high-expression%20loci)%2C%20and%20within%20each%20bin%20the%20**largest%20component-to-component%20log%20fold-changes**%20(from%20the%20MAP%20estimates)%20are%20prioritized.%20Those%20genes%20are%20exactly%20where%20differential%20expression%20and%20mixing%20interact%3A%20they%20stress-test%20whether%20the%20mixture%E2%80%99s%20generative%20story%20matches%20the%20marginal%20count%20distributions%20you%20actually%20measured.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(adata%2C%20results_nbvcp%2C%20scribe)%3A%0A%20%20%20%20scribe.viz.plot_mixture_ppc_overview(%0A%20%20%20%20%20%20%20%20results_nbvcp%2C%0A%20%20%20%20%20%20%20%20adata%2C%0A%20%20%20%20%20%20%20%20n_genes%3D16%2C%0A%20%20%20%20%20%20%20%20n_rows%3D4%2C%0A%20%20%20%20%20%20%20%20figsize%3D(8%2C%208)%2C%0A%20%20%20%20%20%20%20%20n_samples%3D512%2C%0A%20%20%20%20)%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20Nice%20%E2%80%94%20the%20overview%20PPC%20already%20gave%20us%20a%20sanity%20check%20on%20the%20**whole**%20library.%20Next%2C%20**%60scribe.viz.plot_mixture_ppc_comparison%60**%20is%20an%20easy%20way%20to%20sanity-check%20the%20**mixture%20itself**%3A%20are%20the%20two%20cell%20types%20behaving%20the%20way%20we%20think%3F%0A%0A%20%20%20%20You%20get%20the%20**same%20handful%20of%20interesting%20genes**%20as%20before%20(picked%20to%20span%20expression%20levels%20and%20to%20differ%20a%20lot%20between%20components).%20On%20each%20small%20panel%2C%20the%20**shaded%20regions**%20are%20synthetic%20count%20distributions%20from%20**component%201**%20and%20**component%202**%E2%80%94stacked%20on%20top%20of%20each%20other%20so%20you%20can%20compare%20them%20at%20a%20glance.%0A%0A%20%20%20%20The%20model%20also%20makes%20a%20simple%20call%20on%20every%20cell%3A%20%E2%80%9Cyou%E2%80%99re%20more%20likely%20**this**%20type%20than%20**that**%20type.%E2%80%9D%20Those%20are%20**MAP%20assignments**%20(each%20cell%20goes%20to%20its%20best-matching%20component).%20The%20**colored%20lines**%20are%20the%20**real%20UMI%20histograms**%20for%20just%20the%20cells%20assigned%20to%20each%20component%2C%20using%20colors%20that%20line%20up%20with%20the%20two%20predictive%20bands.%0A%0A%20%20%20%20So%20in%20plain%20language%3A%20*if%20our%20Jurkat%20vs.%20293T%20story%20is%20decent%2C%20each%20colored%20data%20trace%20should%20hug%20%E2%80%9Cits%E2%80%9D%20shaded%20region%20more%20than%20the%20other%20one%E2%80%94and%20the%20two%20shaded%20regions%20should%20sit%20where%20the%20two%20peaks%20actually%20live.*%20This%20view%20is%20only%20set%20up%20for%20a%20**two-component**%20fit%3B%20with%20more%20components%20you%E2%80%99d%20lean%20on%20the%20per-component%20plots%20instead.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(adata%2C%20clear_caches%2C%20gc%2C%20results_nbvcp%2C%20scribe)%3A%0A%20%20%20%20%23%20Clear%20gpu%20memory%0A%20%20%20%20clear_caches()%0A%20%20%20%20gc.collect()%0A%0A%20%20%20%20scribe.viz.plot_mixture_ppc_comparison(%0A%20%20%20%20%20%20%20%20results_nbvcp%2C%0A%20%20%20%20%20%20%20%20adata%2C%0A%20%20%20%20%20%20%20%20n_genes%3D16%2C%0A%20%20%20%20%20%20%20%20n_rows%3D4%2C%0A%20%20%20%20%20%20%20%20figsize%3D(8%2C%208)%2C%0A%20%20%20%20%20%20%20%20n_samples%3D512%2C%0A%20%20%20%20)%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20This%20plot%20clearly%20shows%20that%20the%20model%20is%20able%20to%20separate%20the%20two%20different%20cell%20types%20and%20predict%20their%20independent%20distributions.%0A%0A%20%20%20%20A%20second%20quick%20sanity%20check%20is%20available%20from%20the%20model%20itself.%20Because%20this%20dataset%20is%20advertised%20as%20a%20**50%2F50**%20mixture%2C%20the%20model's%20estimate%20of%20the%20**mixing%20weights**%20(the%20prior%20probability%20that%20a%20randomly%20chosen%20cell%20belongs%20to%20each%20component)%20should%20come%20out%20roughly%20balanced.%20Those%20weights%20live%20in%20the%20MAP%20dictionary%20under%20%60mixing_weights%60.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(np%2C%20results_nbvcp)%3A%0A%20%20%20%20%23%20Pull%20MAP%20mixing%20weights%20from%20the%20fitted%20mixture%20model.%20%20For%20a%20balanced%0A%20%20%20%20%23%2050%2F50%20mixture%20we%20expect%20both%20entries%20to%20be%20close%20to%200.5%20-%20anything%20very%0A%20%20%20%20%23%20lopsided%20would%20hint%20at%20a%20convergence%20problem%20or%20a%20genuinely%20unbalanced%0A%20%20%20%20%23%20dataset.%0A%20%20%20%20_mixing_weights%20%3D%20np.asarray(%0A%20%20%20%20%20%20%20%20results_nbvcp.get_map(descriptive_names%3DTrue)%5B%22mixing_weights%22%5D%0A%20%20%20%20)%0A%20%20%20%20print(%0A%20%20%20%20%20%20%20%20%22MAP%20mixing%20weights%20(component%200%2C%20component%201)%3A%22%2C%0A%20%20%20%20%20%20%20%20np.round(_mixing_weights%2C%203)%2C%0A%20%20%20%20)%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20That%20is%20remarkably%20close!%20We%20have%20one%20cell%20type%20with%20%E2%89%88%2053%25%20and%20the%20other%20one%20with%2047%25.%0A%0A%20%20%20%20For%20a%20**bigger-picture**%20view%2C%20we%20can%20also%20look%20at%20a%20**UMAP**%20scatter.%20That%20sounds%20like%20%E2%80%9Cthe%E2%80%9D%20global%20picture%20of%20the%20library%E2%80%94but%20it%20is%20worth%20being%20a%20little%20cautious.%0A%0A%20%20%20%20**UMAP%20is%20a%20visualization%2C%20not%20a%20ground-truth%20summary%20of%20biology.**%20It%20compresses%20thousands%20of%20genes%20into%20two%20dimensions%20with%20nonlinear%20geometry%2C%20neighborhood%20choices%2C%20and%20other%20knobs%2C%20so%20distances%20and%20cluster%20shapes%20are%20easy%20to%20over-interpret.%20It%20also%20depends%20on%20a%20standard%20single-cell%20recipe%20(filtering%2C%20normalization%2C%20HVGs%2C%20sometimes%20PCA)%20that%20is%20**not**%20the%20same%20objective%20the%20generative%20model%20was%20trained%20on.%0A%0A%20%20%20%20That%20said%2C%20for%20a%20**quick%20global%20sanity%20check**%2C%20it%20can%20still%20be%20useful.%20In%20%60scribe%60%2C%20the%20UMAP%20is%20fit%20on%20the%20**experimental**%20counts%20after%20a%20Scanpy-style%20pipeline%3B%20**synthetic**%20counts%20are%20then%20simulated%20from%20the%20fitted%20model%2C%20passed%20through%20the%20**same**%20preprocessing%20and%20PCA%2C%20and%20**projected**%20with%20the%20**same**%20UMAP%20transform%E2%80%94so%20you%20are%20asking%20a%20simple%20question%3A%20*do%20model-generated%20cells%20land%20in%20roughly%20the%20same%20regions%20as%20the%20real%20ones%3F*%20Treat%20overlaps%20as%20encouraging%20and%20gross%20mismatches%20as%20a%20prompt%20to%20dig%20deeper%20(starting%20with%20count-level%20checks%20like%20the%20PPCs%20above)%2C%20not%20as%20proof%20by%20themselves.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(adata%2C%20clear_caches%2C%20gc%2C%20results_nbvcp%2C%20scribe)%3A%0A%20%20%20%20%23%20Clear%20GPU%20memory%0A%20%20%20%20clear_caches()%0A%20%20%20%20gc.collect()%0A%0A%20%20%20%20scribe.viz.plot_umap(results_nbvcp%2C%20adata%2C%20figsize%3D(4%2C%204))%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20When%20we%20overlay%20synthetic%20draws%20from%20the%20fitted%20model%20on%20the%20same%20UMAP%20used%20for%20the%20observed%20cells%2C%20the%20two%20clouds%20line%20up%20surprisingly%20well.%20That%20agreement%20is%20worth%20pausing%20on%2C%20because%20the%20variational%20approximation%20we%20used%20is%20*mean-field*%3A%20each%20gene%E2%80%99s%20approximate%20posterior%20is%20updated%20as%20if%20the%20others%20were%20summarized%20by%20their%20averages%2C%20so%20the%20approximation%20does%20not%20explicitly%20model%20residual%20**gene%E2%80%93gene%20dependence**%20(beyond%20what%20is%20mediated%20by%20shared%20latent%20structure%20such%20as%20cell%20type%20or%20batch).%20In%20other%20words%2C%20we%20are%20prioritizing%20accurate%20**marginal**%20distributions%20per%20gene%20rather%20than%20a%20fully%20flexible%20**joint**%20distribution%20over%20all%20genes%20at%20once.%0A%0A%20%20%20%20For%20many%20downstream%20tasks%E2%80%94especially%20**gene-level%20differential%20expression**%2C%20where%20the%20goal%20is%20to%20rank%20genes%20by%20shifts%20between%20conditions%E2%80%94those%20marginals%20are%20often%20exactly%20what%20we%20need.%20A%20mean-field%20fit%20can%20therefore%20be%20a%20practical%20sweet%20spot%3A%20enough%20flexibility%20to%20capture%20strong%20biological%20signal%20in%20per-gene%20expression%20while%20keeping%20inference%20and%20sampling%20tractable%20on%20high-dimensional%20count%20data.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20Let's%20start%20comparing%20the%20two%20components%20quantitatively.%20Because%20we%20chose%20the%20mean%E2%80%93odds%20parameterization%2C%20the%20most%20interpretable%20quantity%20per%20gene%20is%20%24%5Cmu_g%24%3A%20the%20**gene%20mean%20expression**%20in%20each%20component.%20Recall%20that%20%60scribe%60%20does%20not%20return%20a%20single%20number%20per%20parameter%20%E2%80%94%20it%20returns%20an%20**approximate%20posterior%20distribution**.%20That%20posterior%20is%20the%20thing%20that%20carries%20uncertainty.%0A%0A%20%20%20%20A%20convenient%20single-number%20summary%20is%20the%20**%5BMaximum%20a%20posteriori%20(MAP)%5D(https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FMaximum_a_posteriori_estimation)%20estimate**%2C%20the%20mode%20of%20the%20posterior.%20Think%20of%20it%20as%20the%20Bayesian%20analogue%20of%20a%20maximum-likelihood%20point%20estimate%20%E2%80%94%20useful%20for%20scatter%20plots%20and%20quick%20comparisons%2C%20as%20long%20as%20we%20remember%20that%20the%20full%20posterior%20has%20more%20to%20say.%0A%0A%20%20%20%20We%20pull%20MAP%20estimates%20with%20%60results_nbvcp.get_map(descriptive_names%3DTrue)%60.%20The%20%60descriptive_names%3DTrue%60%20flag%20renames%20the%20internal%20parameters%20(%60mu%60%2C%20%60phi%60%2C%20...)%20to%20human-readable%20keys%20(%60mean_expression%60%2C%20%60odds%60%2C%20...).%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(results_nbvcp)%3A%0A%20%20%20%20%23%20%20Extract%20map%20from%20results%0A%20%20%20%20results_map%20%3D%20results_nbvcp.get_map(descriptive_names%3DTrue)%0A%0A%20%20%20%20results_map.keys()%0A%20%20%20%20return%20(results_map%2C)%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20We%20can%20the%20extract%20the%20parameters%20we%20care%20about.%20Let's%20look%20at%20the%20shape%20of%20the%20resulting%20array.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(results_map)%3A%0A%20%20%20%20results_mean%20%3D%20results_map%5B%22mean_expression%22%5D%0A%0A%20%20%20%20results_mean.shape%0A%20%20%20%20return%20(results_mean%2C)%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20This%20array%20has%20dimensions%20(%60n_components%60%2C%20%60n_genes%60).%20Thus%2C%20we%20can%20easily%20index%20each%20component%20and%20plot%20the%20comparison%20of%20these%20mean%20expression%20values.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(plt%2C%20results_mean)%3A%0A%20%20%20%20%23%20Initialize%20figure%0A%20%20%20%20_fig%2C%20_ax%20%3D%20plt.subplots(1%2C%201%2C%20figsize%3D(4%2C%204))%0A%0A%20%20%20%20%23%20Plot%20scatter%20of%20mean%20expression%20MAP%0A%20%20%20%20_ax.scatter(%0A%20%20%20%20%20%20%20%20results_mean%5B0%2C%20%3A%5D%2C%0A%20%20%20%20%20%20%20%20results_mean%5B1%2C%20%3A%5D%2C%0A%20%20%20%20%20%20%20%20s%3D3%2C%0A%20%20%20%20)%0A%0A%20%20%20%20%23%20Add%20identity%20line%20as%20black%20dashed%20line%20from%20the%20min%20to%20the%20max%20of%20the%20array%0A%20%20%20%20_min_val%20%3D%20min(results_mean%5B0%2C%20%3A%5D.min()%2C%20results_mean%5B1%2C%20%3A%5D.min())%0A%20%20%20%20_max_val%20%3D%20max(results_mean%5B0%2C%20%3A%5D.max()%2C%20results_mean%5B1%2C%20%3A%5D.max())%0A%20%20%20%20_ax.plot(%0A%20%20%20%20%20%20%20%20%5B_min_val%2C%20_max_val%5D%2C%0A%20%20%20%20%20%20%20%20%5B_min_val%2C%20_max_val%5D%2C%0A%20%20%20%20%20%20%20%20color%3D%22black%22%2C%0A%20%20%20%20%20%20%20%20linestyle%3D%22--%22%2C%0A%20%20%20%20)%0A%0A%20%20%20%20%23%20Label%20axis%0A%20%20%20%20_ax.set_xlabel(r%22component%20A%20mean%20expression%20(%24%5Cmu_%7B%5Ctext%7BMAP%7D%7D%5E%7B(A)%7D%24)%22)%0A%20%20%20%20_ax.set_ylabel(r%22component%20B%20mean%20expression%20(%24%5Cmu_%7B%5Ctext%7BMAP%7D%7D%5E%7B(B)%7D%24)%22)%0A%0A%20%20%20%20%23%20Set%20title%0A%20%20%20%20_ax.set_title(%22Mean%20expression%20comparison%22)%0A%0A%20%20%20%20%23%20Set%20to%20log-log%20scale%0A%20%20%20%20_ax.set_xscale(%22log%22)%0A%20%20%20%20_ax.set_yscale(%22log%22)%0A%0A%20%20%20%20%23%20Show%20plot%0A%20%20%20%20plt.show()%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20As%20expected%2C%20the%20two%20components%E2%80%99%20MAP%20means%20track%20each%20other%20closely%20on%20a%20log%E2%80%93log%20plot%2C%20with%20a%20clear%20tail%20of%20genes%20that%20shift%20between%20components.%20Those%20MAP%20values%20are%20only%20**summaries**%20of%20the%20posterior%2C%20however%E2%80%94each%20point%20is%20a%20single%20best%20guess%20for%20%24%5Cmu%24%20per%20gene%20and%20component.%20To%20show%20**uncertainty**%2C%20we%20will%20redraw%20the%20same%20comparison%20with%20vertical%20and%20horizontal%20%E2%80%9Cerror%20bars%E2%80%9D%20derived%20from%20the%20full%20posterior%20over%20%24%5Cmu%24.%0A%0A%20%20%20%20In%20%60scribe%60%2C%20approximate%20posterior%20draws%20are%20available%20from%20the%20fitted%20results%20object%20via%20**%60get_posterior_samples%60**.%20Here%20we%20request%20many%20draws%20(for%20example%20%60n_samples%3D512%60)%20and%20pull%20the%20**%60%22mu%22%60**%20tensor%20so%20we%20obtain%20an%20array%20of%20shape%20%60(n_samples%2C%20n_components%2C%20n_genes)%60%20matching%20the%20mean-expression%20parameters.%20We%20then%20summarize%20each%20gene%E2%80%99s%20marginal%20posterior%20along%20the%20sample%20axis%E2%80%94for%20instance%20with%20the%202.5th%20and%2097.5th%20**percentiles**%E2%80%94to%20obtain%20a%20**95%25%20credible%20interval**%20per%20gene%20and%20component.%20A%20**credible%20interval**%20answers%20a%20*Bayesian*%20question%3A%20%E2%80%9CGiven%20the%20model%20and%20data%2C%20with%2095%25%20posterior%20probability%2C%20%24%5Cmu%24%20lies%20between%20these%20two%20values.%E2%80%9D%20That%20is%20different%20from%20a%20**confidence%20interval**%20in%20*frequentist*%20inference%2C%20which%20is%20tied%20to%20a%20hypothetical%20repetition%20of%20the%20experiment%3A%20%E2%80%9CIf%20we%20repeated%20sampling%20many%20times%20and%20built%20intervals%20the%20same%20way%2C%20about%2095%25%20of%20them%20would%20cover%20the%20true%20fixed%20parameter.%E2%80%9D%20For%20communication%20in%20this%20notebook%2C%20%E2%80%9Cuncertainty%20bars%20from%20posterior%20quantiles%E2%80%9D%20is%20the%20right%20mental%20model.%0A%0A%20%20%20%20We%20attach%20those%20intervals%20to%20the%20same%20MAP%20scatter%20as%20asymmetric%20error%20bars%20(distance%20from%20MAP%20down%20to%20the%20lower%20quantile%20and%20up%20to%20the%20upper%20quantile%20on%20each%20axis).%0A%0A%20%20%20%20Let's%20first%20obtain%20the%20samples.%20We%20use%20the%20%60store-samples%3DFalse%60argument%20to%20avoid%20storing%20large%20tensors%20directly%20on%20the%20%60results_nbvcp%60%20object%20to%20save%20GPU%20memory.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(np%2C%20results_nbvcp)%3A%0A%20%20%20%20%23%20Generate%20posterior%20samples%0A%20%20%20%20mean_samples%20%3D%20np.array(%0A%20%20%20%20%20%20%20%20results_nbvcp.get_posterior_samples(n_samples%3D512%2C%20store_samples%3DFalse)%5B%0A%20%20%20%20%20%20%20%20%20%20%20%20%22mu%22%0A%20%20%20%20%20%20%20%20%5D%0A%20%20%20%20)%0A%20%20%20%20return%20(mean_samples%2C)%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20Now%2C%20we%20are%20ready%20to%20generate%20the%20plot%20with%20the%20corresponding%20errorbars.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(mean_samples%2C%20np%2C%20plt%2C%20results_mean)%3A%0A%20%20%20%20%23%20Define%20quantiles%20for%20error%20bars%20(easy%20to%20edit)%0A%20%20%20%20_lower_quantile%20%3D%202.5%0A%20%20%20%20_upper_quantile%20%3D%2097.5%0A%0A%20%20%20%20%23%20Compute%20quantiles%20along%20the%20samples%20axis%20for%20both%20components%0A%20%20%20%20_mean_samples_lower%20%3D%20np.percentile(mean_samples%2C%20_lower_quantile%2C%20axis%3D0)%0A%20%20%20%20_mean_samples_upper%20%3D%20np.percentile(mean_samples%2C%20_upper_quantile%2C%20axis%3D0)%0A%0A%20%20%20%20%23%20The%20mean_samples%20array%20has%20shape%20(n_samples%2C%20n_components%2C%20n_genes)%0A%20%20%20%20%23%20Compute%20error%20bar%20lengths%20for%20each%20component%20and%20gene%0A%20%20%20%20_error_a%20%3D%20%5B%0A%20%20%20%20%20%20%20%20results_mean%5B0%2C%20%3A%5D%20-%20_mean_samples_lower%5B0%2C%20%3A%5D%2C%20%20%23%20lower%20errors%0A%20%20%20%20%20%20%20%20_mean_samples_upper%5B0%2C%20%3A%5D%20-%20results_mean%5B0%2C%20%3A%5D%2C%20%20%23%20upper%20errors%0A%20%20%20%20%5D%0A%20%20%20%20_error_b%20%3D%20%5B%0A%20%20%20%20%20%20%20%20results_mean%5B1%2C%20%3A%5D%20-%20_mean_samples_lower%5B1%2C%20%3A%5D%2C%20%20%23%20lower%20errors%0A%20%20%20%20%20%20%20%20_mean_samples_upper%5B1%2C%20%3A%5D%20-%20results_mean%5B1%2C%20%3A%5D%2C%20%20%23%20upper%20errors%0A%20%20%20%20%5D%0A%0A%20%20%20%20%23%20Initialize%20figure%0A%20%20%20%20_fig2%2C%20_ax2%20%3D%20plt.subplots(1%2C%201%2C%20figsize%3D(4%2C%204))%0A%0A%20%20%20%20%23%20Plot%20with%20errorbars%20for%20each%20gene%0A%20%20%20%20_ax2.errorbar(%0A%20%20%20%20%20%20%20%20results_mean%5B0%2C%20%3A%5D%2C%0A%20%20%20%20%20%20%20%20results_mean%5B1%2C%20%3A%5D%2C%0A%20%20%20%20%20%20%20%20xerr%3D_error_a%2C%0A%20%20%20%20%20%20%20%20yerr%3D_error_b%2C%0A%20%20%20%20%20%20%20%20fmt%3D%22o%22%2C%0A%20%20%20%20%20%20%20%20ms%3D3%2C%0A%20%20%20%20%20%20%20%20elinewidth%3D1%2C%0A%20%20%20%20%20%20%20%20capsize%3D0%2C%0A%20%20%20%20%20%20%20%20alpha%3D0.7%2C%0A%20%20%20%20%20%20%20%20color%3D%22C0%22%2C%0A%20%20%20%20%20%20%20%20ecolor%3D%22gray%22%2C%0A%20%20%20%20)%0A%0A%20%20%20%20%23%20Add%20identity%20line%20as%20black%20dashed%20line%20from%20the%20min%20to%20the%20max%20of%20the%20array%0A%20%20%20%20_min_val2%20%3D%20min(results_mean%5B0%2C%20%3A%5D.min()%2C%20results_mean%5B1%2C%20%3A%5D.min())%0A%20%20%20%20_max_val2%20%3D%20max(results_mean%5B0%2C%20%3A%5D.max()%2C%20results_mean%5B1%2C%20%3A%5D.max())%0A%20%20%20%20_ax2.plot(%0A%20%20%20%20%20%20%20%20%5B_min_val2%2C%20_max_val2%5D%2C%0A%20%20%20%20%20%20%20%20%5B_min_val2%2C%20_max_val2%5D%2C%0A%20%20%20%20%20%20%20%20color%3D%22black%22%2C%0A%20%20%20%20%20%20%20%20linestyle%3D%22--%22%2C%0A%20%20%20%20)%0A%0A%20%20%20%20%23%20Label%20axis%0A%20%20%20%20_ax2.set_xlabel(r%22component%20A%20mean%20expression%20(%24%5Cmu_%7B%5Ctext%7BMAP%7D%7D%5E%7B(A)%7D%24)%22)%0A%20%20%20%20_ax2.set_ylabel(r%22component%20B%20mean%20expression%20(%24%5Cmu_%7B%5Ctext%7BMAP%7D%7D%5E%7B(B)%7D%24)%22)%0A%0A%20%20%20%20%23%20Set%20title%0A%20%20%20%20_ax2.set_title(%22Mean%20expression%20comparison%20(with%20error%20bars)%22)%0A%0A%20%20%20%20%23%20Set%20to%20log-log%20scale%0A%20%20%20%20_ax2.set_xscale(%22log%22)%0A%20%20%20%20_ax2.set_yscale(%22log%22)%0A%0A%20%20%20%20%23%20Show%20plot%0A%20%20%20%20plt.show()%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%20Differential%20Expression%20Analysis%0A%0A%20%20%20%20Most%20single-cell%20DE%20workflows%20follow%20a%20familiar%20recipe%3A%20take%20raw%20UMI%20counts%2C%20divide%20by%20the%20total%20per%20cell%2C%20optionally%20log-transform%2C%20then%20run%20gene-by-gene%20tests.%20Normalization%20is%20treated%20as%20a%20practical%20preprocessing%20step%20%E2%80%94%20a%20way%20to%20make%20cells%20comparable%20%E2%80%94%20**before**%20the%20real%20statistics%20begin.%0A%0A%20%20%20%20%60scribe%60%20takes%20a%20different%20view.%20Rather%20than%20normalizing%20first%20and%20asking%20questions%20later%2C%20the%20model's%20generative%20structure%20*itself*%20tells%20us%20what%20we%20should%20be%20comparing.%0A%0A%20%20%20%20%23%23%23%20The%20argument%20in%20one%20paragraph%0A%0A%20%20%20%20Under%20the%20Negative%20Binomial%20model%20of%20transcript%20counts%2C%20the%20full%20profile%20of%20a%20cell%20factors%20exactly%20into%20two%20pieces%3A%20**(i)**%20the%20**total%20UMI%20count**%20(how%20deeply%20the%20cell%20was%20sequenced)%20and%20**(ii)**%20a%20**composition%20on%20the%20gene%20simplex**%20%E2%80%94%20the%20fractions%20%24%5Cunderline%7B%5Crho%7D%20%3D%20(%5Crho_1%2C%20%5Cldots%2C%20%5Crho_G)%24%20that%20say%20how%20the%20cell's%20transcription%20is%20allocated%20across%20genes.%20The%20depth%20piece%20is%20a%20nuisance%3B%20the%20composition%20piece%20is%20the%20biology.%20Crucially%2C%20these%20compositions%20are%20**latent%20parameters%20of%20the%20model**%2C%20not%20a%20preprocessing%20artefact%2C%20and%20%60scribe%60%20gives%20us%20a%20**posterior%20distribution**%20over%20them.%0A%0A%20%20%20%20So%20the%20differential-expression%20question%20becomes%3A%20*how%20do%20the%20two%20components'%20posterior%20compositions%20%24%5Cunderline%7B%5Crho%7D%5E%7B(A)%7D%24%20and%20%24%5Cunderline%7B%5Crho%7D%5E%7B(B)%7D%24%20differ%2C%20gene%20by%20gene%2C%20while%20properly%20respecting%20compositional%20geometry%3F*%0A%0A%20%20%20%20%23%23%23%20Why%20CLR%20coordinates%3F%0A%0A%20%20%20%20Compositions%20live%20on%20the%20simplex%2C%20not%20on%20the%20real%20line%2C%20so%20we%20cannot%20just%20subtract%20them.%20The%20standard%20fix%20is%20the%20**centered%20log-ratio%20(CLR)**%20transform%3A%0A%20%20%20%20%24%24%0A%20%20%20%20z_%7B%5Cmathrm%7BCLR%7D%2Cg%7D%20%5C%3B%3D%5C%3B%20%5Clog%20%5Crho_g%20%5C%3B-%5C%3B%20%5Cfrac%7B1%7D%7BG%7D%5Csum_%7Bj%3D1%7D%5E%7BG%7D%5Clog%20%5Crho_j.%0A%20%20%20%20%24%24%0A%20%20%20%20In%20words%2C%20each%20gene's%20log-fraction%20is%20compared%20to%20the%20**geometric%20mean%20across%20all%20genes**.%20Two%20features%20make%20this%20the%20natural%20choice%20here%3A%20it%20is%20**reference-free**%20(no%20gene%20is%20privileged%20as%20the%20denominator)%2C%20and%20it%20turns%20the%20multiplicative%20geometry%20of%20the%20simplex%20into%20**additive**%20contrasts%20%E2%80%94%20so%20differences%20between%20conditions%20behave%20like%20ordinary%20numbers.%0A%0A%20%20%20%20Because%20%60scribe%60%20carries%20a%20full%20posterior%20over%20%24%5Cunderline%7B%5Crho%7D%24%2C%20every%20posterior%20draw%20can%20be%20transformed%20into%20CLR%20coordinates%2C%20giving%20us%20a%20posterior%20over%20**CLR%20contrasts**%20%24%5CDelta_g%20%3D%20z_%7B%5Cmathrm%7BCLR%7D%2Cg%7D%5E%7B(A)%7D%20-%20z_%7B%5Cmathrm%7BCLR%7D%2Cg%7D%5E%7B(B)%7D%24.%20All%20differential-expression%20statistics%20below%20come%20from%20direct%20counting%20over%20these%20draws%20%E2%80%94%20no%20Gaussian%20assumptions%2C%20no%20p-values%2C%20just%20Monte%20Carlo%20summaries%20of%20a%20posterior%20we%20already%20have.%0A%0A%20%20%20%20%3E%20For%20the%20full%20derivation%20see%20the%20supplementary%20material%20of%20the%20paper.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20The%20empirical%20comparison%20in%20%60scribe%60%20works%20by%20turning%20each%20posterior%20draw%20of%20the%20expression%20parameters%20into%20**relative%20abundances**%E2%80%94normalized%20compositions%20on%20the%20gene%20simplex%E2%80%94and%20then%20contrasting%20those%20compositions%20between%20conditions%20in%20**CLR**%20space.%20To%20do%20that%20reliably%20we%20first%20need%20a%20**large**%20set%20of%20posterior%20samples.%0A%0A%20%20%20%20We%20generate%20those%20draws%20with%20**%60get_posterior_samples%60**%20on%20the%20fitted%20results%20object%20and%20**keep%20them%20on%20the%20results**%20so%20the%20DE%20routines%20can%20read%20them%20directly.%20We%20set%20**%60convert_to_numpy%3DTrue%60**%20so%20the%20stored%20tensors%20are%20converted%20to%20plain%20**NumPy%20arrays**%20after%20sampling%2C%20which%20frees%20**GPU%20memory**%20immediately%20for%20later%20work.%20All%20downstream%20DE%20functions%20use%20array-backend%20dispatch%20and%20transparently%20run%20on%20the%20NumPy%2FSciPy%20stack%20when%20given%20NumPy%20inputs.%20Finally%2C%20because%20each%20mixture%20**component**%20is%20a%20separate%20condition%20in%20this%20comparison%2C%20we%20will%20extract%20component-specific%20views%E2%80%94either%20with%20**%60get_component(k)%60**%20or%20the%20shorthand%20**%60results%5B%3A%2C%20k%5D%60**%20(all%20genes%2C%20component%20%60k%60)%E2%80%94so%20the%20compositional%20contrast%20is%20computed%20between%20the%20right%20cell%20types.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(results_nbvcp)%3A%0A%20%20%20%20%23%20Define%20number%20of%20posterior%20samples%0A%20%20%20%20n_samples%20%3D%2010_000%0A%0A%20%20%20%20%23%20Sample%20posterior%20and%20store%20in%20results%20object%0A%20%20%20%20results_nbvcp.get_posterior_samples(%0A%20%20%20%20%20%20%20%20n_samples%3Dn_samples%2C%20convert_to_numpy%3DTrue%0A%20%20%20%20)%0A%0A%20%20%20%20%23%20Extract%20components%0A%20%20%20%20component0%20%3D%20results_nbvcp.get_component(0)%0A%20%20%20%20component1%20%3D%20results_nbvcp%5B%3A%2C%201%5D%0A%20%20%20%20return%20component0%2C%20component1%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20Let's%20call%20%60scribe.de.compare%60%20with%20the%20default%20arguments.%20The%20ones%20that%20matter%20most%20for%20this%20notebook%20are%3A%0A%0A%20%20%20%20-%20**%60method%3D%22empirical%22%60**%3A%20Use%20the%20Monte%20Carlo%20DE%20path%3A%20build%20CLR%20contrasts%20from%20posterior%20samples%20and%20summarize%20with%20counting-based%20posterior%20quantities%20(not%20the%20closed-form%20Gaussian%20**%60%22parametric%22%60**%20path%2C%20nor%20**%60%22shrinkage%22%60**).%0A%0A%20%20%20%20-%20**%60n_samples_dirichlet%3D1%60**%3A%20Take%20**one**%20Dirichlet%E2%80%93multinomial%20draw%20on%20the%20gene%20simplex%20**per**%20posterior%20sample%20of%20the%20concentration%20parameters.%20Raising%20this%20fans%20out%20more%20simplex%20draws%20per%20posterior%20draw%20(more%20averaging%20of%20Monte%20Carlo%20noise%2C%20higher%20cost).%0A%0A%20%20%20%20-%20**%60batch_size%3D2048%60**%3A%20Process%20compositional%20sampling%20%2F%20CLR%20differencing%20in%20**chunks**%20of%20this%20many%20rows%20to%20cap%20memory%20use%20on%20large%20gene%20sets%20and%20many%20posterior%20samples.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(component0%2C%20component1%2C%20scribe)%3A%0A%20%20%20%20%23%20Run%20differential%20expression%20analysis%0A%20%20%20%20results_de%20%3D%20scribe.de.compare(model_A%3Dcomponent0%2C%20model_B%3Dcomponent1)%0A%0A%20%20%20%20results_de%0A%20%20%20%20return%20(results_de%2C)%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20Before%20summarizing%20the%20differential%20expression%20results%2C%20we%20set%20three%20parameters%20that%20control%20how%20we%20threshold%20and%20report%20the%20analysis.%20They%20map%20directly%20onto%20choices%20a%20biologist%20already%20makes%20with%20tools%20like%20DESeq2%20or%20edgeR%20(expression%20floor%2C%20fold-change%20cutoff%2C%20FDR%20target)%2C%20with%20small%20but%20important%20Bayesian%20twists.%0A%0A%20%20%20%20-%20**%60MIN_MEAN_EXPRESSION%20%3D%205.0%60**%20%E2%80%94%20the%20minimum%20mean%20UMI%20count%20a%20gene%20needs%20to%20be%20analyzed%20individually.%20Genes%20that%20fall%20below%20this%20floor%20are%20**not%20dropped**%3A%20they%20are%20pooled%20into%20a%20single%20%22other%22%20pseudo-gene%20so%20the%20simplex%20still%20sums%20to%20one.%20Why%20bother%3F%20Because%20the%20CLR%20transform%20divides%20by%20the%20geometric%20mean%20of%20log-fractions%20across%20*all*%20genes%2C%20and%20the%20thousands%20of%20nearly-unexpressed%20genes%20in%20a%20typical%20dataset%20have%20very%20negative%2C%20very%20noisy%20log-fractions%20that%20make%20that%20geometric%20mean%20jitter%20from%20one%20posterior%20sample%20to%20the%20next.%20That%20jitter%20propagates%20into%20the%20CLR%20contrast%20and%20inflates%20uncertainty%20on%20the%20genes%20we%20actually%20care%20about.%20Aggregating%20low-expression%20genes%20into%20a%20single%20pooled%20component%20stabilizes%20the%20reference%2C%20and%20thanks%20to%20the%20**Dirichlet%20closure%20property**%20(summing%20a%20subset%20of%20Dirichlet%20components%20gives%20another%20Dirichlet)%20the%20aggregation%20is%20**exact**%20%E2%80%94%20no%20information%20loss.%20The%20full%20argument%20and%20formula%20are%20in%20the%20supplementary%20material.%0A%0A%20%20%20%20-%20**%60TAU%20%3D%20log(2)%60**%20%E2%80%94%20the%20minimum%20CLR%20shift%20we%20treat%20as%20biologically%20meaningful.%20In%20CLR%20space%2C%20%24%5Clog%202%24%20is%20roughly%20a%20doubling%20of%20a%20gene's%20fraction%20relative%20to%20the%20geometric%20mean%20of%20the%20expressed%20transcriptome%3A%20the%20Bayesian-compositional%20analogue%20of%20the%20%24%5Clog_2%24%20fold-change%20cutoff%20familiar%20from%20volcano%20plots.%0A%0A%20%20%20%20-%20**%60TARGET_PEFP%20%3D%200.05%60**%20%E2%80%94%20the%20desired%20**Posterior%20Expected%20False%20discovery%20Proportion**%2C%20the%20Bayesian%20analogue%20of%20the%205%25%20FDR%20that%20biologists%20routinely%20control%20for.%20Instead%20of%20ranking%20genes%20by%20p-values%20and%20applying%20Benjamini%E2%80%93Hochberg%2C%20%60scribe%60%20ranks%20them%20by%20the%20**local%20false%20sign%20rate%20(lfsr)**%20%E2%80%94%20the%20posterior%20probability%20that%20we%20have%20the%20*direction*%20of%20the%20effect%20wrong.%20An%20lfsr%20of%200.02%20means%20%22given%20the%20data%2C%20there%20is%20a%202%25%20chance%20the%20sign%20is%20wrong.%22%20The%20PEFP%20is%20the%20average%20lfsr%20across%20the%20declared%20DE%20set%2C%20and%20%60target_pefp%3D0.05%60%20is%20the%20threshold%20that%20keeps%20that%20average%20below%205%25.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20The%20%60scribe%60%20DE%20interface%20is%20built%20around%20two%20chained%20steps.%20We%20first%20call%20%60results_de.set_expression_threshold(MIN_MEAN_EXPRESSION)%60%2C%20which%20applies%20the%20expression%20mask%20in-place%20on%20the%20results%20object%E2%80%94pooling%20filtered%20genes%20into%20the%20%22other%22%20pseudo-gene%20before%20any%20CLR%20coordinates%20are%20formed.%20We%20then%20call%20%60.to_dataframe(tau%3DTAU%2C%20target_pefp%3DTARGET_PEFP%2C%20metrics%3D%22clr%22)%60%20to%20convert%20the%20masked%20results%20into%20a%20tidy%20pandas%20DataFrame.%20The%20%60tau%60%20and%20%60target_pefp%60%20arguments%20are%20passed%20directly%20here%20rather%20than%20set%20globally%2C%20making%20it%20easy%20to%20explore%20different%20effect-size%20cutoffs%20or%20FDR%20levels%20without%20re-running%20the%20sampling.%20Setting%20%60metrics%3D%22clr%22%60%20tells%20the%20method%20to%20report%20gene-level%20statistics%20in%20CLR%20space%E2%80%94posterior%20mean%20shift%2C%20posterior%20standard%20deviation%2C%20lfsr%2C%20and%20the%20PEFP-controlled%20call%E2%80%94as%20opposed%20to%20other%20coordinate%20systems%20available%20in%20the%20package.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(jnp%2C%20results_de)%3A%0A%20%20%20%20%23%20Defie%20minimum%20mean%20expression%0A%20%20%20%20MIN_MEAN_EXPRESSION%20%3D%205.0%0A%20%20%20%20%23%20Define%20effective%20CLR%20change%20(TAU)%0A%20%20%20%20TAU%20%3D%20jnp.log(2.0)%0A%20%20%20%20%23%20Define%20target%20PEFP%0A%20%20%20%20TARGET_PEFP%20%3D%200.05%0A%0A%20%20%20%20%23%20Filter%20results%20based%20on%20expression%0A%20%20%20%20results_de_filtered%20%3D%20results_de.set_expression_threshold(%0A%20%20%20%20%20%20%20%20MIN_MEAN_EXPRESSION%0A%20%20%20%20)%0A%0A%20%20%20%20%23%20CLR%20output%20stays%20on%20the%20built-in%20path.%0A%20%20%20%20df_clr%20%3D%20results_de_filtered.to_dataframe(%0A%20%20%20%20%20%20%20%20tau%3DTAU%2C%0A%20%20%20%20%20%20%20%20target_pefp%3DTARGET_PEFP%2C%0A%20%20%20%20%20%20%20%20metrics%3D%22clr%22%2C%0A%20%20%20%20)%0A%0A%20%20%20%20df_clr.head()%0A%20%20%20%20return%20TARGET_PEFP%2C%20TAU%2C%20df_clr%2C%20results_de_filtered%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20We%20now%20visualize%20how%20each%20gene%E2%80%99s%20**average%20compositional%20expression**%20differs%20between%20the%20two%20mixture%20components%20using%20%60scribe.viz.plot_de_mean_expression%60.%20With%20the%20default%20%60mode%3D%22clr%22%60%2C%20every%20point%20is%20one%20retained%20gene%20(after%20the%20expression%20mask%20and%20%60%22other%22%60%20pooling).%20The%20horizontal%20axis%20is%20the%20**posterior%20summary%20of%20mean%20CLR%20expression%20in%20component%20A**%3B%20the%20vertical%20axis%20is%20the%20same%20for%20**component%20B**.%20Those%20are%20the%20gene-level%20**CLR%20means**%20exported%20with%20the%20DE%20table%E2%80%94not%20raw%20NB%20%24%5Cmu%24%20on%20the%20natural%20scale%E2%80%94so%20the%20plot%20lives%20in%20the%20same%20**compositional**%20coordinate%20system%20used%20for%20the%20differential%20contrasts.%0A%0A%20%20%20%20Interpreting%20the%20geometry%20is%20straightforward.%20Genes%20whose%20average%20abundance%20is%20**similar**%20in%20the%20two%20cell%20types%20lie%20**near%20the%20dashed%20identity%20line**%20%24y%20%3D%20x%24%20on%20the%20**log%E2%80%93log**%20axes.%20Genes%20that%20fall%20**away%20from%20that%20line**%20are%20driven%20off-diagonal%20because%20one%20cell%20type%20allocates%20a%20systematically%20larger%20or%20smaller%20**fraction**%20of%20its%20transcriptome%20to%20that%20gene%20(relative%20to%20the%20CLR%20reference)%2C%20which%20is%20exactly%20the%20kind%20of%20pattern%20differential%20expression%20is%20meant%20to%20surface.%0A%0A%20%20%20%20The%20call%20passes%20**%60tau%3DTAU%60**%20and%20**%60target_pefp%3DTARGET_PEFP%60**%20through%20to%20the%20same%20dataframe%20export%20logic%20as%20%60to_dataframe%60%3A%20%60scribe%60%20combines%20your%20**minimum%20meaningful%20CLR%20shift**%20(%24%5Ctau%24)%20with%20a%20**posterior%20expected%20false%20discovery%20proportion%20(PEFP)**%20target%20to%20decide%20which%20genes%20are%20**called**%20differentially%20expressed.%20On%20the%20figure%2C%20genes%20that%20**fail**%20that%20joint%20call%20are%20drawn%20in%20**light%20gray**%3B%20genes%20that%20**pass**%20are%20highlighted%20in%20**dark%20red**%2C%20so%20the%20plot%20doubles%20as%20a%20quick%20**DE%20hit%20list**%20aligned%20with%20the%20thresholds%20you%20set%20above%E2%80%94without%20replacing%20the%20full%20table%20of%20posterior%20means%2C%20uncertainties%2C%20and%20local%20false%20sign%20rates%20stored%20in%20%60df_clr%60.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(TARGET_PEFP%2C%20TAU%2C%20results_de_filtered%2C%20scribe)%3A%0A%20%20%20%20scribe.viz.plot_de_mean_expression(%0A%20%20%20%20%20%20%20%20results_de_filtered%2C%20tau%3DTAU%2C%20target_pefp%3DTARGET_PEFP%2C%20figsize%3D(4%2C%204)%0A%20%20%20%20)%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20The%20scatter%20of%20mean%20expression%20in%20CLR%20space%20is%20a%20useful%20first%20look%2C%20but%20each%20point%20is%20still%20a%20**summary**%20of%20a%20richer%20object%3A%20for%20every%20gene%2C%20%60scribe%60%20carries%20a%20**full%20posterior%20distribution**%20of%20CLR%20coordinates%20(and%20hence%20of%20the%20contrast%20between%20cell%20types)%20over%20posterior%20samples.%20The%20figures%20therefore%20emphasize%20interpretable%20point%20summaries%E2%80%94typically%20posterior%20means%E2%80%94while%20the%20underlying%20uncertainty%20lives%20in%20those%20draws.%0A%0A%20%20%20%20A%20familiar%20way%20to%20read%20the%20same%20two%20ingredients%E2%80%94**how%20large%20is%20the%20shift%3F**%20and%20**how%20confident%20are%20we%3F**%E2%80%94is%20a%20**volcano%20plot**.%20In%20classical%20**frequentist**%20differential%20expression%2C%20one%20common%20layout%20puts%20an%20**effect%20size**%20on%20the%20horizontal%20axis%E2%80%94often%20a%20log%20fold%20change%20between%20conditions%2C%20e.g.%20%24%5Clog_2%24%20fold%20change%E2%80%94and%20a%20**significance%20score**%20on%20the%20vertical%20axis%2C%20almost%20always%0A%20%20%20%20%24%24%0A%20%20%20%20y%20%3D%20-%5Clog_%7B10%7D%20p%2C%0A%20%20%20%20%24%24%0A%20%20%20%20or%20the%20same%20transformation%20applied%20to%20an%20**FDR-adjusted**%20%24p%24-value%20(e.g.%20Benjamini%E2%80%93Hochberg%20%24q%24).%20Small%20%24p%24-values%20become%20large%20%24y%24%2C%20so%20genes%20that%20are%20both%20**strong**%20(far%20from%20zero%20on%20%24x%24)%20and%20**statistically%20extreme**%20under%20the%20chosen%20test%20**float%20upward**%20on%20the%20plot%2C%20producing%20the%20characteristic%20%E2%80%9Cerupting%E2%80%9D%20shape.%0A%0A%20%20%20%20That%20frequentist%20picture%20answers%20a%20sampling-based%20question%3A%20*under%20a%20null%20hypothesis%20and%20a%20chosen%20error%20model%2C%20how%20surprising%20is%20this%20fold%20change%3F*%20The%20%24y%24-axis%20is%20tied%20to%20a%20**%24p%24-value**%3A%20the%20probability%2C%20under%20the%20null%2C%20of%20seeing%20a%20test%20statistic%20at%20least%20as%20extreme%20as%20the%20one%20observed.%0A%0A%20%20%20%20%60scribe%60%E2%80%99s%20**Bayesian**%20volcano%20keeps%20the%20**same%20visual%20grammar**%E2%80%94effect%20on%20%24x%24%2C%20evidence%20on%20%24y%24%E2%80%94but%20swaps%20the%20ingredients%20for%20quantities%20defined%20**with%20respect%20to%20the%20posterior**%3A%0A%0A%20%20%20%20-%20**Horizontal%20axis%20(%24x%24).**%20We%20plot%20the%20**posterior%20mean%20CLR%20contrast**%20between%20the%20two%20conditions%20(components)%2C%20i.e.%20the%20average%20shift%20in%20CLR%20coordinates%20for%20that%20gene.%20In%20compositional%20terms%20this%20is%20the%20natural%20analogue%20of%20a%20**fold-change%E2%80%93style%20contrast**%2C%20but%20expressed%20in%20**CLR%20geometry**%20on%20the%20simplex%20rather%20than%20as%20a%20raw%20ratio%20of%20normalized%20counts.%0A%0A%20%20%20%20-%20**Vertical%20axis%20(%24y%24).**%20Instead%20of%20%24-%5Clog_%7B10%7D%20p%24%2C%20we%20use%0A%20%20%20%20%20%20%24%24%0A%20%20%20%20%20%20y%20%3D%20-%5Clog_%7B10%7D(%5Cmathrm%7Blfsr%7D)%2C%0A%20%20%20%20%20%20%24%24%0A%20%20%20%20%20%20where%20**lfsr**%20is%20the%20**local%20false%20sign%20rate**%3A%20the%20posterior%20probability%20that%20the%20**sign**%20of%20the%20differential%20effect%20is%20wrong%20if%20we%20call%20that%20gene%20differentially%20expressed.%20Large%20%24y%24%20therefore%20means%20**small%20lfsr**%2C%20i.e.%20high%20posterior%20confidence%20in%20the%20**direction**%20of%20the%20shift%E2%80%94not%20a%20frequentist%20tail%20probability%20under%20a%20null.%0A%0A%20%20%20%20So%20the%20two%20volcano%20styles%20are%20**parallel%20in%20layout**%20but%20**different%20in%20meaning**%3A%20the%20traditional%20plot%20ranks%20genes%20by%20compatibility%20with%20a%20**null%20sampling%20model**%20via%20%24p%24-values%20(optionally%20FDR-adjusted)%3B%20%60scribe%60%20ranks%20genes%20by%20**posterior%20uncertainty%20about%20the%20direction%20of%20change**%20via%20the%20lfsr%2C%20with%20the%20CLR%20contrast%20as%20the%20effect-size%20coordinate.%20In%20both%20cases%2C%20points%20in%20the%20**upper%20corners**%20are%20the%20usual%20suspects%E2%80%94large%20shifts%20with%20strong%20statistical%20support%E2%80%94but%20the%20support%20is%20**frequentist%20significance**%20in%20one%20case%20and%20**Bayesian%20sign%20certainty**%20in%20the%20other.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(TARGET_PEFP%2C%20TAU%2C%20results_de_filtered%2C%20scribe)%3A%0A%20%20%20%20scribe.viz.plot_de_volcano(%0A%20%20%20%20%20%20%20%20results_de_filtered%2C%20tau%3DTAU%2C%20target_pefp%3DTARGET_PEFP%2C%20figsize%3D(4%2C%204)%0A%20%20%20%20)%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20The%20volcano%20plot%20is%20great%20for%20getting%20a%20sense%20of%20the%20global%20structure%2C%20but%20at%20some%20point%20we%20want%20to%20look%20at%20**actual%20gene%20names**.%20Let's%20sort%20the%20DE%20table%20by%20**local%20false%20sign%20rate**%20(lowest%20first%20%E2%80%94%20these%20are%20the%20genes%20for%20which%20the%20direction%20of%20the%20effect%20is%20most%20confident)%2C%20and%20inspect%20the%20top%20hits%20in%20each%20direction.%0A%0A%20%20%20%20In%20the%20DataFrame%2C%20%60clr_delta_mean%60%20is%20the%20posterior%20mean%20CLR%20contrast%20(positive%20%E2%86%92%20higher%20in%20component%20A%3B%20negative%20%E2%86%92%20higher%20in%20component%20B)%2C%20%60clr_lfsr%60%20is%20the%20lfsr%2C%20and%20%60clr_is_de%60%20is%20the%20boolean%20DE%20call%20that%20combines%20%60tau%60%20and%20%60target_pefp%60.%20The%20mean%20expression%20in%20each%20component%20is%20available%20in%20%60clr_mean_expression_A%60%20%2F%20%60clr_mean_expression_B%60%20for%20context.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(TARGET_PEFP%2C%20TAU%2C%20df_clr)%3A%0A%20%20%20%20%23%20Sort%20the%20DE%20table%20by%20lfsr%20ascending%20so%20the%20most%20confident%20calls%20appear%0A%20%20%20%20%23%20first.%20%20We%20keep%20only%20DE-called%20genes%20(clr_is_de%3DTrue)%20because%20those%20are%0A%20%20%20%20%23%20the%20ones%20that%20pass%20both%20the%20effect-size%20(tau)%20and%20error-control%0A%20%20%20%20%23%20(target_pefp)%20thresholds%20simultaneously.%0A%20%20%20%20_de_hits%20%3D%20df_clr.loc%5Bdf_clr%5B%22clr_is_de%22%5D%5D.copy()%0A%0A%20%20%20%20%23%20Split%20by%20direction%20of%20the%20effect%20so%20we%20can%20display%20%22up%20in%20A%22%20and%0A%20%20%20%20%23%20%22up%20in%20B%22%20top%20hits%20separately%20-%20this%20is%20usually%20what%20a%20biologist%0A%20%20%20%20%23%20asks%20for%20when%20scanning%20for%20marker%20genes.%0A%20%20%20%20_top_up_in_A%20%3D%20(%0A%20%20%20%20%20%20%20%20_de_hits%5B_de_hits%5B%22clr_delta_mean%22%5D%20%3E%200%5D%0A%20%20%20%20%20%20%20%20.sort_values(%22clr_lfsr%22)%0A%20%20%20%20%20%20%20%20.head(15)%0A%20%20%20%20)%0A%20%20%20%20_top_up_in_B%20%3D%20(%0A%20%20%20%20%20%20%20%20_de_hits%5B_de_hits%5B%22clr_delta_mean%22%5D%20%3C%200%5D%0A%20%20%20%20%20%20%20%20.sort_values(%22clr_lfsr%22)%0A%20%20%20%20%20%20%20%20.head(15)%0A%20%20%20%20)%0A%0A%20%20%20%20print(%0A%20%20%20%20%20%20%20%20f%22Total%20DE-called%20genes%20at%20PEFP%3D%7Bfloat(TARGET_PEFP)%3A.2f%7D%2C%20%22%0A%20%20%20%20%20%20%20%20f%22tau%3D%7Bfloat(TAU)%3A.3f%7D%3A%20%7Blen(_de_hits)%7D%22%0A%20%20%20%20)%0A%20%20%20%20print(%22%5CnTop%2015%20genes%20up%20in%20component%20A%20(check%20Jurkat%20or%20293T%20markers)%3A%22)%0A%20%20%20%20print(%0A%20%20%20%20%20%20%20%20_top_up_in_A%5B%5B%22gene%22%2C%20%22clr_delta_mean%22%2C%20%22clr_lfsr%22%5D%5D.to_string(%0A%20%20%20%20%20%20%20%20%20%20%20%20index%3DFalse%0A%20%20%20%20%20%20%20%20)%0A%20%20%20%20)%0A%20%20%20%20print(%22%5CnTop%2015%20genes%20up%20in%20component%20B%3A%22)%0A%20%20%20%20print(%0A%20%20%20%20%20%20%20%20_top_up_in_B%5B%5B%22gene%22%2C%20%22clr_delta_mean%22%2C%20%22clr_lfsr%22%5D%5D.to_string(%0A%20%20%20%20%20%20%20%20%20%20%20%20index%3DFalse%0A%20%20%20%20%20%20%20%20)%0A%20%20%20%20)%0A%20%20%20%20return%0A%0A%0A%40app.cell(hide_code%3DTrue)%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%23%20Reading%20the%20output%3A%20which%20component%20is%20which%3F%0A%0A%20%20%20%20Remember%20that%20we%20never%20told%20%60scribe%60%20which%20cells%20were%20Jurkat%20and%20which%20were%20293T%20%E2%80%94%20the%20labels%20%22component%20A%22%20and%20%22component%20B%22%20are%20assigned%20by%20the%20model%20in%20arbitrary%20order.%20Now%20that%20we%20have%20a%20ranked%20gene%20list%20we%20can%20actually%20**annotate**%20each%20component%20biologically.%0A%0A%20%20%20%20**Component%20A%20looks%20like%20Jurkat%20(T-cell%20line).**%20The%20up-regulated%20hits%20are%20dominated%20by%20canonical%20T-lymphocyte%20markers%3A%0A%0A%20%20%20%20-%20**LCK**%20%E2%80%94%20lymphocyte-specific%20protein%20tyrosine%20kinase%2C%20the%20classical%20Jurkat%20marker.%0A%20%20%20%20-%20**LEF1**%20%E2%80%94%20a%20lymphoid%20enhancer-binding%20factor%20essential%20for%20T-cell%20development.%0A%20%20%20%20-%20**CD1E**%2C%20**FYB**%2C%20**ITM2A**%20%E2%80%94%20all%20T-cell-associated%20surface%2Fsignaling%20proteins.%0A%20%20%20%20-%20**PSMB8**%2C%20**ISG15**%20%E2%80%94%20interferon-inducible%20%2F%20immunoproteasome%20genes%20typical%20of%20immune%20cells.%0A%0A%20%20%20%20This%20is%20exactly%20the%20kind%20of%20signature%20you%20would%20hope%20to%20see%20for%20a%20T-ALL-derived%20line.%0A%0A%20%20%20%20**Component%20B%20looks%20like%20HEK%20293T%20(embryonic%20kidney%20%2F%20neural-crest-like).**%20The%20hits%20here%20are%20a%20mix%20of%20kidney-epithelial%2C%20translation%2C%20and%20%E2%80%94%20interestingly%20%E2%80%94%20neuronal%20markers%3A%0A%0A%20%20%20%20-%20**CA2**%20(carbonic%20anhydrase%20II)%2C%20**CST3**%20(cystatin%20C)%2C%20**MDK**%20(midkine)%20%E2%80%94%20classical%20kidney%20%2F%20embryonic%20genes.%0A%20%20%20%20-%20**UCHL1**%2C%20**GAL**%2C%20**PCSK1N**%20%E2%80%94%20neuroendocrine%20%2F%20neural%20markers.%20This%20is%20a%20real%20biological%20curiosity%3A%20despite%20being%20called%20%22embryonic%20kidney%2C%22%20HEK%20293%20cells%20express%20a%20surprisingly%20neural%20gene%20program%2C%20and%20some%20analyses%20argue%20they%20are%20closer%20in%20lineage%20to%20a%20neural-crest%E2%80%93derived%20cell%20than%20to%20kidney%20epithelium.%20The%20model%20picks%20this%20up%20directly%20from%20the%20counts.%0A%0A%20%20%20%20**The%20cleanest%20sanity%20check%20of%20all%3A%20XIST.**%20XIST%20is%20the%20long%20non-coding%20RNA%20that%20coats%20the%20inactive%20X%20chromosome%20in%20female%20cells.%20It%20appears%20strongly%20up%20in%20component%20B%2C%20and%20is%20essentially%20absent%20in%20component%20A.%20Jurkat%20is%20a%20**male**%20(XY)%20cell%20line%3B%20HEK%20293T%20is%20**female**%20(XX).%20So%20the%20model%20has%20independently%20recovered%20a%20sex-chromosome-based%20cell-line%20difference%20that%20no%20single-cell%20clustering%20tool%20could%20%22cheat%22%20on%20%E2%80%94%20it%20comes%20straight%20from%20the%20biology.%0A%0A%20%20%20%20%23%23%23%20What%20this%20tells%20us%20about%20the%20model%0A%0A%20%20%20%20Without%20a%20single%20supervised%20label%2C%20%60scribe%60%20has%3A%0A%0A%20%20%20%201.%20Recovered%20the%20known%20**50%2F50**%20mixing%20proportion%20(check%20the%20MAP%20mixing%20weights%20above).%0A%20%20%20%202.%20Cleanly%20separated%20two%20cell%20lines%20in%20posterior%20space%2C%20confirmed%20by%20count-level%20PPCs.%0A%20%20%20%203.%20Returned%20a%20differential-expression%20table%20whose%20top%20hits%20are%20textbook%20markers%2C%20with%20**lfsr%20%3D%200.0**%20on%20every%20single%20one%20%E2%80%94%20i.e.%2C%20the%20posterior%20has%20essentially%20no%20ambiguity%20about%20the%20direction%20of%20the%20effect.%0A%0A%20%20%20%20%23%23%23%20Where%20to%20go%20next%0A%0A%20%20%20%20-%20**Rerun%20with%20more%20components**%20(%60n_components%3D3%2C%204%2C%20...%60)%20if%20the%20PPCs%20or%20UMAP%20suggest%20substructure%20within%20a%20single%20cluster%20(e.g.%2C%20cell-cycle%20phases%20within%20Jurkat).%0A%20%20%20%20-%20**Use%20%60scribe.de.compare(...%2C%20metrics%3D%22all%22)%60**%20to%20pull%20not%20only%20CLR-space%20contrasts%20but%20also%20biological%20log-fold-change%20(%60bio_lfc%60)%2C%20log-variance%20ratio%20(%60bio_lvr%60)%2C%20and%20Jeffreys%20divergence%20(%60bio_kl%60)%20summaries.%20The%20CLR%20view%20answers%20%22how%20do%20compositions%20differ%3F%22%3B%20the%20%60bio_*%60%20views%20answer%20%22how%20do%20the%20underlying%20NB%20distributions%20differ%3F%22%20%E2%80%94%20often%20complementary.%0A%20%20%20%20-%20**Carry%20the%20full%20posterior%20downstream.**%20The%20main%20thing%20a%20generative%20Bayesian%20model%20buys%20you%20over%20a%20traditional%20pipeline%20is%20not%20just%20a%20point%20estimate%20but%20calibrated%20**uncertainty**%20on%20every%20gene%2C%20every%20contrast%2C%20every%20pathway%20enrichment.%20That%20becomes%20invaluable%20the%20moment%20you%20move%20from%20toy%20datasets%20like%20this%20one%20to%20noisier%20real-world%20atlases.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0Aif%20__name__%20%3D%3D%20%22__main__%22%3A%0A%20%20%20%20app.run()%0A