stats
The stats
module includes useful functions to compute statistical characterizations of the data and the output results.
Posterior Predictive Checks
BarBay.stats.logfreq_ratio_popmean_ppc
— Functionlogfreq_ratio_popmean_ppc(df, n_ppc; kwargs)
Function to compute the posterior predictive checks (better called the posterior retrodictive checks) for the barcode log-frequency ratio for neutral lineages.
Model
The functional form that connects the barcode frequency at tie $t+1$ based on the frequency at time $t$ for neutral barcodes is of the form
\[ f_{t+1}^{(n)} = f_{t}^{(n)} \exp\left[ \left( - \bar{s}_t \right) \tau \right],\]
where $\bar{s}_t$ is the population mean fitness between time $t$ and $t+1$, and $\tau$ is the time interval between time $t$ and $t+1$. Our inference model assumes that
\[ \frac{f_{t+1}^{(n)}}{f_{t}^{(n)}} \sim \log-\mathcal{N}\left( - \bar{s}_t, \sigma^{(n)} \right),\]
where $\sigma^{(n)}$ is the inferred standard deviation for the model. This function generates samples out of this distribution.
Arguments
df::DataFrames.DataFrame
: Dataframe containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- population mean fitness values. NOTE: The number of columns containing population mean fitness values determines the number of datapoints where the ppc are evaluated.
- (log)-normal likelihood standard deviation.
n_ppc::Int
: Number of samples to generate per set of parameters.
Optional Arguments
param::Dict{Symbol, Symbol}
: Dictionary indicating the name of the variables
in the mcmc chain defining the following variables: - population_mean_fitness
: Common pattern in all population mean fitness variables. - population_std_fitness
: Common pattern in all standard deviations estimates for the likelihood.
flatten::Bool=true
: Boolean indicating whether to flatten the output of multiple chain into a single column.
Returns
log(fₜ₊₁ / fₜ)= s⁽ᵐ⁾ - s̅ₜ::Array{Float64}
: Evaluation of the log frequency ratio posterior predictive checks at all times for each MCMC sample.
logfreq_ratio_popmean_ppc(chain, n_ppc; kwargs)
Function to compute the posterior predictive checks (better called the posterior retrodictive checks) for the barcode log-frequency ratio for neutral lineages.
Model
The functional form that connects the barcode frequency at tie $t+1$ based on the frequency at time $t$ for neutral barcodes is of the form
\[ f_{t+1}^{(n)} = f_{t}^{(n)} \exp\left[ \left( - \bar{s}_t \right) \tau \right],\]
where $\bar{s}_t$ is the population mean fitness between time $t$ and $t+1$, and $\tau$ is the time interval between time $t$ and $t+1$. Our inference model assumes that
\[ \frac{f_{t+1}^{(n)}}{f_{t}^{(n)}} \sim \log-\mathcal{N}\left( - \bar{s}_t, \sigma^{(n)} \right),\]
where $\sigma^{(n)}$ is the inferred standard deviation for the model. This function generates samples out of this distribution.
Arguments
chain::MCMCChains.Chains
: Chain containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- mutant relative fitness values.
- population mean fitness values. NOTE: The number of columns containing population mean fitness values determines the number of datapoints where the ppc are evaluated.
- (log)-normal likelihood standard deviation.
n_ppc::Int
: Number of samples to generate per set of parameters.
Optional Arguments
param::Dict{Symbol, Symbol}
: Dictionary indicating the name of the variables
in the mcmc chain defining the following variables: - population_mean_fitness
: Common pattern in all population mean fitness variables. - population_std_fitness
: Common pattern in all standard deviations estimates for the likelihood.
model::Symbol=:lognormal
: Either:normal
or:lognormal
to indicate if the model used a normal or lognormal distribution for the likelihood. This is because when using a normal distribution, the nuisance parameters are sampled in log scale and need to be exponentiated.flatten::Bool=true
: Boolean indicating whether to flatten the output of multiple chain into a single column.
Returns
log(fₜ₊₁ / fₜ) = s⁽ᵐ⁾ - s̅ₜ::Array{Float64}
: Evaluation of the log frequency ratio posterior predictive checks at all times for each MCMC sample.
BarBay.stats.logfreq_ratio_bc_ppc
— Functionlogfreq_ratio_bc_ppc(df, n_ppc; kwargs)
Function to compute the posterior predictive checks for the barcode log-frequency ratio for adaptive mutants.
Model
The functional form that connects the barcode frequency at time $t+1$ with the frequency at time $t$ is of the form
\[ f_{t+1}^{(m)} = f_{t}^{(m)} \exp\left[ \left( s^{(m)} - \bar{s}_t \right) \tau \right],\]
where $s^{(m)}$ is the mutant relative fitness, $\bar{s}_t$ is the population mean fitness between time $t$ and $t+1$, and $\tau$ is the time interval between time $t$ and $t+1$. The statistical models in this package assume that
\[ \log\left(\frac{f_{t+1}^{(m)}}{f_{t}^{(m)}}\right) \sim \mathcal{N}\left( s^{(m)} - \bar{s}_t, \sigma^{(m)} \right),\]
where $\sigma^{(m)}$ is the inferred standard deviation for the model. This function generates samples out of this distribution to produce the posterior predictive checks.
Arguments
df::DataFrames.DataFrame
: Dataframe containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- mutant relative fitness values.
- population mean fitness values. NOTE: The number of columns containing population mean fitness values determines the number of datapoints where the ppc are evaluated.
- (log)-normal likelihood standard deviation.
n_ppc::Int
: Number of samples to generate per set of parameters.
Optional Keyword Arguments
param::Dict{Symbol, Symbol}
: Dictionary indicating the name of the variables
in the mcmc chain defining the following variables:
:bc_mean_fitness
: Variable defining the inferred mutant fitness values⁽ᵐ⁾
.:bc_std_fitness
: Variable defining the standard defining the inferred standard deviation on the likelihood functionσ⁽ᵐ⁾
.population_mean_fitness
: Common pattern in all population mean fitness variables.flatten::Bool=true
: Boolean indicating whether to flatten the output of
multiple chain into a single column.
Returns
log(fₜ₊₁ / fₜ) = s⁽ᵐ⁾ - s̅ₜ::Array{Float64}
: Evaluation of the frequency posterior predictive checks at all times for each MCMC sample.
logfreq_ratio_bc_ppc(chain, n_ppc; kwargs)
Function to compute the posterior predictive checks for the barcode log-frequency ratio for adaptive mutants.
Model
The functional form that connects the barcode frequency at time $t+1$ with the frequency at time $t$ is of the form
\[ f_{t+1}^{(m)} = f_{t}^{(m)} \exp\left[ \left( s^{(m)} - \bar{s}_t \right) \tau \right],\]
where $s^{(m)}$ is the mutant relative fitness, $\bar{s}_t$ is the population mean fitness between time $t$ and $t+1$, and $\tau$ is the time interval between time $t$ and $t+1$. The statistical models in this package assume that
\[ \log\left(\frac{f_{t+1}^{(m)}}{f_{t}^{(m)}}\right) \sim \mathcal{N}\left( s^{(m)} - \bar{s}_t, \sigma^{(m)} \right),\]
where $\sigma^{(m)}$ is the inferred standard deviation for the model. This function generates samples out of this distribution to produce the posterior predictive checks.
Arguments
chain::MCMCChains.Chains
: Chain containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- mutant relative fitness values.
- population mean fitness values. NOTE: The number of columns containing population mean fitness values determines the number of datapoints where the ppc are evaluated.
- (log)-normal likelihood standard deviation.
n_ppc::Int
: Number of samples to generate per set of parameters.
Optional Arguments
param::Dict{Symbol, Symbol}
: Dictionary indicating the name of the variables
in the mcmc chain defining the following variables:
:bc_mean_fitness
: Variable defining the inferred mutant fitness values⁽ᵐ⁾
.:bc_std_fitness
: Variable defining the standard defining the inferred standard deviation on the likelihood functionσ⁽ᵐ⁾
.population_mean_fitness
: Common pattern in all population mean fitness variables.flatten::Bool=true
: Boolean indicating whether to flatten the output of
multiple chain into a single column.
Returns
log(fₜ₊₁ / fₜ) = s⁽ᵐ⁾ - s̅ₜ::Array{Float64}
: Evaluation of the frequency posterior predictive checks at all times for each MCMC sample.
BarBay.stats.logfreq_ratio_multienv_ppc
— Functionlogfreq_ratio_mutienv_ppc(df, n_ppc; kwargs)
Function to compute the posterior predictive checks for the barcode log-frequency ratio for adaptive mutants.
Model
The functional form that connects the barcode frequency at time $t+1$ with the frequency at time $t$ is of the form
\[ f_{t+1}^{(m)} = f_{t}^{(m)} \exp\left[ \left( s^{(m)} - \bar{s}_t \right) \tau \right],\]
where $s^{(m)}$ is the mutant relative fitness, $\bar{s}_t$ is the population mean fitness between time $t$ and $t+1$, and $\tau$ is the time interval between time $t$ and $t+1$. The statistical models in this package assume that
\[ \log\left(\frac{f_{t+1}^{(m)}}{f_{t}^{(m)}}\right) \sim \mathcal{N}\left( s^{(m)} - \bar{s}_t, \sigma^{(m)} \right),\]
where $\sigma^{(m)}$ is the inferred standard deviation for the model. This function generates samples out of this distribution to produce the posterior predictive checks.
Arguments
df::DataFrames.DataFrame
: Dataframe containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- mutant relative fitness values.
- population mean fitness values. NOTE: The number of columns containing population mean fitness values determines the number of datapoints where the ppc are evaluated.
- (log)-normal likelihood standard deviation.
n_ppc::Int
: Number of samples to generate per set of parameters.envs::Vector{<:Any}
: List of environments in experiment. This is used to index the corresponding fitness from the chain. NOTE: The list of environments should be the name or corresponding label of the environemnt; the index is generated internally.
Optional Keyword Arguments
param::Dict{Symbol, Symbol}
: Dictionary indicating the name of the variables
in the mcmc chain defining the following variables:
:bc_mean_fitness
: Variable defining the inferred mutant fitness values⁽ᵐ⁾
.:bc_std_fitness
: Variable defining the standard defining the inferred standard deviation on the likelihood functionσ⁽ᵐ⁾
.population_mean_fitness
: Common pattern in all population mean fitness variables.flatten::Bool=true
: Boolean indicating whether to flatten the output of
multiple chain into a single column.
Returns
log(fₜ₊₁ / fₜ) = s⁽ᵐ⁾ - s̅ₜ::Array{Float64}
: Evaluation of the frequency posterior predictive checks at all times for each MCMC sample.
logfreqratiomutienvppc(chain, nppc; kwargs)
Function to compute the posterior predictive checks (better called the posterior retrodictive checks) for the barcode log-frequency ratio for neutral lineages.
Model
The functional form that connects the barcode frequency at time $t+1$ with the frequency at time $t$ is of the form
\[ f_{t+1}^{(m)} = f_{t}^{(m)} \exp\left[ \left( s^{(m)} - \bar{s}_t \right) \tau \right],\]
where $s^{(m)}$ is the mutant relative fitness, $\bar{s}_t$ is the population mean fitness between time $t$ and $t+1$, and $\tau$ is the time interval between time $t$ and $t+1$. The statistical models in this package assume that
\[ \log\left(\frac{f_{t+1}^{(m)}}{f_{t}^{(m)}}\right) \sim \mathcal{N}\left( s^{(m)} - \bar{s}_t, \sigma^{(m)} \right),\]
where $\sigma^{(m)}$ is the inferred standard deviation for the model. This function generates samples out of this distribution to produce the posterior predictive checks.
Arguments
chain::MCMCChains.Chains
: Chain containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- mutant relative fitness values.
- population mean fitness values. NOTE: The number of columns containing population mean fitness values determines the number of datapoints where the ppc are evaluated.
- (log)-normal likelihood standard deviation.
n_ppc::Int
: Number of samples to generate per set of parameters.
Optional Arguments
param::Dict{Symbol, Symbol}
: Dictionary indicating the name of the variables
in the mcmc chain defining the following variables: - population_mean_fitness
: Common pattern in all population mean fitness variables. - population_std_fitness
: Common pattern in all standard deviations estimates for the likelihood.
model::Symbol=:lognormal
: Either:normal
or:lognormal
to indicate if the model used a normal or lognormal distribution for the likelihood. This is because when using a normal distribution, the nuisance parameters are sampled in log scale and need to be exponentiated.flatten::Bool=true
: Boolean indicating whether to flatten the output of multiple chain into a single column.
Returns
log(fₜ₊₁ / fₜ) = s⁽ᵐ⁾ - s̅ₜ::Array{Float64}
: Evaluation of the log frequency ratio posterior predictive checks at all times for each MCMC sample.
BarBay.stats.freq_bc_ppc
— Functionfreq_bc_ppc(df, n_ppc; kwargs)
Function to compute the posterior predictive checks for the barcode frequency for adaptive mutants.
Model
The functional form that connects the barcode frequency at time $t+1$ with the frequency at time $t$ is of the form
\[ f_{t+1}^{(m)} = f_{t}^{(m)} \exp\left[ \left( s^{(m)} - \bar{s}_t \right) \tau \right],\]
where $s^{(m)}$ is the mutant relative fitness, $\bar{s}_t$ is the population mean fitness between time $t$ and $t+1$, and $\tau$ is the time interval between time $t$ and $t+1$. The statistical models in this package assume that
\[ \log\left(\frac{f_{t+1}^{(m)}}{f_{t}^{(m)}}\right) \sim \mathcal{N}\left( s^{(m)} - \bar{s}_t, \sigma^{(m)} \right),\]
where $\sigma^{(m)}$ is the inferred standard deviation for the model. This function generates samples out of this distribution to produce the posterior predictive checks.
Arguments
df::DataFrames.DataFrame
: Dataframe containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- mutant relative fitness values.
- population mean fitness values. NOTE: The number of columns containing population mean fitness values determines the number of datapoints where the ppc are evaluated.
- (log)-normal likelihood standard deviation.
- mutant initial frequency.
n_ppc::Int
: Number of samples to generate per set of parameters.
Optional Arguments
param::Dict{Symbol, Symbol}
: Dictionary indicating the name of the variables in the mcmc chain defining the following variables::bc_mean_fitness
: Variable defining the inferred mutant fitness values⁽ᵐ⁾
.:bc_std_fitness
: Variable defining the standard defining the inferred standard deviation on the likelihood functionσ⁽ᵐ⁾
.bc_freq
: Variable defining the inferred initial frequency for the mutant.population_mean_fitness
: Common pattern in all population mean fitness variables.
model::Symbol=:lognormal
: Either:normal
or:lognormal
to indicate if the model used a normal or lognormal distribution for the likelihood. This is because when using a normal distribution, the nuisance parameters are sampled in log scale and need to be exponentiated.flatten::Bool=true
: Boolean indicating whether to flatten the output of multiple chain into a single column.
Returns
fₜ₊₁ = fₜ × exp(s⁽ᵐ⁾ - s̅ₜ)::Array{Float64}
: Evaluation of the frequency posterior predictive checks at all times for each MCMC sample.
freq_bc_ppc(chain, n_ppc; kwargs)
Function to compute the posterior predictive checks for the barcode frequency for adaptive mutants.
Model
The functional form that connects the barcode frequency at time $t+1$ with the frequency at time $t$ is of the form
\[ f_{t+1}^{(m)} = f_{t}^{(m)} \exp\left[ \left( s^{(m)} - \bar{s}_t \right) \tau \right],\]
where $s^{(m)}$ is the mutant relative fitness, $\bar{s}_t$ is the population mean fitness between time $t$ and $t+1$, and $\tau$ is the time interval between time $t$ and $t+1$. The statistical models in this package assume that
\[ \log\left(\frac{f_{t+1}^{(m)}}{f_{t}^{(m)}}\right) \sim \mathcal{N}\left( s^{(m)} - \bar{s}_t, \sigma^{(m)} \right),\]
where $\sigma^{(m)}$ is the inferred standard deviation for the model. This function generates samples out of this distribution to produce the posterior predictive checks.
Arguments
chain::MCMCChains.Chains
: Chain containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- mutant relative fitness values.
- population mean fitness values. NOTE: The number of columns containing population mean fitness values determines the number of datapoints where the ppc are evaluated.
- (log)-normal likelihood standard deviation.
- mutant initial frequency.
n_ppc::Int
: Number of samples to generate per set of parameters.
Optional Arguments
param::Dict{Symbol, Symbol}
: Dictionary indicating the name of the variables in the mcmc chain defining the following variables::bc_mean_fitness
: Variable defining the inferred mutant fitness values⁽ᵐ⁾
.:bc_std_fitness
: Variable defining the standard defining the inferred standard deviation on the likelihood functionσ⁽ᵐ⁾
.bc_freq
: Variable defining the inferred initial frequency for the mutant.population_mean_fitness
: Common pattern in all population mean fitness variables.
flatten::Bool=true
: Boolean indicating whether to flatten the output of multiple chain into a single column.
Returns
fₜ₊₁ = fₜ × exp(s⁽ᵐ⁾ - s̅ₜ)::Array{Float64}
: Evaluation of the frequency posterior predictive checks at all times for each MCMC sample.
Naive estimates
BarBay.stats.naive_prior
— Functionnaive_prior(data; kwargs)
Function to compute a naive set of parameters for the prior distributions of the population mean fitness s̲ₜ
values, the nuisance parameters in the log-likelihood functions for the frequency ratios logσ̲ₜ
, and the log of the Poisson parameters for the observation model logΛ̲̲
This function expects the data in a tidy format. This means that every row represents a single observation. For example, if we measure barcode i
in 4 different time points, each of these four measurements gets an individual row. Furthermore, measurements of barcode j
over time also get their own individual rows.
The DataFrame
must contain at least the following columns:
id_col
: Column identifying the ID of the barcode. This can the barcode sequence, for example.time_col
: Column defining the measurement time point.count_col
: Column with the raw barcode count.neutral_col
: Column indicating whether the barcode is from a neutral lineage
or not.
rep_col
: (Optional) For hierarchical models to be build with multiple experimental replicates, this column defines which observations belong to which replicate.
Arguments
data::DataFrames.AbstractDataFrame
: Tidy dataframe with the data to be
used to sample from the population mean fitness posterior distribution.
Optional Keyword Arguments
id_col::Symbol=:barcode
: Name of the column indata
containing the barcode identifier. The column may contain any type of entry.time_col::Symbol=:time
: Name of the column indata
defining the time point at which measurements were done. The column may contain any type of entry as long assort
will resulted in time-ordered names.count_col::Symbol=:count
: Name of the column indata
containing raw counts per barcode. The column must contain entries of type<: Int
.neutral_col::Symbol=:neutral
: Name of the column indata
defining whether the barcode belongs to a neutral lineage or not. The column must contain entries of typeBool
.rep_col::Union{Nothing,Symbol}=nothing
: (Optional) Column indicating the experimental replicates each point belongs to.pseudocount::Int=1
: Pseudo counts to add to raw counts to avoid dividing by zero. This is useful if some of the barcodes go extinct.rm_T0::Bool=false
: Optional argument to remove the first time point from the
inference. Commonly, the data from this first time point is of much lower quality. Therefore, removing this first time point might result in a better inference.
Returns
prior_params::Dict
: Dictionary with two entries:s_pop_prior
: Mean value of the population mean fitness. NOTE: This naive empirical method cannot make statements about the expected standard deviation of the population mean fitness. It is up to the researcher to determine this value.logσ_pop_prior
: Mean value on the nuisance parameter for the log-likelihood functions on the log-frequency ratios. In other words, the mean for the (log)-Normal distribution on the frequency ratios. NOTE: This naive empirical method cannot make statements about the expected standard deviation of the population mean fitness. It is up to the researcher to determine this value. NOTE: Typically, one can use the same estimate for both the neutral and the mutant lineages.logλ_prior
: Mean value of the nuisance parameter for the Poisson observation model parameter. NOTE: This naive empirical method cannot make statements about the expected standard deviation of the population mean fitness. It is up to the researcher to determine this value.
BarBay.stats.naive_fitness
— Functionnaive_fitness(data; id_col, time_col, count_col, neutral_col, pseudocount)
Function to compute a naive estimate of mutant fitness data based on counts. The fitness estimate is computed as
⟨log(f⁽ᵐ⁾ₜ₊₁ / f⁽ᵐ⁾ₜ) - log(f⁽ⁿ⁾ₜ₊₁ / f⁽ⁿ⁾ₜ))⟩ = s⁽ᵐ⁾
Arguments
data::DataFrames.AbstractDataFrame
: Tidy dataframe with the data to be
used to infer the fitness values on mutants. The DataFrame
must contain at least the following columns: - id_col
: Column identifying the ID of the barcode. This can the barcode sequence, for example. - time_col
: Column defining the measurement time point. - count_col
: Column with the raw barcode count. - neutral_col
: Column indicating whether the barcode is from a neutral lineage or not.
Optional Keyword Arguments
id_col::Symbol=:barcode
: Name of the column indata
containing the barcode identifier. The column may contain any type of entry.time_col::Symbol=:time
: Name of the column indata
defining the time point at which measurements were done. The column may contain any type of entry as long assort
will resulted in time-ordered names.count_col::Symbol=:count
: Name of the column indata
containing the raw barcode count. The column must contain entries of typeInt64
.neutral_col::Symbol=:neutral
: Name of the column indata
defining whether the barcode belongs to a neutral lineage or not. The column must contain entries of typeBool
.rm_T0::Bool=false
: Optional argument to remove the first time point from the
inference. Commonly, the data from this first time point is of much lower quality. Therefore, removing this first time point might result in a better inference.
pseudocount::Int=1
: Pseudo count number to add to all counts. This is useful to avoid divisions by zero.
Returns
DataFrames.DataFrame
: Data frame with two columns:id_col
: Column indicating the strain ID.fitness
: Naive fitness estimate.
Miscellaneous statistical functions
BarBay.stats.build_getq
— FunctionFunction to build a full-rank distribution to be used for ADVI optimization. The code in this function comes from (Turing.jl tutorial
)[https://turinglang.org/v0.28/tutorials/09-variational-inference/]
Arguments
dim::Int
: Dimensionality of parameter space.model::DynamicPPL.model
: Turing model to be fit using ADVI.
Returns
Initialized distribution to be used when fitting a full-rank variational model.
BarBay.stats.matrix_quantile_range
— Functionmatrixquantilerange(quantile, matrix; dim=2)
Function to compute the quantile ranges of matrix matrix
over dimension dim
.
For example, if quantile[1] = 0.95
, this function returns the 0.025
and 0.975
quantiles that capture 95 percent of the entries in the matrix.
Arguments
quantile::Vector{<:AbstractFloat}
: List of quantiles to extract from the posterior predictive checks.matrix::Matrix{<:Real}
: Array over which to compute quantile ranges.
Keyword Arguments
dim::Int=2
: Dimension over which to compute quantiles. Default is 2, i.e. columns.
Returns
qs
: Matrix with requested quantiles over specified dimension.