mcmc
The mcmc module contains the functions needed to fit the statistical model with a Markov Chain Monte Carlo (MCMC) sampling-based method. Given enough samples, MCMC converges asymptotically to the "true" posterior distribution. We therefore recommend trying this approach for a small number of barcodes (≈ 100-250). To scale the analysis to larger datasets, see the vi module for variational inference.
BarBay.mcmc.mcmc_sample - Method
Function to sample the joint posterior distribution for the fitness value of all mutant and neutral lineages given a time-series barcode count.
This function expects the data in a tidy format. This means that every row represents a single observation. For example, if we measure barcode i at 4 different time points, each of these four measurements gets an individual row. Furthermore, measurements of barcode j over time also get their own individual rows.
The DataFrame must contain at least the following columns:
- id_col: Column identifying the ID of the barcode. This can be the barcode sequence, for example.
- time_col: Column defining the measurement time point.
- count_col: Column with the raw barcode count.
- neutral_col: Column indicating whether the barcode is from a neutral lineage or not.
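
As a minimal sketch of the tidy layout described above, the following builds a small DataFrame using the default column names (barcode, time, count, neutral). The barcode labels and counts are invented purely for illustration.

```julia
using DataFrames

# Hypothetical toy data: 2 barcodes measured at 4 time points each.
# Every (barcode, time) pair gets its own row.
data = DataFrame(
    barcode = repeat(["bc_neutral", "bc_mutant"], inner=4),
    time    = repeat(1:4, outer=2),
    count   = [100, 95, 90, 88, 50, 70, 110, 160],   # raw counts (Int64)
    neutral = repeat([true, false], inner=4),        # Bool flag per row
)

# One row per observation: 2 barcodes × 4 time points
println(size(data))
```

Because each measurement sits on its own row, adding another barcode or time point only appends rows; the set of columns stays the same.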
Keyword Arguments
- data::DataFrames.AbstractDataFrame: Tidy dataframe with the data to be used to sample from the population mean fitness posterior distribution.
- n_walkers::Int: Number of walkers (chains) for the MCMC sample.
- n_steps::Int: Number of steps to take.
- outputname::String: String to be used to name the .jld2 output file.
- model::Function: Turing.jl model defining the posterior distribution from which to sample (see BarBay.model module). This function must take as the first four inputs the following:
  - R̲̲::Array{Int64}: 2D or 3D array containing the raw barcode counts for all tracked genotypes. The dimensions of this array represent:
    - dim=1: time.
    - dim=2: genotype.
    - dim=3 (optional): experimental repeats.
  - n̲ₜ::VecOrMat{Int64}: Array with the total number of barcode counts for each time point (on each experimental repeat, if necessary).
  - n_neutral::Int: Number of neutral lineages.
  - n_bc::Int: Number of mutant lineages.
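
To make the expected shapes concrete, here is a sketch of a count array with the documented layout (dim 1: time, dim 2: genotype, optional dim 3: experimental repeats) together with the matching totals per time point. The counts are made up, and the ASCII names R, n_t, R_rep, and n_t_rep are stand-ins chosen for this example. Since mcmc_sample takes the tidy data frame rather than these arrays, this is only meant to illustrate what the model function receives.

```julia
# Hypothetical counts: 4 time points (rows) × 3 genotypes (columns)
R = [
    100   50  20
     95   70  25
     90  110  40
     88  160  70
]

# Total barcode counts per time point: sum over genotypes (dim=2)
n_t = vec(sum(R, dims=2))

# With experimental repeats, counts are stacked along a third dimension.
# The second "repeat" here is just R shifted by one, for illustration only.
R_rep = cat(R, R .+ 1; dims=3)

# Totals then form a matrix: one column per experimental repeat
n_t_rep = dropdims(sum(R_rep, dims=2); dims=2)

println(size(R))        # (time points, genotypes)
println(size(R_rep))    # (time points, genotypes, repeats)
println(size(n_t_rep))  # (time points, repeats)
```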
Optional Keyword Arguments
- model_kwargs::Dict=Dict(): Extra keyword arguments to be passed to the model function.
- id_col::Symbol=:barcode: Name of the column in data containing the barcode identifier. The column may contain any type of entry.
- time_col::Symbol=:time: Name of the column in data defining the time point at which measurements were done. The column may contain any type of entry as long as sort results in time-ordered names.
- count_col::Symbol=:count: Name of the column in data containing the raw barcode count. The column must contain entries of type Int64.
- neutral_col::Symbol=:neutral: Name of the column in data defining whether the barcode belongs to a neutral lineage or not. The column must contain entries of type Bool.
- rep_col::Union{Nothing,Symbol}=nothing: Optional column in tidy dataframe to specify the experimental repeat for each observation.
- rm_T0::Bool=false: Optional argument to remove the first time point from the inference. Commonly, the data from this first time point is of much lower quality. Therefore, removing this first time point might result in a better inference.
- sampler::Turing.Inference.InferenceAlgorithm=Turing.NUTS(0.65): MCMC sampler to be used.
- ensemble::Turing.AbstractMCMC.AbstractMCMCEnsemble=Turing.MCMCSerial(): Sampling modality to be used. Options are:
  - Turing.MCMCSerial()
  - Turing.MCMCThreads()
  - Turing.MCMCDistributed()
- verbose::Bool=true: Boolean indicating if the function should print partial progress to the screen or not.
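
Putting the arguments together, a call might look like the sketch below. It uses only the keyword names documented above. The wrapper name run_mcmc, the output name, and the walker and step numbers are hypothetical. The model is left as a parameter because this section does not name a specific model function; it should be one of the Turing.jl models from the BarBay.model module.

```julia
import BarBay
import Turing

# Hypothetical helper wrapping the documented keyword arguments.
# `data` is a tidy DataFrame; `model` is a Turing.jl model function
# taken from the BarBay.model module (not specified in this section).
function run_mcmc(data, model)
    return BarBay.mcmc.mcmc_sample(
        data=data,
        n_walkers=4,                     # illustrative value
        n_steps=1_000,                   # illustrative value
        outputname="mcmc_output",        # used to name the .jld2 output file
        model=model,
        # Optional arguments
        rm_T0=false,                     # keep the first time point
        sampler=Turing.NUTS(0.65),       # documented default sampler
        ensemble=Turing.MCMCThreads(),   # one of the three documented options
        verbose=true,
    )
end
```

Note that Turing.MCMCThreads() only runs chains in parallel when Julia itself is started with more than one thread, for example with julia --threads 4.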