Chapter I

Introduction

From Bio to Bit: How do Cells Sense the World Around Them?

Introduction

In his classic 1944 book What is Life?, Schrödinger brought to the attention of the scientific community what he thought were two of the biggest challenges ahead of us if we were to understand living systems in the same way we understand the electromagnetic field or the universal law of gravitation  [1]. The idea that living organisms could be “accounted for” by physics and chemistry brought with it a new agenda for what needed to be done to transition from a qualitative, descriptive study of the phenomena of life to a quantitative, predictive science in the spirit of the physical sciences. Since the publication of the book, there has been enormous progress in our understanding of living systems from a first-principles perspective; nevertheless, 75 years later, Schrödinger’s questions are as relevant and as vibrant as ever  [2].

One of the defining features of living organisms at all scales is their capacity to gather information from the environment, encode an internal representation of the state of the environment, and generate a response based on this information-processing capacity. Researchers in the field of origins-of-life have gone as far as declaring that life emerged when chemical systems underwent a radical phase transition after which they were able to process and use information and free energy  [3]. So, although speculative, it is highly probable that the physical theory fulfilling Schrödinger’s vision of accounting for the phenomena of life will be the physics of systems capable of processing information  [4].

In this context, information does not refer to the colloquial notion of possessing practical knowledge about something. In this thesis, we use a precise mathematical definition of information  [5]. This formal definition makes information a metric worth quantifying and predicting in various biological contexts, as theoretical studies suggest that natural selection might act on the ability of an organism to process information  [6]. Thus, working out the physical details of how organisms sense the environment—that is, gather information about the state of the environment, encode such information in some shape or form within their physical boundaries, and take action based on this information—is at the core of state-of-the-art research in biophysics  [7].

The present thesis is an effort towards this vision of understanding biological systems as information processing machines. Our object of study will be gene regulation in bacteria. This particular system has been the subject of study for microbiologists and molecular biologists for decades, and we have come to learn a lot about the microscopic mechanistic details of how bacteria turn on and off their transcriptional machinery  [8]. In particular, we will focus on what we think of as the “hydrogen atom” of gene regulation—the so-called simple-repression motif (more on that in the next section). In physics, calling something the hydrogen atom of \(X\) means that for the area of study \(X\), this “something” represents a system simple enough to be amenable to analytical models that standard mathematical methods can solve, but rich enough to capture the general features of the phenomena. This simple genetic circuit will allow us to write tractable mathematical models to guide our experimental efforts with the ultimate goal of testing our understanding of such systems when predicting how much information a bacterium can gather from the environment using this genetic module.

Professional biophysicists might wish to skip the rest of this chapter as we will lay the foundations needed for the rest of our enterprise. We will introduce the basics of gene expression modeling and the mathematical concept of information and work through every single physical and mathematical prerequisite needed for the rest of the thesis. The following chapters are structured as follows: Chapter 2 builds on a decade of understanding this hydrogen atom of gene regulation and expands our models’ predictive ability by including the effect of environmental effectors. This means that we will consider how gene regulation is affected by the presence of an extracellular inducer molecule. Chapter 3 expands our predictive capacities even further by building a model capable of making predictions about the cell-to-cell variability inherent to all signaling systems working at the molecular scale. Chapter 4 serves as a Supporting Information section for Chapter 2, detailing every calculation and every inference. Likewise, Chapter 5 expands on Chapter 3, explaining every technical detail.

Gene Regulation as a Physics 101 Problem

As organisms navigate the challenges presented by the environment, they must constantly fight against the will of the second law of thermodynamics to bring them back to an equilibrium state. To face such challenges, cells are equipped with a toolkit of genes written in the language of A, T, C, and G of the genome. We can think of a typical bacterial genome with \(\approx 5\times 10^3\) genes as the blueprint to produce a repertoire of tools that allow cells to thrive under the myriad of circumstances they face throughout their lives. Given the vast number of challenges that organisms face, there is constant pressure on every living system to use the right tools for the right circumstances. From cells in the fly embryo expressing different genes that will define their identity in the animal’s final body plan, to a simple bacterium expressing the correct enzymes to process the available nutrients in the environment, all organisms are faced with the task of orchestrating the expression of the correct subset of genes at their disposal when trying to survive.

Our understanding of how organisms regulate their genes’ expression is still not as thorough as one might expect, given the effort that has gone into this question. Take, for example, E. coli—arguably the most well-characterized model organism—for which we know the regulatory scheme of less than 1/3 of its genes  [9]. For more complex organisms such as Drosophila, C. elegans, or even humans, we are even further from getting a holistic view of the regulatory landscape. Nevertheless, we would not be doing justice to the field’s significant advances if we were to pretend we are utterly ignorant about how gene regulation takes place in bacteria. There is a rich mechanistic understanding of how the transcriptional machinery takes the information contained in DNA and transcribes it into RNA  [8]. The relative simplicity of the process has inspired generations of biophysicists to try to write down minimal models that can describe and predict features of the process of gene regulation  [10–12].

These modeling efforts come in two main flavors: equilibrium statistical mechanical models and kinetic models. In the following sections, we will introduce the necessary background for both approaches relevant to the rest of the thesis.

Minimal Model of Gene Expression

Let us begin our introduction to gene expression modeling with the simplest example. As shown in Fig. 1(A), we imagine that a gene promoter (the region of the gene where transcriptional regulation takes place) produces mRNA at a constant rate \(r_m\). Each mRNA can stochastically decay with a rate \(\gamma_m\). Our interest is to understand how the mRNA count \(m\) changes over time, given these two competing processes. For that, let us write the mRNA count at time \(t + \Delta t\), \(m(t + \Delta t)\), where \(t\) is the time—which we are thinking of as being “right now”—and \(\Delta t\) is a little time step into the future. The mRNA count can then be predicted by computing \[ m(t + \Delta t) = m(t) + r_m \Delta t - (\gamma_m \Delta t) m(t), \label{eq:m_t_Delta_t} \] where we can think of \(r_m \Delta t\) as the probability of observing a single mRNA being produced in the time interval \([t, t + \Delta t]\) (\(\Delta t\) is so small that we neglect the possibility of seeing multiple mRNAs being produced), and \(\gamma_m \Delta t\) as the probability of seeing a single mRNA being degraded. But since each mRNA has the same probability of being degraded, the total number of mRNAs that we would see decay in this time window is the probability per mRNA times the total number of mRNAs. This is in contrast with the production of mRNA, which does not depend on the current number of mRNAs. If we send the term \(m(t)\) to the left-hand side of the equation and divide both sides by \(\Delta t\), we obtain \[ \frac{m(t + \Delta t) - m(t)}{\Delta t} = r_m - \gamma_m m(t). \] Upon taking the limit when \(\Delta t \rightarrow 0\), we see that the left-hand side is the definition of the derivative of the mRNA count with respect to time. We then obtain an ordinary differential equation of the form \[ \frac{dm}{dt} = r_m - \gamma_m m(t). \label{eq:dm_dt} \] Before even attempting to solve Eq. \(\ref{eq:dm_dt}\), we can perform a qualitative analysis of the dynamics  [13]. It is handy to plot the contribution of each of the components (production and degradation) to the derivative \(dm/dt\) as a function of \(m\). This is shown in Fig. 1(B), where the blue horizontal line \(r_m\) shows the production rate—which does not depend on \(m\)—and the red line shows the degradation term \(m\gamma_m\), which scales linearly with \(m\). Notice that we do not include the negative sign for the degradation term, i.e., we are not plotting \(-m \gamma_m\). The point \(m_{ss}\) where both lines intersect represents the point where production matches degradation. For all values less than \(m_{ss}\), the production term is larger than the degradation term, which means that for any value \(m < m_{ss}\), the derivative is positive (\(dm/dt > 0\)), so over time the system will accumulate more mRNA. The opposite is true for all values above \(m_{ss}\), where the degradation term is larger than the production term, implying that \(dm/dt < 0\). This means that for \(m > m_{ss}\), the system will tend to lose mRNA. These opposite trends tell us that \(m_{ss}\) is a stable fixed point of the dynamical system. This can schematically be seen at the bottom of Fig. 1(B). The arrowheads’ size indicates the system’s trend to move either left or right in \(m\). Since all arrows point at the special value \(m_{ss}\), we can say that any small perturbation of the system will be dissipated as the system relaxes back to \(m_{ss}\).

This qualitative statement can be confirmed by solving Eq. \(\ref{eq:dm_dt}\). Defining the initial condition \(m(t=0) = m_o\) and solving by separation of variables, we obtain a solution of the form \[ m(t) = m_o e^{-\gamma_m t} + \frac{r_m}{\gamma_m} (1 - e^{-\gamma_m t}). \] In the limit when \(t \rightarrow \infty\), we can see that the steady-state solution is given by \[ m_{ss} = \frac{r_m}{\gamma_m}. \] Fig. 1(C) shows the time evolution of \(m\) for different initial values \(m_o\). We can see that, indeed, regardless of the initial mRNA count, the system relaxes exponentially to \(m_{ss} = r_m / \gamma_m\).

Figure 1: Minimal model of gene expression. (A) Schematic of the kinetics governing gene expression. mRNA is produced at a constant rate r_m independent of the current mRNA copy number. Degradation of each mRNA occurs at a rate \gamma_m. (B) Example of the qualitative analysis of the mRNA dynamics via a 1D phase-portrait. The differential equation governing the dynamics contains two terms: a constant production rate given by r_m, and a degradation rate \gamma_m m, which depends on the current mRNA count. The main plot shows each of the components in the m vs. dm/dt plane. Since r_m does not depend on the current number of mRNAs, the production rate appears as a horizontal line as a function of m. The total degradation rate depends linearly on the mRNA copy number, giving a line with slope \gamma_m. Where the two components are equal (both lines crossing), we obtain the steady-state mRNA value m_{ss}. The bottom line shows a qualitative schematic of the flow of the system towards this steady state. The further m is from m_{ss}, the faster it moves towards this point, as schematized by the arrows’ size. (C) Example of mRNA dynamics for different initial conditions. Over time, all curves converge to the steady-state mRNA value m_{ss}=r_m/\gamma_m. For this plot, \gamma_m = 1 and r_m/\gamma_m = 10. The Python code (ch1_fig01C.py) used to generate part (C) of this figure can be found on the thesis GitHub repository.
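To make the connection between the analytical solution and a plot like Fig. 1(C) concrete, below is a minimal Python sketch of how such curves could be generated. It is written in the spirit of ch1_fig01C.py, whose exact contents are not reproduced here; the parameter values and variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative parameter values (same ratio as quoted in the Fig. 1 caption)
gamma_m = 1.0          # mRNA degradation rate (1/time)
r_m = 10.0 * gamma_m   # mRNA production rate, chosen so r_m / gamma_m = 10

def m_t(t, m_0, r_m, gamma_m):
    """Analytical solution of dm/dt = r_m - gamma_m * m with m(0) = m_0."""
    return m_0 * np.exp(-gamma_m * t) + (r_m / gamma_m) * (1 - np.exp(-gamma_m * t))

t = np.linspace(0, 6, 200)          # time in units of 1 / gamma_m
for m_0 in [0, 5, 10, 15, 20]:      # different initial mRNA counts
    plt.plot(t, m_t(t, m_0, r_m, gamma_m), label=f"$m_o = {m_0}$")

plt.axhline(r_m / gamma_m, linestyle="--", color="gray")  # steady-state value m_ss
plt.xlabel(r"time $\gamma_m t$")
plt.ylabel("mRNA count $m(t)$")
plt.legend()
plt.show()
```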

So far, our model assumes a simple constant transcription rate \(r_m\); let us expand this term a little further to include regulation by a transcriptional repressor further down the road. We know that for a transcriptional event to occur, the RNA polymerase (RNAP) must bind to the promoter region and undergo a series of irreversible steps, such as opening the double helix to initiate the copying of the DNA sequence into mRNA  [8]. But before these irreversible steps take place, there is a chance that the RNAP falls off the promoter. If we assume that these irreversible steps take place on a much longer timescale compared to the initial binding and unbinding of the RNAP on the promoter, we can separate the timescales and investigate these processes independently. In particular, we can write that mRNA production happens at a rate \[ \text{mRNA production} = r_m \cdot p_{\text{bound}}, \label{eq:mRNA_prod} \] where we split the original production term into two steps: \(p_{\text{bound}}\), the probability of finding an RNAP bound to the promoter, and \(r_m\), which captures all of the irreversible downstream steps that take place once the RNAP is engaged in a transcriptional event. A way to think about it—relevant to what I am doing right now as I type my thesis—is that the speed at which I type this document has to do with two things: the probability of me being actively working on these notes times the rate at which I type these notes once I engage in the activity. The reason this separation makes sense is that we can include the effect of the regulation by a transcriptional repressor as a reduction of the time (the probability) that the RNAP can be bound to the promoter. Furthermore, since we are assuming that the binding and unbinding of the RNAP happen at a timescale much faster than the downstream events, we can assume that this binding reaction is in quasi-equilibrium, for which we can use the powerful theoretical framework of statistical mechanics. Let us now delve into the basics of this physical theory.

The Unreasonable Effectiveness of Unrealistic Simplifications

In the preface of the textbook Molecular Driving Forces, Dill and Bromberg introduce the idea of Statistical Mechanics as the unreasonable effectiveness of unrealistic simplifications  [14]. Although one could make the case that all of physics follows this description, it is undoubtedly evident that statistical mechanics is a vivid example of how simple ideas can have profound consequences. Statistical mechanics can be defined as the theory that, upon assuming the atomic nature of matter, explains the phenomenology that classical thermodynamics established from the interactions of the microscopic components of a system  [14]. As with any other physical theory, statistical mechanics is built from a set of empirical facts that define “axioms” that we take to be true. In other words, as Feynman famously described to us: if we want to come up with a new law of nature, there is a simple recipe that we must follow:

  1. We guess the law. Literally. The most profound understanding of our physical reality we have comes from educated guesses made after a careful observation of nature.

  2. We compute the consequences of such a guess. That is why mathematical theories allow us to sharpen our statements about how we think nature works.

  3. We compare with experiments/observations. The scientific revolution came about when, after the dark ages, we finally learned it was okay to say “we don’t know.”

In such a simple statement, Feynman tells us, lies the key to science  [15]. For our purpose of understanding the basis of statistical mechanics, we will argue that Boltzmann’s law gives the main law upon which the field is founded \[ \frac{P(E_1)}{P(E_2)} = \frac{e^{-E_1 / k_BT}}{e^{-E_2 / k_BT}}. \label{eq:boltzmann_law} \] Let us unpack this equation. The main idea behind statistical mechanics is that macroscopic observables (temperature and pressure in classic examples) are emergent properties dictated by the dynamics of the system’s microscopic components. What Boltzmann’s law tells us is that the relative probability of finding a system in thermal equilibrium in a particular microstate with energy \(E_1\) compared to a microstate with energy \(E_2\) is given by an exponential function of the negative energy of each microstate relative to the thermal energy \(k_BT\). The minus sign in the exponent reflects the convention in physics that more favorable states have more negative energies; a microstate with a large negative energy therefore has a large Boltzmann factor and, consequently, a high probability. To provide concrete examples of what a microstate can look like, Fig. 2(A) shows three molecular systems relevant to biology. In the first example, we have the classic ligand-receptor binding problem; here, we imagine that a solution can be discretized in space into a series of small boxes. In each of these boxes, one and only one ligand molecule can fit. In principle, we can list all possible spatial arrangements of ligands. We could then calculate the relative likelihood of finding the system in any of these configurations as long as we can assign an energy value to each of them. The second example focuses on ligand-gated ion channels. In this particular system, we care about the ion channel’s state—either open or closed—and the ligands’ binding configuration. If the channel responds to the ligand concentration by changing its probability of gating, we can calculate this probability using equilibrium statistical mechanics. Finally, the third example shows different configurations of a small patch of the cell membrane. All deformations of a membrane have energetic costs associated with them. By listing all possible membrane configurations, we can calculate the most likely shape of a membrane given the forces and stresses acting on it.

The macroscopic states that we observe can then be thought of as a coarse-graining of many microstates into a single macrostate. For example, in the ligand-receptor binding case, we rarely would care about the specific position of all the ligand molecules in the solution. What we would be interested in is whether or not the ligand is bound to the receptor. We can therefore define as our “macrostate” the particular configuration of the receptor as schematically shown in Fig. 2(B).

Figure 2: Boltzmann’s law and the definition of a micro- and macrostate. (A) Top panel: ligand-receptor binding microstates. Middle panel: ligand-gated ion channel microstates. Bottom panel: membrane patch deformations. (B) Schematic of the definition of a “macrostate.” In the ligand-receptor binding problem, we ignore all ligand molecules’ spatial configuration and focus on the receptor’s binding state.

If we want to know the likelihood of finding a particular system in any specific configuration, Boltzmann’s law (Eq. \(\ref{eq:boltzmann_law}\)) then tells us the protocol we must follow:

  1. Enumerate all possible microstates in which the system can be found.

  2. Compute the energy of each of these microstates.

  3. Define the “macrostate” we care about by grouping all microstates that belong to the same energy.

  4. Compute the Boltzmann factor. This factor, sometimes called the Boltzmann weight, is defined as the exponential of the negative energy divided by the thermal energy, as indicated in Eq. \(\ref{eq:boltzmann_law}\).

To see this protocol in action, let us apply it to the calculation of \(p_{\text{bound}}\), the probability of finding an RNAP bound to the promoter. We will go through each of the protocol steps and build up the “unrealistic simplifications” that will allow us to make this calculation.

1. Enumerate possible microstates. We begin by making a drastic coarse-graining of the bacterial genome. For us, a genome is simply made out of boxes where the RNAP can bind. We imagine that there is a single site where RNAP can bind specifically—the promoter of interest. There are also \(N_{NS} \approx 5\times 10^6\) non-specific binding sites, one per basepair (bp) in the genome. This means that because of the sequence-dependent interactions between the RNAP molecule and the DNA, the energy associated with specific binding to the gene promoter is more favorable than the rest of the genome. We ignore the fact that the RNAP footprint where it binds to the genome is roughly 30 bp. This assumption is valid if the number of available RNAP molecules is much smaller than the number of non-specific binding sites, since it is improbable that two RNAPs would fall next to each other by pure chance. A useful analogy for this point is to think about seating \(\sim \text{few}\times 10\) people in a large stadium with \(\sim 10^4\) seats. If the seats are chosen randomly, we do not need to worry about doing the sampling “without replacement” because the chances of two people ending up with the same seat number are negligible. We also ignore the possibility of RNAP not being bound to the genome. This assumption is supported by experimental evidence on a particular type of E. coli mutant that sheds lipid vesicles without segregating DNA into such vesicles. Mass spectrometry analysis of these “minicells” has shown that there are no RNAP molecules to be found, implying that RNAPs are bound to DNA most, if not all, of the time  [11]. The exercise then consists of randomly choosing one box for each of the \(P\) polymerases available to bind. Fig. 3 shows in the first column two possible configurations of our coarse-grained genome.

2. Compute the energy for each microstate. Let us analyze the case where all \(P\) RNAP molecules are bound non-specifically to the genome. For simplicity, we assume that RNAP binds to all \(N_{NS}\) non-specific binding sites with the same affinity. We assign this energy to be \(\varepsilon_P^{(NS)}\). This assumption could be relaxed, and we could assign instead a distribution of non-specific binding energies, as explored in  [16]. But for now, we do not have to worry about this complication. For the statistical mechanics’ protocol, the assignment of binding energies does not come from some quantum first-principles calculation or anything similar. We label the interaction of the RNAP and the rest of the genome with a single value, \(\varepsilon_P^{(NS)}\), that coarse-grains all of the hydrogen bonds and other effects that go into this physical process and gives an average energy. The calculation continues with this “labeled energy,” and, as we will see at the end, a very clean functional form emerges. Since we have \(P\) such polymerases bound non-specifically, the energy of any state with a similar configuration is then \(P \varepsilon_P^{(NS)}\), as shown in Fig. 3, second column, top row.

3. Define the “macrostate” we care about. In a sense, when we speak about a macrostate, it does not necessarily mean something that we can macroscopically observe. What it means is that we group a bunch of states that we take to be functionally equivalent, as shown in Fig. 2(B). In our case, we only care about whether or not the RNAP is bound to our promoter of interest. The configuration of the rest of the background sites is irrelevant to our question. What this means in practice is that we must compute the degeneracy or multiplicity of our state. In other words, for the specific state shown in the first column/top row of Fig. 3, we know its Boltzmann weight. Eq. \(\ref{eq:boltzmann_law}\) tells us that the probability of this particular configuration takes the form \[ P_{\text{state}} \propto e^{-\beta P \varepsilon_P^{(NS)}}, \] where we define \(\beta \equiv (k_BT)^{-1}\). The probability of this binding configuration takes this form since the \(P\) RNAP molecules are bound non-specifically. But every single arrangement in which all RNAPs are bound non-specifically has the same Boltzmann weight. The question then becomes: in how many such microstates can the system exist? This is a combinatorics question of the form: in how many different ways can I arrange \(P\) molecules into \(N_{NS}\) boxes? The answer, of course, is \[ \text{\# states with all RNAPs bound non-specifically} = \frac{N_{NS}!}{P!(N_{NS} - P)!}, \] as shown in the third column of Fig. 3. This multiplicity can be simplified if we consider that \(N_{NS} \gg P\). To more easily visualize how to simplify this, let us for a second assume \(N_{NS} = 100\) and \(P = 3\). Given the definition of factorials, this means that \[ \frac{N_{NS}!}{(N_{NS} - P)!} = \frac{100\cdot 99\cdot 98\cdot 97 \cdots 2\cdot 1}{97\cdot 96 \cdots 2\cdot 1} = 100\cdot 99\cdot 98. \] Given this result, we can simply state that \(100\cdot 99\cdot 98 \approx 100^3\), only making a three percent error (\(100\cdot 99\cdot 98 / 100^3 \approx 0.97\)). Imagine now that \(N_{NS}\) is on the order of \(10^6\). Then the error would become negligible. That is why, as shown in the third column of Fig. 3, we can approximate \[ \frac{N_{NS}!}{P!(N_{NS} - P)!} \approx \frac{N_{NS}^P}{P!}, \; \text{for }N_{NS} \gg P. \] For our other “macrostate,” we have the case where one out of the \(P\) RNAPs is bound specifically to the promoter. We define the energy of this single RNAP specifically binding to the promoter as \(\varepsilon_{P}^{(S)}\). We assume that the other \(P-1\) RNAPs are bound non-specifically with the usual energy \(\varepsilon_{P}^{(NS)}\). The number of ways to realize this state is then given by \[ \small \text{\# states with one RNAP bound specifically} = \frac{N_{NS}!}{(P - 1)!(N_{NS} - (P - 1))!} \approx \frac{N_{NS}^{P-1}}{(P-1)!}. \] What these Boltzmann weights mean is that, for us, any state in which a single RNAP is bound to the promoter while the rest are bound non-specifically is equivalent. Therefore the probability of finding the promoter occupied by an RNAP would be of the form \[ p_{\text{bound}} \propto e^{-\beta \epsilon_1} + e^{-\beta \epsilon_2} + e^{-\beta \epsilon_3} + \cdots \] where \(\epsilon_i\) is the energy of the \(i^{\text{th}}\) state that has a single RNAP bound to the promoter. But we established that all of the \(\epsilon_i\) energies are the same.
So instead of writing this long sum, we multiply the Boltzmann weight of a single state by the number of states with equivalent energy, i.e., we multiply it by the state’s multiplicity or degeneracy. The same logic applies for the states where none of the RNAPs are specifically bound to the promoter.

4. Compute the Boltzmann factor. The last step in the protocol is to follow the recipe indicated by Eq. \(\ref{eq:boltzmann_law}\). We exponentiate the energy, with the caveat we mentioned on the last point that this time we multiply by the multiplicity that we just computed. This is because we are lumping together all microstates into a single functional macrostate. So the Boltzmann weight for the unbound \(\rho_{\text{unbound}}\) macrostate is given by \[ \rho_{\text{unbound}} = \frac{N_{NS}^P}{P!} e^{-\beta P \varepsilon_P^{(NS)}}. \] For the bound state, we have \[ \rho_{\text{bound}} = \frac{N_{NS}^{P-1}}{(P-1)!} e^{-\beta \left(\varepsilon_P^{(S)} + (P - 1) \varepsilon_P^{(NS)}\right)}. \] For reasons that will become clear later in this chapter once we work with the entropy and derive the Boltzmann distribution, we know that to compute the probability of a specific microstate (or a macrostate), we simply take the Boltzmann weight of the microstate and divide by the sum of all of the other Boltzmann weights of the states available to the system. This sum of Boltzmann weights plays a very special role in statistical mechanics, and it is known as the partition function of the system. Therefore, to calculate \(p_{\text{bound}}\), we compute \[ p_{\text{bound}} = \frac{\rho_{\text{bound}}}{\rho_{\text{unbound}} + \rho_{\text{bound}}}. \] Substituting the Boltzmann weights we derived, we find \[ p_{\text{bound}} = \frac{ \frac{N_{NS}^{P-1}}{(P-1)!} e^{-\beta \left(\varepsilon_P^{(S)} + (P - 1) \varepsilon_P^{(NS)}\right)} }{ \frac{N_{NS}^{P-1}}{(P-1)!} e^{-\beta \left(\varepsilon_P^{(S)} + (P - 1) \varepsilon_P^{(NS)}\right)} + \frac{N_{NS}^P}{P!} e^{-\beta P \varepsilon_P^{(NS)}} }, \] an algebraic nightmare. We can simplify this expression enormously by multiplying the numerator and denominator by \(\rho_{\text{unbound}}^{-1}\). Upon simplification, we find the neat expression \[ p_{\text{bound}} = \frac{ \frac{P}{N_{NS}} e^{-\beta \Delta \varepsilon_P} }{ 1 + \frac{P}{N_{NS}} e^{-\beta \Delta \varepsilon_P} }, \label{eq:pbound_unreg} \] where \(\Delta\varepsilon_P \equiv \varepsilon_P^{(S)} - \varepsilon_P^{(NS)}\). This simple expression, known as the Langmuir isothermal binding curve, tells us that the more RNAPs available (larger \(P\)), or the stronger the promoter is (more negative \(\Delta\varepsilon_P\)), the more likely it is to find the promoter bound by an RNAP, and according to Eq. \(\ref{eq:mRNA_prod}\), the higher the mRNA production. In the next section, we connect this model to experimental measurements.
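To get a feel for the numbers in Eq. \(\ref{eq:pbound_unreg}\), the short Python sketch below evaluates \(p_{\text{bound}}\) for a few RNAP copy numbers and binding energies. The parameter values are illustrative order-of-magnitude choices, not measurements.

```python
import numpy as np

def p_bound(P, delta_eps_p, N_NS=5e6):
    """Probability of finding RNAP bound to the promoter (Eq. pbound_unreg).
    P: RNAP copy number; delta_eps_p: binding energy difference in k_BT units."""
    weight = (P / N_NS) * np.exp(-delta_eps_p)
    return weight / (1 + weight)

# Illustrative values: a few RNAP copy numbers and promoter binding energies (k_BT)
for delta_eps_p in [-2, -5, -8]:
    probs = [round(p_bound(P, delta_eps_p), 4) for P in [1e2, 1e3, 1e4]]
    print(f"Δε_P = {delta_eps_p} k_BT, P = 10^2, 10^3, 10^4:", probs)
```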

Figure 3: Statistical mechanics protocol for RNAP binding. On a discretized genome, we follow the statistical mechanics’ protocol to compute the Boltzmann weight of each of the relevant microstates. The P available RNAPs are assumed to have two binding configurations: one specific binding to the promoter of interest (with energy \varepsilon_P^{(S)}) and another non-specific to any of the N_{NS} non-specific binding sites (with energy \varepsilon_P^{(NS)}).

Figure 1 Theory in Gene Regulation

We began this section with a simple model for the dynamics of mRNA production and degradation. We then expanded our model to deconvolve the production term into the rate at which mRNA is produced by RNAP, and the probability of finding such RNAP bound to the promoter. To calculate this probability, we used the statistical mechanics’ protocol, which culminated in Eq. \(\ref{eq:pbound_unreg}\). So far, we are missing two important steps in our logical construction that will lead us to specific quantitative predictions that we can test experimentally:

  1. The inclusion of a regulatory scheme via a transcriptional repressor.

  2. The connection of the model with experimentally accessible quantities.

As hinted at earlier, for a transcriptional repressor, we imagine that the repressor’s effect on the regulation of the gene acts only through changes in \(p_{\text{bound}}\). To include the regulation, we add a series of microstates. Rather than having only \(P\) RNAP molecules to bind the genome, we also have \(R\) repressors that can bind specifically and non-specifically. Through the same statistical mechanics’ protocol as for the previous case, we can arrive at the Boltzmann weights shown for the three “macrostates” in Fig. 4(A). For the regulated case, we have that the probability of the promoter being bound by an RNAP takes the form \[ p_{\text{bound}}(R > 0) = \frac{ \frac{P}{N_{NS}} e^{-\beta \Delta \varepsilon_P} }{ 1 + \frac{P}{N_{NS}} e^{-\beta \Delta \varepsilon_P} + \frac{R}{N_{NS}} e^{-\beta \Delta \varepsilon_R} }, \label{eq:pbound_reg} \] where \(\Delta\varepsilon_R\) is the binding energy difference between the repressor binding to a specific binding site and a non-specific one. Although exciting and insightful, the quantities we have derived so far do not have an immediate quantitative prediction we can connect with experimental measurements. For example, for the regulated case, the steady-state mRNA count takes the form \[ m_{ss}(R > 0) = \frac{r_m}{\gamma_m} p_{\text{bound}}(R > 0) = \frac{r_m}{\gamma_m} \frac{ \frac{P}{N_{NS}} e^{-\beta \Delta \varepsilon_P} }{ 1 + \frac{P}{N_{NS}} e^{-\beta \Delta \varepsilon_P} + \frac{R}{N_{NS}} e^{-\beta \Delta \varepsilon_R} }. \] Determining \(r_m\) or \(\gamma_m\) directly from experiments, although possible, represents an enormous technical challenge. A convenient metric we can use instead is what we call the fold-change in gene expression. Fig. 4(B) shows a schematic representation of what we mean by the fold-change. This ratiometric quantity normalizes the expression level of a gene regulated by a transcriptional repressor by the expression level of the same gene in the absence of regulation—via a knock-out of the repressor gene, for example. Mathematically this is defined as \[ \text{fold-change} \equiv \frac{m_{ss}(R > 0)}{m_{ss}(R = 0)}. \] This expression is convenient because upon taking the ratio of these steady-state mRNA counts, the ratio \(r_m / \gamma_m\) drops out of the equation. All we are left with is then the ratio of the \(p_{\text{bound}}\)s \[ \text{fold-change} = \frac{p_{\text{bound}}(R > 0)}{p_{\text{bound}}(R = 0)}. \] Substituting Eqs. \(\ref{eq:pbound_unreg}\) and \(\ref{eq:pbound_reg}\) results in \[ \text{fold-change} = \frac{ 1 + \frac{P}{N_{NS}} e^{-\beta \Delta \varepsilon_P} }{ 1 + \frac{P}{N_{NS}} e^{-\beta \Delta \varepsilon_P} + \frac{R}{N_{NS}} e^{-\beta \Delta \varepsilon_R} }. \] We can now appeal to some experimental understanding of the bacterial proteome composition  [17–19]. The RNAP copy number in E. coli is on the order of \(P \sim 10^3-10^4\)  [19]. The binding affinity of these promoters is on the order of \(\Delta\varepsilon_P \sim -2\pm 1\;k_BT\)  [11]. Along with the value of \(N_{NS}\sim 10^6\), this results in \[ \frac{P}{N_{NS}} e^{-\beta \Delta \varepsilon_P} \approx \frac{10^3}{10^6}e^{2.3} \approx \frac{10^3 \cdot 10}{10^6} \approx 10^{-2} \ll 1, \] the so-called weak-promoter approximation. For the repressor, most repressors in E. coli are present in copy numbers on the order of \(R \sim 10\)  [18]. Their binding affinities take values around \(\Delta\varepsilon_R \sim -15 \pm 5\; k_BT\)  [11].
These numerical values then give \[ \frac{R}{N_{NS}} e^{-\beta \Delta \varepsilon_R} \approx \frac{10}{10^6}e^{15} \approx \frac{10 \cdot 10^6}{10^6} \approx 10. \] Implementing the weak-promoter approximation, we can justify simplifying the fold-change equation to take the form \[ \text{fold-change} \approx \left( 1 + \frac{R}{N_{NS}} e^{-\beta\Delta\varepsilon_R} \right)^{-1}. \label{eq:fc} \] As shown in Fig. 4(C), this expression points directly at two experimental knobs that we can tune using molecular biology. We can modify the number of repressors by changing the ribosomal binding site (RBS) sequence of the repressor gene  [20]. This works because the ribosome initiates translation of an mRNA in a sequence-dependent manner at a specific region of the gene known as the RBS, so mutating this sequence changes the amount of protein produced per transcript  [21]. Furthermore, we can change the repressor’s affinity for its binding site by mutating the binding site itself  [20]. Fig. 4(D) shows predictions of Eq. \(\ref{eq:fc}\) for different binding energies.
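As an illustration of how predictions like those in Fig. 4(D) can be computed from Eq. \(\ref{eq:fc}\), here is a minimal Python sketch in the spirit of ch1_fig04D.py (whose exact contents are assumed); the binding energies chosen are representative values, not fits.

```python
import numpy as np
import matplotlib.pyplot as plt

def fold_change(R, delta_eps_r, N_NS=5e6):
    """Fold-change under the weak-promoter approximation (Eq. fc).
    R: repressor copy number; delta_eps_r: binding energy in k_BT units."""
    return 1 / (1 + (R / N_NS) * np.exp(-delta_eps_r))

R = np.logspace(0, 4, 200)  # repressor copy numbers from 1 to 10^4
for delta_eps_r in [-10, -13, -15, -17]:  # representative binding energies (k_BT)
    plt.loglog(R, fold_change(R, delta_eps_r),
               label=rf"$\Delta\varepsilon_R = {delta_eps_r}\; k_BT$")

plt.xlabel("repressor copy number $R$")
plt.ylabel("fold-change")
plt.legend()
plt.show()
```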

The model and the predictions presented here were worked out by Garcia and Phillips in a classic publication in 2011  [20]. In the next chapter, we build upon this theoretical scaffold to expand the predictive power of the model by including the allosteric nature of the transcription factor that allows the cells to change their genetic program upon the presence of an external molecule as a response to the environment.

Figure 4: Figure 1 theory in transcriptional regulation. (A) States and (normalized) weights for the simple repression motif. The promoter can be found in three states: 1) empty, 2) bound by an RNAP, 3) bound by a repressor. The same statistical mechanics’ protocol as in Fig. 3 can be used to derive the weights. (B) Schematic of the experimental determination of the fold-change in gene expression. The expression level of a regulated strain is normalized by the expression level of a strain with a knock-out of the repressor gene. (C) Experimentally accessible knobs predicted from the theoretical model. The number of transcription factors can be tuned by changing the amount of protein produced per mRNA. The binding energy of the repressor can be tuned by mutating the basepairs in the binding site. (D) Fold-change as a function of the repressor copy number for different binding energies. The Python code (ch1_fig04D.py) used to generate part (D) of this figure can be found on the thesis GitHub repository.

All Cells Are Equal, But Some are More Equal than Others

One of the great discoveries that came from the single-cell biology revolution—in which we began to measure individual cellular behavior rather than bulk observations—was that of the intrinsic cell-to-cell variability in many aspects of biology, gene expression being the canonical example  [22]. This means that two cells with the same genome exposed to the same conditions will not express the same number of mRNAs and proteins of any specific gene. From a statistical physics perspective, this is not entirely “surprising” since we know that a system can be found in many different microstates, as described in Fig. 2(A). What is different here is that a cell does not have an Avogadro’s number of mRNAs (or, for that matter, of anything) in it, making these fluctuations more relevant. If we think of fluctuations scaling as \(\sqrt{N}\), that means that for an \(N\) of \(\approx\) ten molecules or so, these variations can be significant in terms of the downstream cellular behavior. Cells have to cope with these physical limitations on precision, many times generating systems to actively buffer as much of the “noise” as possible  [23], other times using this intrinsic variability to their advantage  [24].

Figure 5: Chemical master equation in gene regulation. (A-B) Different points of view to understand the chemical master equation. (A) From the “particle” point of view, we imagine following the time trajectory of a single cell. The probability P(m, t) of finding a cell with m mRNAs at time t is then proportional to the time this cell spent with this number of molecules. (B) In the occupation number point of view, we imagine observing a large number of isogenic cells (different colors represent the individuality of each cell). The probability P(m, t) is then interpreted as the fraction of cells with exactly that copy number at time t. (C) Chemical master equations mathematize the idea of Markov processes. For the case of the unregulated promoter, the Markov process consists of an infinite chain of discrete states that cells can transition between by producing or degrading mRNAs. (D) Spread-the-butter idea. Since probability is conserved, the central bar’s height changes slightly via the in- and outflow of probability mass from the contiguous bins. The Python code (ch1_fig05A.py) used to generate the plot in part (A) of this figure can be found on the thesis GitHub repository.

The central assumption behind the thermodynamic models of gene regulation that we studied in the last section is that the gene expression is proportional to the probability of finding an RNAP bound to the promoter  [11,25]. A consequence of this construction is that the probability space—the set of all possible events captured by the distribution—only looks at the state of the promoter itself, not at the state of the mRNA copy number. That is why thermodynamic models of this kind do not speak to the intrinsic cell-to-cell variability. For this, we need to use the so-called chemical master equation framework  [26]. There are two ways of thinking about the chemical master equation:

  1. The “particle” point of view.

  2. The occupation number point of view.

Depending on the context, we might want to use either of these approaches to write down the master equation for our problem of interest. Let us look into these two different ways of interpreting a master equation using our example of a cell producing mRNA. For the particle point of view, schematized in Fig. 5(A), we imagine following the mRNA copy number \(m\) of a single cell. The number of mRNAs in the cell changes stochastically from time point to time point. On the one hand, there can be a transcriptional event that increases the number of mRNAs, and on the other hand, an mRNA can be degraded, decreasing the number of mRNAs. If we imagine tracking this cell for a very long time, we can quantify the fraction of the time that the cell spent with zero mRNAs, one, two, and so on, and from that, build the probability distribution \(P(m, t)\) of having \(m\) mRNAs at time \(t\) (there is a subtle point here of the process being memoryless, but we will ignore this detail). The occupation number point of view, schematized in Fig. 5(B), takes a different perspective. For this, we imagine tracking not one but many cells simultaneously. Each cell can either produce or degrade an mRNA in a short time window, changing its total individual count. The probability \(P(m, t)\) is then built by counting how many cells out of the total have \(m\) mRNAs.

Regardless of how we think about the chemical master equation, both of these perspectives describe a Markov process. These are stochastic processes in which a system transitions between different states, and the transitions are governed only by the transition rates between states and the current state of the system. In other words, a Markov process keeps no track of the states it previously visited; the only factors that determine where the system is going to head are its current state and the transition rates out of such state—that is why these are considered memoryless processes. Fig. 5(C) shows a schematic of what a Markov process looks like. The schematic of the unregulated promoter indicates that there are two possible reactions: mRNA production with rate \(r_m\) and degradation with rate \(\gamma_m\). The Markov process for this simple model can then be represented as a series of nodes (representing the mRNA counts) connected with bi-directional arrows (representing the transition rates between states), indicating that transitions can only take place between contiguous states.
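The “particle” point of view of Fig. 5(A) can be simulated directly with the Gillespie (stochastic simulation) algorithm. Below is a minimal Python sketch of a single-cell mRNA trajectory for the unregulated promoter; it is meant only as an illustration of the idea, and the parameter values and script structure are assumptions, not taken from ch1_fig05A.py.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative rates (arbitrary time units)
r_m = 10.0      # mRNA production rate
gamma_m = 1.0   # mRNA degradation rate per molecule

def gillespie_trajectory(t_final, m0=0):
    """Single stochastic trajectory of the birth-death process for mRNA."""
    t, m = 0.0, m0
    times, counts = [t], [m]
    while t < t_final:
        prod, deg = r_m, gamma_m * m          # propensities of the two reactions
        total = prod + deg
        t += rng.exponential(1 / total)       # waiting time to the next reaction
        if rng.uniform() < prod / total:      # choose which reaction fired
            m += 1
        else:
            m -= 1
        times.append(t)
        counts.append(m)
    return np.array(times), np.array(counts)

times, counts = gillespie_trajectory(t_final=20)
print("final mRNA count:", counts[-1],
      "| mean over trajectory:", counts.mean().round(2))
```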

In practice, the way we write down a chemical master equation is by a process christened by Professor Jané Kondev as “spread-the-butter.” The idea of spread-the-butter is that some probability mass (the analog of the butter) is to be spread over the range of possible values (the equivalent of the toast), where probability mass migrates in and out of a particular bin while the total amount of probability still adds up to one. The best way to explain this concept is by following the schematic in Fig. 5(D) and going through the math. Let us imagine we are keeping track of a particular mRNA value \(m\)—the chemical master equation is, in reality, a system of many coupled equations, one for each mRNA count. We want to write down an equation that describes the probability of finding a cell with this particular count a small time window into the future, \(P(m, t + \Delta t)\), where \(t\) represents the time “right now,” and \(\Delta t\) is a tiny time increment. The master equation is nothing more than a checks-and-balances notebook to keep track of all the flow of probability mass in and out of the bin we are interested in, as shown in Fig. 5(D). Informally, we would write the equation as \[ P(m, t + \Delta t) = P(m, t) + \sum \left({\text{transitions from} \atop m'\text{ to }m}\right) - \sum \left({\text{transitions from} \atop m\text{ to }m'}\right), \label{eq:master_intuition} \] where we are describing the three main components that go into the equation for \(P(m, t + \Delta t)\):

  1. The probability of having \(m\) mRNA right now,

  2. the inflow of probability from other copy numbers \(m'\) via production and degradation,

  3. the outflow of probability from \(m\) to other copy numbers via production and degradation.

Taking our time window \(\Delta t\) to be sufficiently small, we can focus only on the two contiguous mRNA counts \(m-1\) and \(m+1\), and ignore the rest, since jumps from further counts become increasingly improbable as the time step gets smaller. Fig. 5(D) shows the four in- and outflows that can happen. Let us rewrite Eq. \(\ref{eq:master_intuition}\) following this schematic. If a cell has \(m - 1\) mRNAs and during the time window \(\Delta t\) produces one molecule, then it passes from state \(m - 1\) to state \(m\). This transition contributes to the inflow of probability mass by a factor \((r_m \Delta t) P(m-1, t)\), where we can think of \(r_m \Delta t\) as the probability of the transcription event taking place during the time window, which multiplies the probability of having \(m - 1\) mRNAs to begin with. A similar argument can be made for all transitions in and out of \(m\) depicted in Fig. 5(D), with the only difference that, as in Eq. \(\ref{eq:m_t_Delta_t}\), the degradation of an mRNA molecule is proportional to the total number of molecules. The resulting equation for \(P(m, t + \Delta t)\) then takes the form \[ \begin{split} P(m, t + \Delta t) = &P(m, t) + \overbrace{(r_m \Delta t) P(m-1, t)}^{m-1 \rightarrow m} + \overbrace{(\gamma_m \Delta t) (m + 1) P(m + 1, t)}^{m+1 \rightarrow m}\\ &- \overbrace{(r_m \Delta t) P(m, t)}^{m \rightarrow m+1} - \overbrace{(\gamma_m \Delta t) m P(m, t)}^{m \rightarrow m-1} \end{split}. \label{eq:master_recipe} \] We send the first term on the right-hand side to the left, divide both sides by \(\Delta t\), and take the limit when \(\Delta t \rightarrow 0\). This gives us the master equation we were searching for \[ \frac{dP(m, t)}{dt} = r_m P(m - 1, t) + \gamma_m (m + 1) P(m + 1, t) - r_m P(m, t) - \gamma_m m P(m, t). \label{eq:master_simple} \] Eq. \(\ref{eq:master_simple}\) is not isolated. It represents an infinite-dimensional system of coupled ordinary differential equations (one for each mRNA copy number \(m\)). It can therefore be tricky to work directly with these types of equations. Instead, let us take Eq. \(\ref{eq:master_recipe}\) for a ride. With modern computational power, we can explicitly use this equation as a recipe for updating an mRNA distribution numerically. Fig. 6 shows such a numerical integration for a system with initially no mRNAs present. This could be achieved experimentally by having an inducible system, adding the inducer, and tracking the time evolution of the single-molecule mRNA counts inside cells. Fig. 6(A) presents a heatmap of such time evolution with time running on the vertical axis, while Fig. 6(B) presents specific snapshots. We can see that the distribution begins as a single peak (a delta function in the physics jargon) centered at zero mRNAs. The distribution then relaxes to a broader shape and remains the same after that. This suggests that the distribution converges to a steady state. Let us compute this steady-state distribution.

Figure 6: Time evolution of mRNA distribution. (A) Heat map of the time evolution of the mRNA distribution (Eq. \ref{eq:master_simple}) with P(m=0, t=0) = 1, i.e., a delta function at zero mRNAs at time zero. (B) Snapshots of the same time-evolving distribution at different time points. The Python code (ch1_fig06.py) used to generate the plot in part (A) of this figure can be found on the thesis GitHub repository.
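Before deriving the steady state analytically, here is a minimal Python sketch of the numerical update rule of Eq. \(\ref{eq:master_recipe}\), in the spirit of ch1_fig06.py (whose exact contents are assumed); the rates, grid truncation, and time step are illustrative choices.

```python
import numpy as np

# Illustrative parameters
r_m, gamma_m = 10.0, 1.0   # production and degradation rates
m_max = 60                 # truncate the (formally infinite) mRNA count
dt = 1e-3                  # small time step for the forward update
t_final = 10.0

# Initial condition: delta function at m = 0
P = np.zeros(m_max + 1)
P[0] = 1.0
m = np.arange(m_max + 1)

for _ in range(int(t_final / dt)):
    P_new = P.copy()
    # inflow from m-1 (production) and from m+1 (degradation)
    P_new[1:] += r_m * dt * P[:-1]
    P_new[:-1] += gamma_m * dt * m[1:] * P[1:]
    # outflow from m via production and degradation
    P_new -= r_m * dt * P + gamma_m * dt * m * P
    P = P_new

print("total probability:", P.sum().round(4))                 # stays ~1
print("mean mRNA at long times:", (m * P).sum().round(2))     # ~ r_m / gamma_m
```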

In this system, where we have a series of state transitions as represented in Fig. 5(C), steady state is reached when the net flux of probability between any two contiguous states is zero. In other words, when the probability distribution does not change over time anymore, the flow of probability from state \(m=0\) to state \(m=1\) should be the same as the reverse. The same condition applies to all other pairs of states. Mathematically this is expressed as \[ \overbrace{r_m P(0)}^{0 \rightarrow 1} = \overbrace{\gamma_m \cdot 1 \cdot P(1)}^{1 \rightarrow 0}, \] where we removed the time dependency from \(P(m, t)\) since we are at steady state. Solving for \(P(1)\) results in \[ P(1) = \left(\frac{r_m}{\gamma_m} \right) P(0). \] The same condition applies between state \(m=1\) and \(m=2\), resulting in \[ \overbrace{r_m P(1)}^{1 \rightarrow 2} = \overbrace{\gamma_m \cdot 2 \cdot P(2)}^{2 \rightarrow 1}. \] Again, we can solve for \(P(2)\) and obtain \[ P(2) = \frac{1}{2}\left(\frac{r_m}{\gamma_m} \right) P(1). \] Substituting the solution for \(P(1)\) gives \[ P(2) = \frac{1}{2} \left(\frac{r_m}{\gamma_m} \right)^2 P(0). \] Let us do one more example to see the general pattern. Between \(m=2\) and \(m=3\), we have \[ \overbrace{r_m P(2)}^{2 \rightarrow 3} = \overbrace{\gamma_m \cdot 3 \cdot P(3)}^{3 \rightarrow 2}. \] Following the same procedure and substitutions results in \[ P(3) = \frac{1}{2\cdot 3} \left(\frac{r_m}{\gamma_m} \right)^3 P(0). \] Deducing the pattern from these examples, we can see that for any \(m\), we have \[ P(m) = P(0) \frac{\left( \frac{r_m}{\gamma_m}\right)^m}{m!}. \] All we have left is the unknown value \(P(0)\). To get at it, we use the fact that the distribution must be normalized, giving \[ \sum_{m=0}^{\infty} P(m)=1 \Rightarrow P(0)\sum_{m=0}^{\infty} \frac{\left(\frac{r_m}{\gamma_m}\right)^{m}}{m !}=1. \] We recognize the sum as the Taylor series for \(e^x\). This means that our constant \(P(0)\) is given by \[ P(0) = \frac{1}{\sum_{m=0}^{\infty} \frac{\left(\frac{r_m}{\gamma_m}\right)^{m}}{m !}} = e^{-r_m / \gamma_m}. \] Substituting this result, we find that the mRNA steady-state distribution is a Poisson distribution with mean \(r_m/\gamma_m\), i.e., \[ P(m) = \frac{e^{- r_m / \gamma_m} \left( \frac{r_m}{\gamma_m}\right)^m}{m!}. \label{eq:mRNA_steady} \]
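As a quick sanity check on Eq. \(\ref{eq:mRNA_steady}\), the short sketch below builds the steady-state distribution from the recursion derived above and compares it with the Poisson distribution implemented in scipy; the mean value used is illustrative.

```python
import numpy as np
from scipy.stats import poisson

mean_m = 10.0                      # illustrative value of r_m / gamma_m
m = np.arange(40)

# Build P(m) from the recursion P(m) = P(m-1) * (r_m / gamma_m) / m, then normalize
P = np.ones(len(m))
for i in range(1, len(m)):
    P[i] = P[i - 1] * mean_m / i
P /= P.sum()

# Compare with the Poisson probability mass function
print(np.allclose(P, poisson.pmf(m, mean_m), atol=1e-6))  # True
```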

Entropy, Information, and the Math Behind the Bit

Central to the endeavor undertaken in this thesis is the idea that cells can process information from the environment to up- or down-regulate their genes and thereby generate an appropriate response to these external signals. Information as a concept is a very plastic term that we commonly use to mean having helpful knowledge we can use to our advantage. Phrases such as “That person carries so much information in her brain. She truly knows everything!” point at this somewhat imprecise concept of what we mean by information.

In 1948, while working at Bell Labs, Claude Shannon shocked the world with his seminal work that would go on to define the field of information theory  [27]. In his paper, Shannon gave us a precise mathematical definition of information. To understand Shannon’s logic better, we need to put it in the context that he was thinking about: communication systems such as the telephone or the telegraph. Although seemingly unrelated to our problem of cells sensing the environment, these systems are incredibly powerful in their conceptual and explanatory reach. For Shannon, the main problem of communication consisted of reproducing with fidelity a message emitted at one point in space and time at a different point. Usually, these messages carry meaning with them (otherwise, why would we even want to send such messages), by which we typically mean that the message “refers to or is correlated according to some system with certain physical or conceptual entities”  [27]. But for the task of engineering a reliable communication system, this meaning is irrelevant—in the same way that whatever the cell decides to do with the meaning of the signals obtained from the environment can be thought of as irrelevant for the biophysics of how the signal is sensed.

As shown schematically in Fig. 7(A) from Shannon’s original work, a communication system essentially consists of five components:

  1. An information source which produces a message (or sequence of messages) to be communicated to the receiving terminal.

  2. A transmitter which takes the message and converts it into a suitable signal compatible with the communication channel.

  3. The channel, which is the medium used to transmit the signal from the transmitter to the receiver.

  4. The receiver in charge of inverting the operation done by the transmitter, reconstructing the original message.

  5. The destination for whom the message is intended.

Fig. 7(B) shows a schematic analogous to Fig. 7(A) with the relevant components involved in the gene expression context that we focus on in this thesis. In our bacterial gene regulation model, the role of the information source is played by the environmental concentration of a small molecule. It is this signal that the cells are trying to measure and respond to by up-regulating the expression of a gene. The transmitter of this signal is the allosteric transcription factor, whose conformation depends on the concentration of the small molecule. The role of the receiver is played by the DNA promoter that orchestrates the expression of the protein.

Figure 7: Abstract communication system. (A) Reproduced from Shannon’s original seminal work  [27]. The schematic shows an abstract communication system with all the components. (B) Adaptation of the Shannon communication system to the context of bacterial gene expression regulated by an allosteric transcription factor.

Having this setup in mind, the question becomes: how do we mathematically define what information is? This brings up a somewhat subtle difference between two related terms that many times are incorrectly used interchangeably: entropy and information. Information allows the entity that possesses it to make predictions with accuracy better than random, while entropy is a quantification of how much we do not know  [5]. From these definitions, we see that having information, therefore, reduces our uncertainty, i.e., reduces the entropy. This means that for Shannon, the amount of information we have from a source is related to that source’s statistical structure and how much we can predict the source’s message given our knowledge of this statistical structure. Let us look at a concrete example: English text. We know that written and spoken language is not completely random. For a message to be meaningful, the choice of words has to come from a statistical structure that obeys the language’s grammar rules. The choice of letters within a word also follows a certain statistical structure. Let us look at the text shown in Fig. 8(A). This is arguably one of the most important and most beautiful pieces of prose ever put together by a human mind, as it is the last paragraph of On the Origin of Species by Darwin. If we ignore the paragraph’s message and just quantify how often we find each of the 26 letters of the English alphabet, we obtain a distribution like the one shown in Fig. 8(B). This paragraph shows that the most common vowel is e, exactly as in English writ large. This distribution \(P(x)\) is therefore not maximally random. In other words, if we were to put all the letters in the paragraph in a hat and pick one letter at random, we could bet more money on the outcome being the letter e and make money over time given this knowledge of the structure of the distribution.

A maximally random distribution would be one in which all letters appeared equally frequently in the paragraph, such that betting on any letter coming out of the hat would give us equal chances of guessing right. If instead of looking at the distribution of individual letters, we look at pairs of letters, we obtain the distribution \(P(x, y)\) over the paragraph shown in Fig. 8(C). Here we can see that, just as the letters were not completely random, the pairs of letters are also not random. For example, if we take the first letter of the pair to be t, we see that it is more commonly followed by the letter h. This implies that knowing that the first letter of the pair was t reduces our uncertainty about what character could come next. We would then say that knowing the first letter gave us information about the possible outcomes of the second letter. In the next section, we will follow Shannon’s original derivation to define both entropy and information mathematically.

Figure 8: The statistical structure of the English language. (A) Last paragraph of On the Origin of Species by Charles Darwin. This serves as a rather nice not-random text example. (B) Marginal distribution P(x) of all 26 letters and space. The size of the squares is proportional to how often each letter appears in the paragraph. (C) Joint distribution of pairs of characters P(x, y). All pairs of characters in (A) were counted to build this histogram. The x-axis shows the first letter while the y-axis shows the second. For simplicity in (B) and (C) all punctuation was ignored. The Python code (ch1_fig08.py) used to generate this figure can be found on the thesis GitHub repository.
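A minimal sketch of how the marginal and joint character frequencies in Fig. 8(B-C) could be computed is shown below, in the spirit of ch1_fig08.py (whose exact contents are assumed); the snippet works on any input string, and the text shown is only a truncated placeholder.

```python
import string
from collections import Counter

text = "thus, from the war of nature, from famine and death ..."  # placeholder text

# Keep only lowercase letters and spaces, dropping punctuation as in Fig. 8
clean = [c for c in text.lower() if c in string.ascii_lowercase + " "]

# Marginal distribution P(x) over single characters
counts_x = Counter(clean)
total = sum(counts_x.values())
P_x = {c: n / total for c, n in counts_x.items()}

# Joint distribution P(x, y) over contiguous character pairs
counts_xy = Counter(zip(clean, clean[1:]))
total_pairs = sum(counts_xy.values())
P_xy = {pair: n / total_pairs for pair, n in counts_xy.items()}

print(sorted(P_x.items(), key=lambda kv: -kv[1])[:5])   # most frequent characters
print(sorted(P_xy.items(), key=lambda kv: -kv[1])[:5])  # most frequent pairs
```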

Choice, Uncertainty, and Entropy

So far, our discussion about what entropy and information mean has been vague and not rigorous. To derive a formula to quantify these concepts, we need to get more mathematical. Let us assume that an information source (see Fig. 7(A)) produces elements of a message following a distribution \(\mathbf{p} = \{p_1, p_2, \ldots, p_n \}\), where each \(p_i\) is the probability of the \(i^{\text{th}}\) element. These elements could be letters, words, sentences, basepairs, concentrations of a small molecule, etc., of which we have \(n\) possibilities. What we are looking for is a metric \(H(\mathbf{p})\) that quantifies how much “choice” is involved in the selection of each element of the message. In other words, how uncertain are we about the message that the information source will produce at random? We demand that our desired quantity \(H(\mathbf{p})\) satisfy three reasonable conditions  [27]:

  1. \(H\) should be continuous in the \(p_i\)s. Different information sources might have slightly different distributions \(\mathbf{p}\); nevertheless, \(H\) should still apply to all possible information sources.

  2. If all of the elements of the distribution are equally likely, i.e., \(p_i = 1/n\), then \(H\) should be a monotonically increasing function of \(n\). This means that the more options there are to choose from, the more uncertain we are about the possible outcome. For example, we are more uncertain about the outcome of a fair 6-sided die than of a fair coin just because of the number of possible outcomes from each of these “information sources.”

  3. If the act of choosing one of the possible \(n\) elements of our information source can be broken down into two successive choices, the original \(H\) should be the weighted sum of the individual \(H\)s. What this means is illustrated in Fig. 9(A) where we imagine having an information source with \(n=3\) choices, each with probabilities \(\mathbf{p} = \{ 1/2, 1/3, 1/6\}\), which gives \(H(1/2, 1/3, 1/6)\) for the left case. For the right case, we imagine first choosing between the upper and the lower path, and then, if the lower path is chosen, a second choice is made. This property then demands that \[ \overbrace{H(1/2, 1/3, 1/6)}^{\text{single choice}} = \overbrace{H(1/2, 1/2)}^{\text{first choice}} + \overbrace{\frac{1}{2} H(2/3, 1/3)}^{\text{second choice}}, \] where \(2/3\) and \(1/3\) are the conditional probabilities of the two options given that the lower path was taken, and the factor of \(1/2\) is the probability of taking that path. Another way to think about this property is that we want our metric of uncertainty \(H\) to be additive. A numerical check of this decomposition is sketched below.
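
Anticipating the functional form of \(H\) that we derive in the next section, the following minimal sketch verifies this additivity property numerically for the example above.

    # Numerical check of Property 3 (additivity) for p = {1/2, 1/3, 1/6}:
    # H(1/2, 1/3, 1/6) should equal H(1/2, 1/2) + (1/2) H(2/3, 1/3).
    import numpy as np

    def H(p):
        """Shannon entropy in bits of a discrete distribution p."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                      # convention: 0 log 0 = 0
        return -np.sum(p * np.log2(p))

    single_choice = H([1/2, 1/3, 1/6])
    two_step = H([1/2, 1/2]) + (1/2) * H([2/3, 1/3])
    print(single_choice, two_step)        # both are approximately 1.459 bits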

Figure 9: Shannon’s theorem. (A) One of the properties of a reasonable metric for uncertainty is that we can partition choices into multiple steps, and the resulting uncertainty should remain the same. (B) Example of coding functions E. The English alphabet can be converted into Morse code. Amino acids can be encoded in codons. (C) Partitioning of 2^3 equally likely choices into three decision steps, each with two choices. Eight different amino acids can be selected using two schemes: 1) each of the eight codons is chosen at random with equal probability, or 2) the codon is built by choosing one basepair at a time. (D) Partitioning of unequal choices. Given the redundancy of the genetic code, for equally likely codons, the resulting amino acids have different probabilities of being chosen.

We will now prove that the only functional form that satisfies all these three properties is given by \[ H(\mathbf{p}) = - K \sum_{i=1}^n p_i \log p_i, \] where \(K\) is a constant having to do with the units (choice of the logarithm base). To prove this, we will follow Shannon’s original work. We imagine the problem of encoding a message. For example, imagine encoding a message from the English alphabet into Morse code, or a protein sequence into the corresponding mRNA sequence, as schematically depicted in Fig. 9(B). There, we take a message (SOS in the English alphabet, MGF for the protein), run it through an encoding function \(E\), and obtain the encoded message (... --- ... in Morse code, AUGGGCUUC for the mRNA). This process of encoding can be thought of as taking a message \(m_x\) written in an alphabet \(\mathcal{X} = \{x_1, x_2, \ldots, x_n \}\) (where \(n\) is 26 for the English alphabet, and 20 for the number of amino acids) and converting it into a message \(m_y\) written in a different alphabet \(\mathcal{Y} = \{y_1, y_2, \ldots, y_m \}\) (where \(m=2\) for Morse code since we only have dots and dashes, and \(m=4\) for the mRNA with 4 possible nucleotides). The encoding function \(E: \mathcal{X}^r \rightarrow \mathcal{Y}^t\) takes a message of length \(r\) (for our examples \(r=3\)) and translates it into a message of size \(t\) (in our examples \(t=9\)) such that we then have \[ m_y = E(m_x). \] Obviously, the larger the message \(m_x\) we want to encode, the larger the corresponding message \(m_y\) will be. Therefore we have that \[ L(m_y) \propto L(m_x), \label{eq:length_proportionality} \] where \(L(\cdot)\) is a function that counts the number of characters in a message. An essential difference between the two examples in Fig. 9(B) is that, for the English to Morse code case, the number of dots and dashes for different letters is different (e\(\rightarrow\)., x\(\rightarrow\)-..-). Meanwhile, for the amino acid to codon case, every single codon has the same length. Let us focus for now on this second coding scheme where every character from alphabet \(\mathcal{X}\) is encoded with the same number of characters from alphabet \(\mathcal{Y}\). We have then \(L(m_x) = r\) and \(L(m_y) = t\). Let us call \(k\) the proportionality constant from Eq. \(\ref{eq:length_proportionality}\) such that \[ L(m_y) = k L(m_x). \label{eq:length_fn} \] The number of messages of size \(r\) that can be encoded with the alphabet \(\mathcal{X}\) is given by \(n^r\), because we have \(n\) possible options to choose from for each of the \(r\) characters, resulting in \(n\cdot n\cdot n\cdots = n^r\). Likewise, the number of messages of size \(t\) encoded with alphabet \(\mathcal{Y}\) is \(m^t\). We then demand from our coding scheme that the number of messages we can encode is at least the number of messages we could potentially send. In other words, for our coding scheme to be able to take any message of size \(r\), it must be true that the number of possible encoded messages is at least as large as the number of possible messages to encode. This demand is expressed as \[ n^r \leq m^t. \] If our encoding did not satisfy this, we would have to increase \(t\), i.e., the number of characters we use to encode our message. For example, if codons were made out of only two basepairs, the genetic code would not be able to code for all 20 amino acids plus the stop codons. 
On the other extreme, we could develop a ridiculously long encoding scheme (imagine a version of the genetic code where 1,000 basepairs represented a single amino acid). To avoid this absurd scheme, we bound the encoded message’s size to be as long as necessary to encode all potential messages, but not any longer. This bound is expressed as \[ m^{t-1} < n^r \leq m^t. \label{eq:ineq_messages} \] Let us now take the logarithm of our previous inequality (this preserves the inequalities since \(\log\) is a monotonically increasing function), finding \[ (t - 1) \log(m) < r \log(n) \leq t \log(m). \] We are free to choose the logarithm base we find convenient; therefore, let us use base \(m\) for this, obtaining \[ t - 1 < r \log_m(n) \leq t. \label{eq:ineq_logm} \] Dividing Eq. \(\ref{eq:ineq_logm}\) by \(r\) gives \[ \frac{t-1}{r} < \log_m(n) \leq \frac{t}{r}. \label{eq:t_over_r} \] Let us stare at Eq. \(\ref{eq:t_over_r}\). In Eq. \(\ref{eq:ineq_messages}\), we established \(t\) as the minimum number of characters from alphabet \(\mathcal{Y}\) needed to encode a message of length \(r\) written with alphabet \(\mathcal{X}\) characters (such as MGF turned into AUGGGCUUC as in Fig. 9(B)). This means that, for the case where all symbols use the same number of characters when encoded, \(t/r\) is the number of characters from alphabet \(\mathcal{Y}\) per character from alphabet \(\mathcal{X}\), i.e., the proportionality constant \(k\) from Eq. \(\ref{eq:length_fn}\). Therefore, Eq. \(\ref{eq:t_over_r}\) implies \[ \log_m(n) \leq k. \] In other words, a lower bound for the number of characters from alphabet \(\mathcal{Y}\) needed to encode a character from alphabet \(\mathcal{X}\) is given by \(\log_m(n)\). For the amino acid to codon case, the minimum number of letters in a codon would be \(\log_4(20) \approx 2.16 > 2\). This shows why we could not encode all 20 amino acids with two-basepair-long codons. Furthermore, Eq. \(\ref{eq:t_over_r}\) implies that \[ \frac{t}{r} - \log_m(n) < \frac{t}{r} - \frac{(t-1)}{r}, \label{eq:t_over_r_diff} \] given that \((t-1)/r < \log_m(n)\). Simplifying Eq. \(\ref{eq:t_over_r_diff}\) results in \[ \frac{t}{r} - \log_m(n) < \frac{1}{r} \Rightarrow k - \log_m(n) < \frac{1}{r}. \label{eq:logn_1_over_r} \] Therefore, we can make \(k\), the number of encoding characters per original character, as close to \(\log_m(n)\) as we want by increasing the length of the message being encoded, i.e., making \(r \rightarrow \infty\). This would imply a genetic code, not for individual amino acids, but for entire polypeptides. This scheme would not work biologically; nevertheless, this mathematical limit will help us find the functional form of our desired function \(H(\mathbf{p})\).
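
The lower bound \(k \geq \log_m(n)\) is straightforward to evaluate for the two coding examples above. A minimal sketch follows; the helper function name and the inclusion of a single stop signal are illustrative choices.

    # Lower bound on the number of encoding characters per original character:
    # k >= log_m(n) for an original alphabet of size n and an encoding
    # alphabet of size m.
    import math

    def fixed_length_codeword(n, m):
        """Smallest integer codeword length when each of the n symbols gets
        its own fixed-length codeword built from an alphabet of size m."""
        return math.ceil(math.log(n, m))

    # Amino acids plus a stop signal (n = 21) encoded with nucleotides (m = 4):
    print(math.log(20, 4))              # ~2.16, so two-base codons cannot encode 20 amino acids
    print(fixed_length_codeword(21, 4)) # 3, the codon length of the genetic code

    # English letters (n = 26) encoded in a two-symbol (dot/dash) alphabet:
    print(math.log(26, 2))              # ~4.70 binary symbols per letter in the r -> infinity limit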

Coming back to the function \(H\), let us define \[ A(n) \equiv H \left(\frac{1}{n}, \frac{1}{n}, \frac{1}{n}, \ldots\right), \] as the maximum possible value of \(H\) when all outcomes are equally likely. Property 2 tells us that \(A(n)\) increases monotonically with \(n\), the number of equally likely options. This means that if we apply the function \(A(\cdot)\) to the terms in Eq. \(\ref{eq:ineq_messages}\), we preserve the inequalities, i.e., \[ A(m^{t-1}) < A(n^r) \leq A(m^t). \label{eq:A_ineq} \] Using Property 3, we can divide the \(n^r\) possible choices into \(r\) independent decisions, each with \(n\) options to choose from. This property is depicted in Fig. 9(C). On the left, it shows we can choose from eight different codons that code for \(2^3 = 8\) different amino acids. On the right, we can choose base by base, building up the codon in three consecutive decisions, each with two equally likely choices, for a total of \(2\cdot 2 \cdot 2 = 8\) possible outcomes. This division of choices allows us to rewrite Eq. \(\ref{eq:A_ineq}\) as \[ (t - 1) A(m) < r A(n) \leq t A(m), \label{eq:A_div} \] because of our requirement of the uncertainty \(H\) being an additive property. For the example in Fig. 9(C), at each of the three decision steps, the uncertainty is given by \(A(2)\). Given that the uncertainty is additive, for each of the routes, our total uncertainty is given by \[ A(2) + A(2) + A(2) = 3 A(2), \] therefore \(A(2^3) = 3 A(2)\). Dividing Eq. \(\ref{eq:A_div}\) by \(r\) results in \[ \frac{(t - 1)}{r} A(m) < A(n) \leq \frac{t}{r} A(m). \] Since \(\frac{(t - 1)}{r} A(m) < A(n)\), it is also true that \[ \frac{t}{r} A(m) - A(n) < A(m) \left(\frac{t}{r} - \frac{(t-1)}{r} \right). \] Simplifying terms, we are left with \[ \frac{t}{r} A(m) - A(n) < \frac{1}{r} A(m). \] Dividing both sides by \(A(m)\), we find \[ k - \frac{A(n)}{A(m)} < \frac{1}{r}. \label{eq:A_1_over_r} \] We can make the ratio \(A(n)/A(m)\) as close to \(k\) as we want by making \(r\) larger. This equation looks shockingly similar to Eq. \(\ref{eq:logn_1_over_r}\), but what is the connection? On the one hand, Eq. \(\ref{eq:logn_1_over_r}\) is the result of imposing the condition that our coding scheme must be able to encode any possible message from one alphabet \(\mathcal{X}\) to another alphabet \(\mathcal{Y}\). This condition leads us to the conclusion that the number of characters from alphabet \(\mathcal{Y}\) needed to encode the characters from alphabet \(\mathcal{X}\) (the constant \(k\)) can be made as close to \(\log_m(n)\) as we want by writing a code, not for individual characters (individual amino acids), but for sequences of characters (polypeptides). On the other hand, Eq. \(\ref{eq:A_1_over_r}\) is a direct consequence of the three logical properties we imposed on our uncertainty metric \(H\). These properties led us to conclude that, whatever our uncertainty function for the equally likely choices \(A(\cdot)\) is, the ratio of the uncertainties for each of our two alphabets \(A(n)/A(m)\) approaches the same constant \(k\) as we make the encoded message longer. Since both \(\log_m(n)\) and \(A(n)/A(m)\) approach \(k\) as \(r\) grows, we can conclude that \[ \frac{A(n)}{A(m)} \rightarrow \frac{\log_m(n)}{\log_m(m)}\; \text{ as } r \rightarrow \infty. \] We wrote the ratio \(\log_m(n)/\log_m(m)\) because our choice of the logarithm base was arbitrary. Therefore, more generally, we have \[ \frac{A(n)}{A(m)} \rightarrow \frac{\log(n)}{\log(m)}\; \text{ as } r \rightarrow \infty, \] for any base. 
This convergence takes place if and only if \[ A(n) = K \log(n), \] where \(K\) is some constant. This is quite beautiful. What we just demonstrated is that the functional form for the uncertainty metric we are after scales as the logarithm of the number of possible characters in our alphabet. Our uncertainty function \(H(1/n, 1/n, \ldots)\) takes the probabilities \(1/n\) as its arguments rather than \(n\) itself. This is easily handled using the properties of logarithms, writing \[ H\left(\frac{1}{n}, \frac{1}{n}, \ldots \right) = -K \log\left(\frac{1}{n} \right). \label{eq:entropy_equally} \] The general form of Shannon’s entropy is starting to show up. After all, for the case where all choices are equally likely, we have \(p_i = 1/n\). We can therefore write \[ H\left(\frac{1}{n}, \frac{1}{n}, \ldots \right) = -K \sum_{i=1}^n \frac{1}{n} \log\left(\frac{1}{n} \right). \]

Let us generalize the proof for cases where choices are not equally likely. To continue with the amino acid to codon encoding example, we now consider the genetic code’s redundancy. Given that there are \(4^3 = 64\) possible codons, multiple codons map to the same amino acid. An example of three amino acids that share the first letter is depicted in Fig. 9(D). The diagram on the left shows a total of nine different codons; two of these codons code for asparagine (N), three for isoleucine (I), and four for threonine (T). A way to express the asymmetry between the choices is to have each codon as an independent and equally likely choice, as depicted in the middle diagram of Fig. 9(D). Let us define the total number of codons \[ N = \sum_{i=1}^n n_i, \label{eq:sum_amino} \] where \(n_i\) counts the number of codons for amino acid \(i\), and \(n\) is the total number of amino acid choices. Let us call \(H_1\) the uncertainty of this set of equal choices. From Eq. \(\ref{eq:entropy_equally}\), we know that the resulting uncertainty function \(H_1\) is of the form \[ H_1 = K \log\left(\sum_{i=1}^n n_i \right) = K \log (N), \] since all codons are equally likely.

Although each codon is equally likely, the resulting amino acid is not. The probability of amino acid I in this case is the number of codons encoding it (three) divided by the total number of codons in the example (nine). In general, we assume that each of the \(n\) choices has a probability \[ p_i = \frac{\text{\# codons for amino acid }i}{\text{total \# of codons}} = \frac{n_i}{N}. \label{eq:p_i_amino} \] By Property 3 of our function \(H\), we can partition the codon’s choice into two consecutive decisions (not three, since the first basepair is the same for all codons in this example). This partitioning is shown in the right diagram of Fig. 9(D). The uncertainty \(H_2\) for this case has two contributions, one for each of the decisions \[ H_2 = \overbrace{H(p_1, p_2, \ldots, p_n)}^{\text{first choice}} + \overbrace{K \sum_{i=1}^n p_i \log n_i}^{\text{second choice}}. \] The first decision has an unknown functional form we are trying to figure out. The second choice consists of choosing between \(n_i\) equally likely bases for the codon’s last position, each weighted by the probability of going to this particular branch (the one that defines the amino acid) as demanded by Property 3. But whether we choose each codon in a single decision or in two steps, the uncertainty of this event is the same. This means that \(H_1 = H_2\) as Property 3 requires. This equality results in \[ K \log (N) = H(p_1, p_2, \ldots, p_n) + K \sum_{i=1}^n p_i \log(n_i). \] Solving for \(H(p_1, p_2, \ldots, p_n)\) results in \[ H(p_1, p_2, \ldots, p_n) = K \left[ \log N - \sum_{i=1}^n p_i \log(n_i) \right]. \] Using Eq. \(\ref{eq:sum_amino}\) results in \[ H(p_1, p_2, \ldots, p_n) = - K \left[ \sum_{i=1}^n p_i \log(n_i) - \log\left( \sum_{i=1}^n n_i \right) \right]. \] Since probabilities must be normalized, i.e., \(\sum_{i=1}^n p_i = 1\), we can write \[ H(p_1, p_2, \ldots, p_n) = - K \left[ \sum_{i=1}^n p_i \log(n_i) - \sum_{i=1}^n p_i \log\left( \sum_{i=1}^n n_i \right) \right]. \] Using the properties of logarithms, we can rewrite this as \[ H(p_1, p_2, \ldots, p_n) = - K \left[ \sum_{i=1}^n p_i \log\left( \frac{n_i}{\sum_{i=1}^n n_i} \right) \right]. \] Using Eq. \(\ref{eq:p_i_amino}\), we find the expected result \[ H(p_1, p_2, \ldots, p_n) = - K \sum_{i=1}^n p_i \log p_i. \label{eq:shannon_result} \]
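
As a sanity check of the equality \(H_1 = H_2\), here is a minimal numerical sketch for the example of Fig. 9(D), where nine equally likely codons split into \(n_i = 2, 3, 4\) codons for N, I, and T (with the logarithm taken in base 2, i.e., \(K = 1\) in units of bits).

    # Check H_1 = H_2 for the codon-redundancy example of Fig. 9(D):
    # nine equally likely codons, with n_i = (2, 3, 4) codons per amino acid.
    import numpy as np

    def H(p):
        """Shannon entropy in bits of a discrete distribution p."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    n_i = np.array([2, 3, 4])                 # codons per amino acid (N, I, T)
    N = n_i.sum()                             # nine codons in total
    p_i = n_i / N                             # probability of each amino acid

    H1 = np.log2(N)                           # single choice among 9 equal options
    H2 = H(p_i) + np.sum(p_i * np.log2(n_i))  # amino acid choice + last base
    print(H1, H2)                             # both are approximately 3.17 bits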

Let us dissect this result. We began this derivation by stating three logical properties that a metric for uncertainty should have. The properties could be summarized simply as 1) the function is defined and continuous for all possible \(p_i\)s, 2) the uncertainty grows as the number of possible outcomes grows, and 3) the uncertainty must be additive. We thought about a coding scheme to encode a message written in an alphabet into a different one. We demanded that our coding scheme should be able to encode any message we want, and this led us to conclude that the average number of characters needed to encode each character of the original message can approach \(\log_m(n)\), where \(n\) is the number of characters in the original alphabet and \(m\) is the number of characters in the encoding alphabet. We then used the properties of our desired uncertainty function and found a non-obvious connection between the number of characters needed to pass from one alphabet to another and the uncertainty of the message. When we generalized this analysis to cases where not all outcomes are equally likely, we arrived at Eq. \(\ref{eq:shannon_result}\), the so-called Shannon entropy. This is Shannon’s theorem, and what it shows is that Eq. \(\ref{eq:shannon_result}\) is the only function that satisfies the three very reasonable conditions we established for an uncertainty measurement.

To gain intuition on what this equation is telling us, let us look at two examples. In our first example, we will think about the simplest random process: a coin toss. To compute how unpredictable the outcome of our simple coin toss is, we can use Eq. \(\ref{eq:shannon_result}\). For this particular case, we only have two possible outcomes: heads with probability \(p\) or tails with probability \(1 - p\). The resulting entropy is of the form \[ H = - p \log(p) - (1 - p) \log(1 - p). \label{eq:entropy_coin} \] Fig. 10(A) plots Eq. \(\ref{eq:entropy_coin}\) as a function of the probability of heads \(p\). Notice that the curve is concave, with minima at \(p=0\) and \(p=1\) and a maximum at \(p=1/2\). This shape should make intuitive sense given that Eq. \(\ref{eq:entropy_coin}\) quantifies how unpredictable the outcome of tossing the coin is. If the coin toss’s outcome is always heads (p=1) or always tails (p=0), there is no uncertainty about the resulting face. The more similar the probabilities of both outcomes become (the closer \(p\) gets to \(1/2\)), the more unpredictable the random event is. One mathematical subtlety here is that for \(p=1\) or \(p=0\), we have to compute \(0 \times \log(0)\), which is undefined. For these values of \(p\) we take \(0 \times \log(0) = 0\) since the limit of \(x \log(x)\) as \(x \rightarrow 0^+\) is zero. Notice that the units on the \(y\)-axis are given in bits. These units mean that we used base two for our logarithms. An easy way to think about what a bit means is the number of yes/no questions one would need to ask on average to infer the random event’s outcome. For a fair coin, all we need is a single question (therefore one bit) to know what the outcome was.
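
A minimal sketch of Eq. \(\ref{eq:entropy_coin}\) follows; ch1_fig10.py generates the actual figure, and this snippet only evaluates the binary entropy at a few values of \(p\).

    # Entropy (in bits) of a coin with probability of heads p, as in Fig. 10(A).
    import numpy as np

    def coin_entropy(p):
        """Binary entropy in bits, using the convention 0 log 0 = 0."""
        terms = np.array([p, 1 - p])
        terms = terms[terms > 0]
        return -np.sum(terms * np.log2(terms))

    for p in [0.0, 0.1, 0.5, 0.9, 1.0]:
        print(p, coin_entropy(p))   # 0, ~0.47, 1, ~0.47, 0 bits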

For our second example, we go back to the mRNA steady-state distribution we derived in Eq. \(\ref{eq:mRNA_steady}\). We found that for our simple one-state DNA promoter, the steady-state distribution resulted in a Poisson distribution with mean \(\langle m \rangle = r_m / \gamma_m\). Fig. 10(B) shows the entropy of this Poisson distribution as a function of the mean mRNA. We see a quick initial increase in this entropy up to \(\langle m \rangle \approx 20\), after which the increase is much shallower. Imagine we sample a random cell from one of these Poisson distributions. Using the interpretation of bits again as the number of yes/no questions, what Fig. 10(B) tells us is that if the promoter produces \(\approx 10\) mRNA on average, it will take on average 3.5 of these questions to infer the number of mRNA for a random cell. For an average of \(\approx 20\) mRNA, it would take four questions, and for an average of \(\approx 60\) mRNA, five questions. These questions would be of the form “is it greater than the average?” or “is it less than or equal to 1/3 of the average?,” and so on.
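
A minimal sketch of the computation behind Fig. 10(B), assuming scipy is available; truncating the sum at a large copy number is a numerical convenience, and ch1_fig10.py remains the version used for the figure.

    # Entropy (in bits) of a Poisson mRNA distribution as a function of its mean,
    # as in Fig. 10(B). The sum is truncated at a large copy number.
    import numpy as np
    from scipy.stats import poisson

    def poisson_entropy_bits(mean_mRNA, m_max=1000):
        m = np.arange(m_max + 1)
        pmf = poisson.pmf(m, mean_mRNA)
        pmf = pmf[pmf > 0]
        return -np.sum(pmf * np.log2(pmf))

    for mean_mRNA in [10, 20, 60]:
        print(mean_mRNA, poisson_entropy_bits(mean_mRNA))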

Figure 10: Shannon entropy in action. (A) The entropy of a coin as a function of the probability of heads p. The entropy is maximum when the coin is fair, i.e., p=0.5, meaning that this is the most unpredictable coin one could have. (B) The entropy of the steady-state mRNA distribution as derived in Eq. \ref{eq:mRNA_steady} as a function of the mean mRNA copy number. The point shows the entropy of the distribution shown in the inset. Both figures use base 2 for the logarithm, resulting in units of bits for the entropy. The Python code (ch1_fig10.py) used to generate this figure can be found on the thesis GitHub repository.

Information Theory and Statistical Mechanics

Our result in Eq. \(\ref{eq:shannon_result}\) is of the same functional form as the thermodynamic entropy. The story goes that Shannon was discussing this concept with his friend John von Neumann. It was von Neumann who allegedly convinced Shannon to call his metric of randomness entropy, arguing that nobody really understands the concept. But the fact that the functional forms are the same is too suggestive to immediately dismiss a potential connection between these concepts. It was not until much later that E. T. Jaynes formalized ways to link both ideas  [28]. Nevertheless, Jaynes himself strongly discourages people from trying to map one concept to the other explicitly. In his book “Probability Theory: The Logic of Science,” Jaynes warns the reader against conflating information entropy, which is a property of the mathematical object we call a probability distribution, with the experimental entropy of thermodynamics, which is instead a property of the state of the system as defined by experimentally measurable quantities such as volume, temperature, pressure, magnetization, etc. Jaynes goes on to say: “they should never have been called by the same name; the experimental entropy makes no reference to any probability distribution, and the information entropy makes no reference to thermodynamics”  [29].

When Jaynes makes such strong remarks about the disconnection between both entropy concepts, he strictly refers to the classical thermodynamic definition. This classical definition of entropy, due to Clausius, refers to the inability of any thermal engine to convert all of the input energy into useful work. Clausius defined a new quantity \(S\) as the amount of energy per unit temperature unavailable to do work. To understand this idea, realize that of the energy liberated by gasoline combustion in a car engine, we only end up extracting \(\approx 20\%\) to move the car. The other \(80\%\) is lost to heating the engine and the environment. But this is not because the engineers are using poor designs. The second law of thermodynamics, in its classical definition, states that nothing in the universe can convert \(100\%\) of the energy into useful work; there will always be residual energy that gets turned into heat.

At the time, the existence of atoms was not widely accepted by the scientific community. But then came Boltzmann and the conceptual revolution of statistical mechanics. The giant leap in our understanding of why the second law of thermodynamics does not allow the total conversion of energy into useful work came with Boltzmann’s revolutionary idea of entropy. Boltzmann hypothesized that matter was made out of atoms. Therefore, everything we can observe and measure macroscopically about any system results from the microscopic configuration of all the atoms that make up the system. Furthermore, many microscopic arrangements are indistinguishable at our macroscopic scale (recall the microstate and macrostate concept in Fig. 2). This line of reasoning led Boltzmann to the law we stated in Eq. \(\ref{eq:boltzmann_law}\). This law and all of the classic results from statistical mechanics are founded on several assumptions about the reversibility of processes at the microscopic scale. In other words, for Boltzmann’s law to be “a legit law of nature,” it must be the case that if we play a movie featuring a single atom moving around the system, the same movie played in reverse should be equally likely to happen.

But it might be the case that the assumptions underlying the laws of statistical mechanics are not the most fundamental constructs of reality. As we will show next, we can derive a classic result of statistical mechanics from a completely different premise, one having more to do with statistical inference than with the physical laws of motion governing atoms. This becomes a somewhat circular argument: for some physicists, having the laws of motion as the defining foundation on which to base the laws of statistical mechanics is better; for others, having an information-theoretic justification for statistical mechanics, independent of the underlying physical laws, is more appealing. At the end of the day, it is a matter of taste. Having said all of this, let us delve into the connection between information-theoretic entropy and the Boltzmann distribution.

We already used the Boltzmann distribution when we computed the probability of an RNAP molecule being bound to the promoter \(p_{\text{bound}}\). The Boltzmann distribution applies to systems in thermodynamic equilibrium in contact with a heat bath at a constant temperature. Think of a small Eppendorf tube (\(\approx 2\) mL) that we perfectly seal before submerging it into the ocean. The tube’s temperature will equilibrate with that of the ocean, but the ocean’s temperature will not be affected by the tube’s presence. Because the tube is in contact with this reservoir, its total energy is not fixed. Sometimes the tube can borrow energy from the ocean; sometimes it can give energy back. The Boltzmann distribution precisely dictates the likelihood of such energy states. The probability of a state with energy \(E_i\) is given by \[ P(E_i) = \frac{e^{-\beta E_i}}{\mathcal{Z}}, \] where, as before, \(\beta \equiv (k_BT)^{-1}\). \(\mathcal{Z}\) is the partition function defined by the sum of the Boltzmann weight for all possible microstates, i.e., \[ \mathcal{Z} \equiv \sum_{\text{states}} e^{-\beta E_i}, \] where the sum is taken over all microstates available to the system. This equation is equivalent to Eq. \(\ref{eq:pbound_unreg}\) and Eq. \(\ref{eq:pbound_reg}\). We can derive this functional form from the so-called maximum entropy principle. This framework is expanded upon in Chapter 5 of this thesis. But for our purposes here, the idea is that we are trying to make a “best guess” of what a distribution looks like, given limited information. For our Eppendorf tube inside the ocean, we are thinking about the distribution over the microstates of all of the molecules inside the tube. Experimentally, we never get to observe any of the microstates of the system. But we know that the probability of each microstate depends on its energy, as Boltzmann told us. Let us say we can measure the average energy \(\langle E \rangle\) of our little Eppendorf tube. What is then the optimal guess of the functional form of the distribution that does not use any information we do not have at hand? For example, we cannot say that there is only one microstate available to the system with energy \(\langle E \rangle\), because that constrains the possibilities of the system, and measuring the average energy does not lead to such a conclusion. The best we can do is to maximize the Shannon entropy, subject to this constraint on the average energy. This makes sense because, as we derived in the previous section, the Shannon entropy is the only functional form that satisfies our properties for a metric of uncertainty. Maximizing the Shannon entropy then leads to a maximally uninformative distribution. Including the constraints when implementing this maximization guarantees that we use all that we know about the distribution and nothing else.

Given Property 1 of our function \(H\), the Shannon entropy is continuous in the values of the individual probabilities \(p_i\). This means that we can maximize the Shannon entropy by taking its derivative with respect to each \(p_i\) and equating it to zero. This operation alone, however, does not include the constraints we have on the values of the probabilities of each microstate. Let us say that each microstate available to the system with energy \(E_i\) has a probability \(p_i\) of happening. The constraint on the average energy is given by \[ \langle E \rangle = \sum_{\text{states}} E_i p_i, \] where again, the sum is taken over all possible microstates. Furthermore, we know that the probability distribution must be normalized. This means that \[ \sum_{\text{states}} p_i = 1. \] To include these constraints in our optimization, we can use the Lagrange multipliers technique. We refer the reader to any introductory text on multivariate calculus for a quick refresher on this technique. We proceed by defining a Lagrangian \(\mathcal{L}\) of the form \[ \mathcal{L}(p_1, p_2,\ldots, p_N, \beta, \mu) = \overbrace{- \sum_{i=1}^N p_i \log (p_i)}^{\text{Shannon entropy}} - \overbrace{\beta \left(\sum_{i=1}^N E_i p_i - \langle E \rangle \right)}^ {\text{average energy constraint}} - \overbrace{\mu \left(\sum_{i=1}^N p_i - 1 \right)}^ {\text{normalization constraint}}, \] where \(N\) is the total number of microstates available to the system, and \(\beta\) and \(\mu\) are the Lagrange multipliers associated with each of the constraints. The next step consists of computing the gradient of this Lagrangian, which returns a vector of size \(N\) whose \(k^{\text{th}}\) entry is the derivative of the Lagrangian with respect to \(p_k\). But notice that all of these derivatives will look the same, so taking one of these derivatives is enough. We then take the derivative with respect to a particular \(p_k\) and equate it to zero, obtaining \[ \frac{d\mathcal{L}}{d p_k} = -\log(p_k) - 1 - \mu - \beta E_k = 0. \] Notice that all of the terms with \(i\neq k\) disappear, leaving a simple expression. Solving for \(p_k\) gives \[ p_k = \exp\left[-1 - \mu - \beta E_k \right] = e^{-(1 + \mu)} e^{-\beta E_k}. \] Every single probability \(p_k\) takes the same form. We substitute this probability \(p_k\) into our normalization constraint, obtaining \[ \sum_{i=1}^N p_i = e^{-(1 + \mu)} \sum_{i=1}^N e^{-\beta E_i}=1. \] This tells us that the term \(e^{-(1 + \mu)}\) is given by \[ e^{-(1 + \mu)} = \frac{1}{\sum_{i=1}^Ne^{-\beta E_i}}. \] Therefore, the probability of microstate \(i\) is given by \[ P(E_i) = p_i = \frac{e^{-\beta E_i}}{\sum_{i=1}^N e^{-\beta E_i}}, \] exactly the Boltzmann distribution. One can show why it is the case that our Lagrange multiplier \(\beta\) is exactly \(1/k_BT\), as demanded by the thermodynamic version of this distribution, but that is beyond the scope of our purposes here. This section aims only to show the subtle and deep connection between statistical mechanics and information theory. This connection suggests that part of the unreasonable effectiveness of statistical mechanics might not come from the physical basis of its core theory, but instead from the underlying statistical inference problem: given the limited information we have about any thermodynamic system’s microstate, entropy maximization gives us a recipe for the best guess of the probability distribution over the microstates.
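
This derivation can be checked numerically by maximizing the Shannon entropy under the two constraints and comparing the result with the Boltzmann form. The following is a minimal sketch; the three-state energy spectrum, the target average energy, and the use of scipy's general-purpose optimizers are illustrative choices, not part of the thesis pipeline.

    # Numerical check of the maximum-entropy argument: maximize the Shannon
    # entropy subject to normalization and a fixed average energy, and compare
    # the result with the Boltzmann distribution exp(-beta * E) / Z.
    import numpy as np
    from scipy.optimize import minimize, brentq

    E = np.array([0.0, 1.0, 2.0])   # toy microstate energies (units of k_B T)
    E_avg = 0.6                     # imposed average energy

    def neg_entropy(p):
        p = np.clip(p, 1e-12, None)
        return np.sum(p * np.log(p))

    constraints = (
        {"type": "eq", "fun": lambda p: np.sum(p) - 1},          # normalization
        {"type": "eq", "fun": lambda p: np.sum(p * E) - E_avg},  # average energy
    )
    res = minimize(neg_entropy, x0=np.ones(len(E)) / len(E),
                   bounds=[(0, 1)] * len(E), constraints=constraints)
    p_maxent = res.x

    # Boltzmann distribution whose beta reproduces the same average energy.
    def mean_energy(beta):
        w = np.exp(-beta * E)
        return np.sum(E * w) / np.sum(w)

    beta = brentq(lambda b: mean_energy(b) - E_avg, -20, 20)
    p_boltzmann = np.exp(-beta * E) / np.sum(np.exp(-beta * E))

    print(p_maxent)
    print(p_boltzmann)              # the two distributions agree to numerical precision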

Joint Uncertainty in an Uncertain World

Part of the complexity in understanding biological systems is that their components form a network of interactions. This connectivity means that the state of one part of the organism depends on the states of many other parts. For example, the expression of the wild-type lac operon depends on the conformational state of two transcription factors: CRP and LacI. The state of these transcription factors depends on the concentration of cyclic-AMP and allolactose, respectively. These concentrations depend on the state of the environment and on the availability of transporters to bring these molecules into the cell. This chain of connections continues indefinitely.

The mathematical language to express the dependence between two variables is that of joint and conditional probability. Shannon’s entropy (Eq. \(\ref{eq:shannon_result}\)) can also be extended to account for dependence between variables. To make this extension easier to follow, let us switch notation from now on. Let us express Shannon’s entropy as \[ H(m) = -\sum_m P(m) \log P(m), \label{eq:shannon_x} \] where instead of giving a vector of probabilities \(\mathbf{p}\) to the function \(H\), we now give it a random variable \(m\). This notation is understood as: the entropy is calculated over the distribution of possible values that \(m\) can take. If \(m\) can take values \(\{m_1, m_2, \ldots, m_n\}\), the probability of obtaining \(m = m_k\) is given by the function \(P(m=m_k)\), which for brevity we can write simply as \(P(m_k)\). What Eq. \(\ref{eq:shannon_x}\) is saying is: take the random variable \(m\) and all the possible values it can have; compute the Shannon entropy by summing over the probability of all those values. In this way, \(H(m)\) is a shorthand for writing \(H[P(m)]\).

With this notation in hand, let us think about two correlated random variables \(m\) and \(p\). These could be the number of mRNAs and proteins in a cell, as depicted in Fig. 11(A). The joint entropy \(H(m, p)\) measures the uncertainty we have about the outcome of a pair of variables rather than a single one. All it takes is to sum over both variables in Eq. \(\ref{eq:shannon_x}\) as \[ H(m, p) = -\sum_m \sum_p P(m, p) \log P(m, p). \label{eq:joint_entropy} \] Eq. \(\ref{eq:joint_entropy}\) then does the same computation as Eq. \(\ref{eq:shannon_x}\), except that the sum is taken over all possible pairs of the random variables \(m\) and \(p\). But what if we get to observe the outcome of one of the two variables (the mRNA via RNA-seq, for example)? Can that tell us something about the outcome of the other one? For this, we need to understand the concept of conditional entropy.
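
A minimal sketch of the joint entropy for a toy model in the spirit of Fig. 11(A), where the mRNA count is Poisson distributed and the protein count is Poisson distributed with a mean proportional to the mRNA count; the mean of 10 mRNAs and the factor of 5 proteins per mRNA are arbitrary illustrative numbers (ch1_fig11.py is the script behind the actual figure).

    # Joint entropy H(m, p) (in bits) estimated from samples of a toy model
    # in the spirit of Fig. 11(A): m ~ Poisson(10) and p ~ Poisson(5 * m) given m.
    import numpy as np

    rng = np.random.default_rng(0)
    n_samples = 100_000
    m = rng.poisson(10, size=n_samples)      # mRNA copy numbers
    p = rng.poisson(5 * m)                   # protein copy numbers given m

    def entropy_bits(counts):
        prob = counts[counts > 0] / counts.sum()
        return -np.sum(prob * np.log2(prob))

    # Empirical joint distribution P(m, p) from a two-dimensional histogram.
    joint, _, _ = np.histogram2d(m, p, bins=[np.arange(m.max() + 2),
                                             np.arange(p.max() + 2)])

    H_m = entropy_bits(np.histogram(m, bins=np.arange(m.max() + 2))[0])
    H_p = entropy_bits(np.histogram(p, bins=np.arange(p.max() + 2))[0])
    H_mp = entropy_bits(joint)
    print(H_m, H_p, H_mp)    # H(m, p) < H(m) + H(p) because m and p are correlated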

Figure 11: Shannon’s entropy for more than one random variable. (A) Toy model of a random process where mRNA (random variable m) is stochastically produced as a Poisson process with a fixed mean. Proteins (random variable p) are also stochastically produced as a Poisson process, but the mean depends on the number of mRNAs. (B) Samples from the model presented in (A). The center plot shows the joint distribution P(m, p), while the edge histograms show the marginal distributions P(m) and P(p). (C) Venn diagram of the relationship of different information metrics. The Python code (ch1_fig11.py) used to generate this figure can be found on the thesis GitHub repository.

Thinking Conditionally, a Condition for Thinking

In his excellent Introduction to Probability  [30], Joe Blitzstein explains how conditional probability is one of the most powerful concepts in probability theory. Through the concept of conditional probability, we can learn whether or not two things are somehow correlated, allowing us from there to dissect the nature of such a correlation. Given the probabilistic nature of Shannon’s entropy, the power of conditional probability extends to the so-called conditional entropy \(H(p \mid m)\). Let us think of our two random variables \(m\) and \(p\) with a joint probability distribution \(P(m, p)\). For our mRNA-protein pair, we can assume that the outcomes of both random variables are correlated, meaning that specific pairs of values are more likely to appear. If we observed the outcome of one of the random variables and knew the statistical dependence between them, our guess for the value of the variable we did not observe would improve over a completely random choice. In our example, if we get to observe that \(m\) is a small (or large) number, we would suspect that \(p\) is also a small (or large) number, as shown in Fig. 11(B). This means that our uncertainty about the value of \(p\) changed, i.e., it was reduced, upon observing the value of \(m\). The new uncertainty, i.e., the entropy of \(p\) having learned the value of \(m\), averaged over all possible values of \(m\), is computed as \[ H(p \mid m) = - \sum_m \sum_p P(m) P(p \mid m) \log P(p \mid m), \label{eq:conditional_entropy} \] where \(P(p \mid m)\) is read as “probability of \(p\) given that we observe \(m\).” Finally, with all these concepts in hand, we can discuss the idea of information in the Shannon sense.
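
A minimal sketch of Eq. \(\ref{eq:conditional_entropy}\), computing \(H(p \mid m)\) from a table of joint counts; the small 3-by-3 table is a made-up example.

    # Conditional entropy H(p | m) (in bits) from a table of joint counts,
    # following the conditional entropy equation above. Rows index m, columns p.
    import numpy as np

    def conditional_entropy_bits(joint_counts):
        P_mp = joint_counts / joint_counts.sum()      # joint distribution P(m, p)
        P_m = P_mp.sum(axis=1, keepdims=True)         # marginal distribution P(m)
        with np.errstate(divide="ignore", invalid="ignore"):
            P_p_given_m = np.where(P_m > 0, P_mp / P_m, 0)
            log_term = np.where(P_p_given_m > 0, np.log2(P_p_given_m), 0)
        return -np.sum(P_mp * log_term)

    # Made-up joint counts for illustration.
    counts = np.array([[30, 10,  0],
                       [10, 30, 10],
                       [ 0, 10, 30]], dtype=float)
    print(conditional_entropy_bits(counts))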

One Person’s Entropy is Another Person’s Information

So far, our discussion has focused on the concept of entropy. We first derived the Shannon entropy from three basic principles that a metric of uncertainty should satisfy. Then, we showed that one of the main results of statistical mechanics, i.e., the Boltzmann distribution, could be derived from maximizing this entropy subject to certain constraints, suggesting that statistical mechanics could be nothing more than an optimal statistical inference protocol, given limited information. But we have made no mention of information up to now. This intentional omission is because we first needed to master the idea of entropy to understand the mathematical definition of information.

Recall that \(H(p)\) quantifies the uncertainty about the outcome of the random process that generates the value of the variable \(p\). Furthermore, \(H(p \mid m)\) quantifies the uncertainty about the outcome of the same variable, but this time having observed the outcome of the random variable \(m\). In the worst-case scenario, \(m\) and \(p\) are statistically independent, and learning the value of \(m\) does not tell us anything about \(p\). In that case, we then have that \[ H(p \mid m) = H(p)\;\; \text{ for $m$ and $p$ independent}. \] If \(m\) and \(p\) are correlated, as depicted in Fig. 11(B), then the uncertainty about \(p\) is reduced upon learning the value of \(m\), giving us a general relationship between marginal and conditional entropy of the form \[ H(p) \geq H(p \mid m). \] In this latter scenario, learning the value of \(m\) reduced our uncertainty in the possible value of \(p\). This reduction in uncertainty agrees with an informal definition of what “obtaining information” means. We can then define the mutual information \(I(m;p)\) between random variables \(m\) and \(p\) as the reduction in uncertainty about the value of one of the random variables when we learn the value of the other random variable. For our example in which we get to observe the mRNA copy number, this would mean that the mutual information is computed as \[ I(m; p) \equiv H(p) - H(p \mid m). \label{eq:mutual_info_entropy} \] But the mutual information is symmetric, meaning that the information gained about the outcome of one of the variables by observing the other variable is the same when the roles of what we get to observe are inverted. This means that we can mathematically show that \[ I(m; p) = H(m) - H(m \mid p). \] This symmetry is why, traditionally, the mutual information is written with a semicolon rather than a regular comma, indicating that the order of the variables does not matter. To show the above symmetry, let us substitute the definitions of the marginal and conditional entropies. This substitution for Eq. \(\ref{eq:mutual_info_entropy}\) results in \[ I(m; p) = - \sum_p P(p)\log P(p) - \left[ - \sum_m \sum_p P(m) P(p \mid m) \log P(p \mid m)\right]. \label{eq:mutual_info_probs} \] The trick is now to use the definition of conditional probability in the right way. We know that the conditional probability is defined as \[ P(p \mid m) \equiv \frac{P(m, p)}{P(m)}. \label{eq:cond_prob} \] Furthermore, we know that we can obtain the probability \(P(p)\) by marginalizing the joint distribution \(P(m, p)\) over all values of \(m\). Mathematically this is written as \[ P(p) = \sum_m P(m, p). \label{eq:marginalization} \] What Eq. \(\ref{eq:marginalization}\) is stating is that to compute the probability of observing value \(p\) of our random variable, we can add the probabilities of all pairs \((m, p)\) that have the desired value of \(p\). For Eq. \(\ref{eq:mutual_info_probs}\), we substitute Eq. \(\ref{eq:marginalization}\) on the first term (outside of the \(\log\)) of the right-hand side and Eq. \(\ref{eq:cond_prob}\) on the second term (in and outside of the \(\log\)), obtaining \[ I(m; p) = - \sum_p \left[ \sum_m P(m, p) \right]\log P(p) + \sum_m \sum_p P(m, p) \log \frac{P(m, p)}{P(m)}. \] Since the order of the sums does not matter, we can group the common terms and use the properties of logarithms to write \[ I(m; p) = \sum_m \sum_p P(m, p) \log \frac{P(m, p)}{P(m)P(p)}. 
\] It is now easier to see that we would arrive at the same result if we had started from the other conditional entropy, \(H(m \mid p)\). This series of manipulations, in which we rewrite joint or conditional entropies, will come in handy in this thesis as we explore biophysical models of gene expression input-output functions (more on that in Chapter 3). Fig. 11(C) shows a schematic representation of the relationships among all the entropy-based quantities that we explored in this chapter. Although it is impossible to cover an entire field in a short introduction, I hope this intuitive explanation will suffice to understand the rest of the thesis.
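
As a closing sketch, the mutual information can be computed directly from a joint distribution and checked against both entropy differences, making the symmetry explicit; the table of counts is the same made-up example used above.

    # Mutual information I(m; p) (in bits) from a table of joint counts,
    # checked against H(p) - H(p|m) and H(m) - H(m|p) to show the symmetry.
    import numpy as np

    def entropy_bits(P):
        P = P[P > 0]
        return -np.sum(P * np.log2(P))

    def mutual_information_bits(joint_counts):
        P_mp = joint_counts / joint_counts.sum()
        P_m = P_mp.sum(axis=1)                  # marginal P(m)
        P_p = P_mp.sum(axis=0)                  # marginal P(p)
        outer = np.outer(P_m, P_p)              # product distribution P(m)P(p)
        mask = P_mp > 0
        # I(m; p) = sum_{m,p} P(m,p) log[ P(m,p) / (P(m) P(p)) ]
        return np.sum(P_mp[mask] * np.log2(P_mp[mask] / outer[mask]))

    counts = np.array([[30, 10,  0],
                       [10, 30, 10],
                       [ 0, 10, 30]], dtype=float)
    P_mp = counts / counts.sum()
    H_m, H_p, H_mp = (entropy_bits(P_mp.sum(axis=1)),
                      entropy_bits(P_mp.sum(axis=0)),
                      entropy_bits(P_mp))

    # H(p|m) = H(m,p) - H(m) and H(m|p) = H(m,p) - H(p).
    print(mutual_information_bits(counts))
    print(H_p - (H_mp - H_m))                   # H(p) - H(p|m)
    print(H_m - (H_mp - H_p))                   # H(m) - H(m|p), identical by symmetry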

References

[1]
E. Schrödinger, What is life?: With mind and matter and autobiographical sketches (Cambridge University Press, 1992).
[2]
R. Phillips, Schrödinger’s “What is Life?” at 75, ArXiv 1 (2021).
[3]
L. Cronin and S. I. Walker, Beyond prebiotic chemistry, Science 352, 1174 (2016).
[4]
P. Davies, The Demon in the machine: How hidden webs of information are solving the mystery of life (University of Chicago Press, 2019).
[5]
C. Adami, What is information?, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374, 20150230 (2016).
[6]
S. F. Taylor, N. Tishby, and W. Bialek, Information and fitness, ArXiv (2007).
[7]
W. Bialek, Biophysics: searching for principles (Princeton University Press, 2012).
[8]
D. F. Browning and S. J. W. Busby, The regulation of bacterial transcription initiation, Nature Reviews Microbiology 2, 57 (2004).
[9]
W. T. Ireland, S. M. Beeler, E. Flores-Bautista, N. S. McCarty, T. Röschinger, N. M. Belliveau, M. J. Sweredoski, A. Moradian, J. B. Kinney, and R. Phillips, Deciphering the regulatory genome of Escherichia coli, one hundred promoters at a time, Elife 9, e55308 (2020).
[10]
G. K. Ackers, A. D. Johnson, and M. A. Shea, Quantitative model for gene regulation by lambda phage repressor, Proceedings of the National Academy of Sciences 79, 1129 (1982).
[11]
L. Bintu, N. E. Buchler, H. G. Garcia, U. Gerland, T. Hwa, J. Kondev, T. Kuhlman, and R. Phillips, Transcriptional regulation by the numbers: Applications, Current Opinion in Genetics & Development 15, 125 (2005).
[12]
T. Kuhlman, Z. Zhang, M. H. Saier, and T. Hwa, Combinatorial transcriptional control of the lactose operon of Escherichia coli, Proceedings of the National Academy of Sciences 104, 6043 (2007).
[13]
S. H. Strogatz, Nonlinear dynamics and chaos with student solutions manual: With applications to physics, biology, chemistry, and engineering (CRC Press, 2018).
[14]
K. Dill and S. Bromberg, Molecular driving forces: statistical thermodynamics in biology, chemistry, physics, and nanoscience (Garland Science, 2010).
[15]
R. P. Feynman, R. B. Leighton, and M. Sands, The Feynman lectures on physics; vol. i, American Journal of Physics 33, 750 (1965).
[16]
R. Phillips, N. M. Belliveau, G. Chure, H. G. Garcia, M. Razo-Mejia, and C. Scholes, Figure 1 theory meets figure 2 experiments in the study of gene expression, Annual Review of Biophysics 48, 121 (2019).
[17]
H. Bremer and P. P. Dennis, Modulation of chemical composition and other parameters of the cell by growth rate, Escherichia Coli and Salmonella: Cellular and Molecular Biology (n.d.).
[18]
A. Schmidt, K. Kochanowski, S. Vedelaar, E. Ahrne, B. Volkmer, L. Callipo, K. Knoops, M. Bauer, R. Aebersold, and M. Heinemann, The quantitative and condition-dependent Escherichia coli proteome, Nature Biotechnology 34, 104 (2016).
[19]
I. L. Grigorova, N. J. Phleger, V. K. Mutalik, and C. A. Gross, Insights into transcriptional regulation and sigma competition from an equilibrium model of RNA polymerase binding to DNA, Proceedings of the National Academy of Sciences 103, 5332 (2006).
[20]
H. G. Garcia and R. Phillips, Quantitative dissection of the simple repression input-output function, Proceedings of the National Academy of Sciences 108, 12173 (2011).
[21]
Y.-J. Chen, P. Liu, A. A. K. Nielsen, J. A. N. Brophy, K. Clancy, T. Peterson, and C. A. Voigt, Characterization of 582 natural and synthetic terminators and quantification of their design constraints, Nature Methods 10, 659 (2013).
[22]
A. Eldar and M. B. Elowitz, Functional roles for noise in genetic circuits, Nature 467, 167 (2010).
[23]
M. Voliotis, R. M. Perrett, C. McWilliams, C. a. McArdle, and C. G. Bowsher, Information transfer by leaky, heterogeneous, protein kinase signaling systems, Proceedings of the National Academy of Sciences 111, E326 (2014).
[24]
N. Q. Balaban, Bacterial persistence as a phenotypic switch, Science 305, 1622 (2004).
[25]
U. Gerland and T. Hwa, On the selection and evolution of regulatory DNA motifs, Journal of Molecular Evolution 55, 386 (2002).
[26]
A. Sanchez, S. Choubey, and J. Kondev, Stochastic models of transcription: From single molecules to single cells, Methods 62, 13 (2013).
[27]
C. E. Shannon, A mathematical theory of communication, Bell System Technical Journal 27, 379 (1948).
[28]
E. T. Jaynes, Information theory and statistical mechanics, Physical Review 106, 620 (1957).
[29]
E. T. Jaynes, Probability theory: The logic of science (Cambridge University Press, 2003).
[30]
J. K. Blitzstein and J. Hwang, Introduction to probability (Chapman; Hall/CRC, 2019).