The codon mutation rates \(Q_{xy}\) are the result of nucleotide mutations.
The nucleotide mutation rates can either be specified based on experimental measurements, or treated as free parameters that are optimized from the sequence data. The nucleotide mutation rates are assumed to be shared among all sites. Let \(R_{m \rightarrow n}\) denote the rate of mutation from nucleotide m to n (more specifically, it is the probability that a nucleotide mutates from m to n in a small unit of time given that the nucleotide is already m), and let \(m_c\) denote the complement of nucleotide m (so for example \(T_c = A\)).
We assume that codon mutations happen a single nucleotide at a time, so that the \(Q_{xy}\) in Equation (1) are given by
(2)\[\begin{split}Q_{xy} =
\begin{cases}
0 & \mbox{if $x$ and $y$ differ by more than on nucleotide} \\
R_{m \rightarrow n} & \mbox{if $x$ differs from $y$ by a single-nucleotide change of $m$ to $n$}.
\end{cases}\end{split}\]
Most standard approaches to modeling nucleotide substitution (for example, the GTR model) assume some set of “equilibrium” nucleotide frequencies. Here we avoid taking such an approach, because these “equilibrium” frequencies lack a clear biological meaning since nucleotide frequencies are ultimately the result of mutation and selection. We therefore define the mutation rates solely in terms of mutation parameters.
There are four nucleotides, so in principle there are 12 mutation rates.
However in practice we expect to have the constraint:
(3)\[R_{m \rightarrow n} = R_{m_c \rightarrow n_c}.\]
This constraint means, for example, that the rate from A to G is equal to the rate from T to C. The reason we expect this to be true is biological: a mutation from m to n on the sequenced strand causes a mutation of \(m_c\) to \(n_c\) on the complementary strand, so it is not possible to distinguish between these two types of mutations assuming that both strands are subject to the same mutational processes. This assumption reduces the number of independent mutation parameters to the following six parameters:
- \(R_{A \rightarrow C}\)
- \(R_{A \rightarrow G}\)
- \(R_{A \rightarrow T}\)
- \(R_{C \rightarrow A}\)
- \(R_{C \rightarrow G}\)
- \(R_{C \rightarrow T}\)
Given these mutation, rates, it is useful to determine the expected evolutionary equilibrium codon frequencies in the case where all amino acids and stop codons are selectively neutral, and so evolution is driven entirely by the mutation process defined by Equations (3) and (2) (in other words, the expected evolutionary equilibrium codon frequencies in the case where \(F_{r,xy} = 1\) in Equation (1)). In this case, the expected evolutionary equilibrium frequencies are determined entirely by the mutation rates as expected biologically. It is straightforward to show that for nonzero mutation rates that satisfy Equation (3), the expected nucleotide frequency of a A or T nucleotide is \(\frac{1}{2}\times \frac{R_{C \rightarrow A} + R_{C \rightarrow T}}{R_{A \rightarrow C} + R_{A \rightarrow G} + R_{C \rightarrow A} + R_{C\rightarrow T}}\) and the expected nucleotide frequency of a C or G nucleotide is \(\frac{1}{2} \times \frac{R_{A \rightarrow C} + R_{A \rightarrow G}}{R_{A \rightarrow C} + R_{A \rightarrow G} + R_{C \rightarrow A} + R_{C\rightarrow T}}\). The expected evolutionary equilibrium frequency \(q_x\) of codon x in the case where all amino acids and stop codons are selectively neutral are simply the product of these nucleotide frequencies, so
(4)\[q_x = \frac{1}{8} \times \frac{\left(R_{A \rightarrow C} + R_{A \rightarrow G} \right)^{\mathcal{N_{CG}}\left(x\right)}\times \left(R_{C \rightarrow A} + R_{C \rightarrow T} \right)^{\left(3 - \mathcal{N_{CG}}\left(x\right)\right)} }{\left(R_{A \rightarrow C} + R_{A \rightarrow G} + R_{C \rightarrow A} + R_{C\rightarrow T}\right)^3}\]
where \(\mathcal{N_{CG}}\left(x\right)\) is the number of C or G nucleotides in codon x.
It is easy to verify that \(q_x\) as defined by Equation (4) specifies the stationary state of a Markov process governed by the transition matrix defined by \(Q_{xy}\) given in Equation (2). For instance, if the diagonal elements of this matrix are defined as
(5)\[Q_{xx} = 1 - \sum\limits_{y \ne x} Q_{xy}\]
then the matrix \(\left[Q_{xy}\right]\) is a stochastic matrix, and so in general will have a unique stationary state with eigenvalue one. It is possible to verify by direct substitution that the \(q_x\) defined by Equation (4) is that eigenvector (normalized to sum to one).
Unfortunately, the process defined by \(Q_{xy}\) via Equations (3) and (2) is not guaranteed to be reversible. The reversibility condition is
(6)\[q_x Q_{xy} = q_y Q_{yx}.\]
To see that this condition is not satisfied, let \(x = AAT\) and \(y = CAT\) so that Equation (6) reduces
(7)\[\begin{split}\left(R_{C \rightarrow A} + R_{C \rightarrow T} \right)^3 \times R_{A \rightarrow C} &=& \left(R_{A \rightarrow C} + R_{A \rightarrow G} \right) \times \left(R_{C \rightarrow A} + R_{C \rightarrow T} \right)^2 \times R_{C \rightarrow A} \implies \\
\left(R_{C \rightarrow A} + R_{C \rightarrow T} \right) \times R_{A \rightarrow C} &=& \left(R_{A \rightarrow C} + R_{A \rightarrow G} \right) \times R_{C \rightarrow A} \implies \\
R_{C \rightarrow T} \times R_{A \rightarrow C} &=& R_{A \rightarrow G} \times R_{C \rightarrow A}.\end{split}\]
In general this last condition is not guaranteed to be true. Therefore, to make the model reversible, it is necessary to define
(8)\[R_{C \rightarrow T} = \frac{R_{A \rightarrow G} \times R_{C \rightarrow A}}{R_{A \rightarrow C}}.\]
The biological meaning of this constraint is not particularly clear or intuitive, but essentially it means that the path to getting from an A to a T via an intermediate mutation to C is equally likely to the path from getting from an A to a T via an intermediate mutation to G.
With the constraint in Equation (8) plus Equation (3), there are a total of five unique mutation rates:
- \(R_{A \rightarrow C}\)
- \(R_{A \rightarrow G}\)
- \(R_{A \rightarrow T}\)
- \(R_{C \rightarrow A}\)
- \(R_{C \rightarrow G}\)
The equilibrium codon frequencies with this constraint can be simplified to
(9)\[q_x = \frac{1}{8} \times \frac{\left(R_{A \rightarrow C} + R_{A \rightarrow G} \right)^{\mathcal{N_{CG}}\left(x\right)}\times \left(R_{C \rightarrow A} + \frac{R_{A \rightarrow G} \times R_{C \rightarrow A}}{R_{A \rightarrow C}} \right)^{\left(3 - \mathcal{N_{CG}}\left(x\right)\right)} }{\left(R_{A \rightarrow C} + R_{A \rightarrow G} + R_{C \rightarrow A} + \frac{R_{A \rightarrow G} \times R_{C \rightarrow A}}{R_{A \rightarrow C}}\right)^3}.\]
Therefore, a reversible codon mutation model can be established in terms of five mutation rate parameters. Now Equation (6) is satisfied.
In many cases, only four of the five mutation rates are really free parameters (this would be the case unless specific numerical values are attached to the mutation rates, or unless it is not the case that the branch lengths are free parameters, or unless the sequences are date-stamped). In the case where the mutation rates are not scaled by any other means, we will use the convention of setting \(R_{A \rightarrow C} = 1\) and then view the other four mutation rates as relative to this rate of one. In this case, it turns out (see An experimentally informed evolutionary model improves phylogenetic fit to divergent lactamase homologs) that
(10)\[q_x \propto \left(R_{C\rightarrow A}\right)^{\mathcal{N}_{AT}\left(x\right)}.\]
Essentially, Equation (10) holds when the mutationrates parameter described in the Input file has the value freeparameters.
When the mutationrates parameter described in Input file does not have the value freeparameters, and we are instead specifying specific numerical rates, then it is no longer the case that \(R_{A \rightarrow C} = 1\). However, in this case we can still simplify the expression for \(q_x\) to
(11)\[\begin{split}q_x & \propto & \left(R_{A \rightarrow C} + R_{A \rightarrow G} \right)^{\mathcal{N}_{CG}\left(x\right)}\times \left(R_{C \rightarrow A} + \frac{R_{A \rightarrow G} \times R_{C \rightarrow A}}{R_{A \rightarrow C}} \right)^{\left(3 - \mathcal{N}_{CG}\right)} \\
& \propto & \left(R_{A \rightarrow C} + R_{A \rightarrow G} \right)^{\mathcal{N}_{CG}\left(x\right)} \times \left(\frac{R_{C \rightarrow A}}{R_{A \rightarrow C}}\right)^{3 - \mathcal{N}_{CG}\left(x\right)} \times \left(R_{A \rightarrow C} + R_{A \rightarrow G} \right)^{3 - \mathcal{N}_{CG}\left(x\right)} \\
& \propto & \left(\frac{R_{C \rightarrow A}}{R_{A \rightarrow C}}\right)^{\mathcal{N}_{AT}\left(x\right)}.\end{split}\]
We consider two different models for how the fixation probabilities \(F_{r,xy}\) depend on the amino-acid preferences \(\pi_{r,xy}\). We refer to the first model as the FracTolerated model since it conceptually considers the preferences to be related to the fraction of genetic backgrounds that tolerate a given mutation. In this model,
(12)\[\begin{split}F_{r,xy} =
\begin{cases}
1 & \mbox{if $\mathcal{A}\left(x\right) = \mathcal{A}\left(y\right)$} \\
\omega & \mbox{if $\mathcal{A}\left(x\right) \ne \mathcal{A}\left(y\right)$ and $\pi_{r,\mathcal{A}\left(y\right)} \ge \pi_{r,\mathcal{A}\left(x\right)}$} \\
\omega \times \left(\frac{\pi_{r,\mathcal{A}\left(y\right)}}{\pi_{\mathcal{A}\left(x\right)}}\right)^{\beta} & \mbox{otherwise.}
\end{cases}\end{split}\]
where \(\mathcal{A}\left(x\right)\) denotes the amino acid encoded by codon x and the meaning of \(\omega\) is discussed below.
We refer to the second model as the HalpernBruno model since it was originally introduced by Halpern and Bruno, MBE, 1998. This model interprets the amino-acid preferences \(\pi_{r,xy}\) as being related to selection coefficients, and specifies,
(13)\[\begin{split}F_{r,xy} =
\begin{cases}
1 & \mbox{if $\mathcal{A}\left(x\right) = \mathcal{A}\left(y\right)$} \\
\omega & \mbox{if $\mathcal{A}\left(x\right) \ne \mathcal{A}\left(y\right)$ and $\pi_{r,\mathcal{A}\left(x\right)} = \pi_{r,\mathcal{A}\left(y\right)}$} \\
\omega \times \frac{\beta \times \ln\left(\pi_{r,\mathcal{A}\left(y\right)} / \pi_{r,\mathcal{A}\left(x\right)}\right)}{1 - \left(\pi_{r,\mathcal{A}\left(x\right)} / \pi_{r,\mathcal{A}\left(y\right)}\right)^{\beta}} & \mbox{otherwise.}
\end{cases}\end{split}\]
In both Equations (12) and (13), \(\omega\) is a parameter that scales the rate of nonsynonymous versus synonymous mutations. A large value of \(\omega\) means that nonsynonymous mutations are elevated relative to the expectation from the amino-acid preferences, and a small value means that they are depressed. Fixing \(\omega = 1\) means that the amino-acid preferences naturally describe the relative rates of nonsynonymous and synonymous mutations by capturing all of the selection. This script allows models either with \(\omega = 1\), or \(\omega\) as a free parameter.
In both Equations (12) and (13), \(\beta\) is a parameter that corresponds to the relative strength of selection the experiment used to determind \(\pi_{r,a}\) versus natural evolution. The actual preference in nature for an amino-acid is taken to be proportional to \(\pi_{r,a}^{\beta}\). Therefore, values of \(\beta > 1\) correspond to stronger selection in nature than in the experiments, and values of \(\beta < 1\) correspond to weaker selection in nature than in the experiments. This script allows models with either the fixed value of \(\beta = 1\) or with \(\beta\) as a free parameter.
Note that the definitions in both Equations (12) and (13) are reversible with respect to \(\left(\pi_{r,x}\right)^{\beta}\), such that
(14)\[\left(\pi_{r,x}\right)^{\beta} F_{r,xy} = \left(\pi_{r,y}\right)^{\beta} F_{r,yx}.\]
Evaluation of the likelihood of a phylogenetic tree requires specifying expected evolutionary equilibrium frequency for the codons at the root node as well as the transition probabilities \(P_{r,xy}\) given in Equation (1). It is also mathematically convenient if the overall substitution model is reversible.
Here we show that for the choices of \(Q_{xy}\) and \(F_{r,xy}\) defined above, the substitution model is reversible and the equilibrium frequencies can be calculated.
Specifically, we let \(p_{r,x}\) denote the evolutionary equilibrium frequency of codon x at site r. We assume stop codon variants are non-viable (they have \(\pi_{r,x} = 0\)) and so the evolutionary equilibrium frequency for all stop codons is zero. This reduces us to 61 possible values for x at each site, corresponding to the the 61 non-stop codons.
The evolutionary equilibrium codon frequencies are given by:
(15)\[\begin{split}p_{r,x} &=& \frac{\left(\pi_{r,\mathcal{A}\left(x\right)}\right)^{\beta} \times q_x }{\sum\limits_y \left(\pi_{r,\mathcal{A}\left(y\right)}\right)^{\beta} \times q_y} \\\end{split}\]
To see that these are the evolutionary equilibrium frequencies, define
(16)\[P_{r,xx} = -\sum\limits_{y \ne x} P_{r,xy}.\]
With the definition in Equation (16), the matrix \(\mathbf{I} + \mathbf{P_r}\) is a stochastic matrix, where \(\mathbf{I}\) is the identity matrix and \(\mathbf{P_r} = \left[P_{r,xy}\right]\). For plausible values for the mutation rates and amino-acid preferences, \(\mathbf{I} + \mathbf{P_r}\) will also be irreducible and acyclic. Therefore, it has a unique (within a scaling constant) principal eigenvector \(\mathbf{p_r} = \left[p_{r,x}\right]\) with an eigenvalue of one that represents the equilibrium frequencies. In other words, \(\mathbf{p_r} = \mathbf{p_r} \left(\mathbf{I} + \mathbf{P_r}\right)\). It can be verified that Equation (15) defines such an eigenvector by writing this eigenvector equation in element-wise form. Immediately below, I verify this for the case where \(F_{r,xy}\) is defined by Equation (12), and \(\pi_{r,\mathcal{A}\left(y\right)} > \pi_{r,\mathcal{A}\left(x\right)}\). Other cases can be verified similarly.
(17)\[\begin{split}p_{r,x} &=& p_{r,x} + \sum\limits_y p_{r,y} P_{r,yx} \\
&=& p_{r,x} + (\sum\limits_{y\ne x} p_{r,y} P_{r,yx}) + p_{r,x} P_{r,xx} \\
&=& p_{r,x} + (\sum\limits_{y\ne x} p_{r,y} P_{r,yx}) - p_{r,x} \sum\limits_{y\ne x} P_{r,xy} \\
&=& p_{r,x} + \sum\limits_{y \ne x} \left(p_{r,y} P_{r,yx} - p_{r,x} P_{r,xy}\right) \\
&=& p_{r,x} + \sum\limits_{y \ne x} \left(p_{r,y} Q_{yx} F_{r,yx} - p_{r,x} Q_{xy} F_{r,xy}\right) \\
&=& p_{r,x} + \sum\limits_{y \ne x} \left(\left(\pi_{r,\mathcal{A}\left(y\right)}\right)^{\beta} q_y Q_{yx} F_{r,yx} - \left(\pi_{r,\mathcal{A}\left(x\right)}\right)^{\beta} q_x Q_{xy} F_{r,xy}\right) \\
&=& p_{r,x} + \sum\limits_{y \ne x} \left(\left(\pi_{r,\mathcal{A}\left(y\right)}\right)^{\beta} q_y Q_{yx} \left(\frac{\pi_{r,\mathcal{A}\left(x\right)}}{\pi_{r,\mathcal{A}\left(y\right)}}\right)^{\beta} - \left(\pi_{r,\mathcal{A}\left(x\right)}\right)^{\beta} q_x Q_{xy} \right) \\
&=& p_{r,x} + \left(\pi_{r,\mathcal{A}\left(x\right)}\right)^{\beta} \sum\limits_{y \ne x} \left( q_y Q_{yx} - q_x Q_{xy} \right) \\
&=& p_{r,x}\end{split}\]
where the last line follows from Equation (6). This verifies that \(p_{r,x}\) is the stationary state of the Markov process defined by \(P_{r,xy}\).
To verify that the substitution model is reversible, it is necessary to show that \(0 = p_{r,x} P_{r,xy} - p_{r,y} P_{r,yx}\). This is shown below for the case where \(F_{r,xy}\) is defined by Equation (12), and \(\pi_{r,\mathcal{A}\left(y\right)} > \pi_{r,\mathcal{A}\left(x\right)}\). Other cases can be verified similarly.
(18)\[\begin{split}0 &=& p_{r,x} P_{r,xy} - p_{r,y} P_{r,yx} \\
&=& \left(\pi_{r,\mathcal{A}\left(x\right)}\right)^{\beta} \times q_x \times Q_{xy} \times F_{r,xy} - \left(\pi_{r,\mathcal{A}\left(y\right)}\right)^{\beta} \times q_y \times Q_{yx} \times F_{r,yx} \\
&=& \left(\pi_{r,\mathcal{A}\left(x\right)}\right)^{\beta} \times q_x \times Q_{xy} - \left(\pi_{r,\mathcal{A}\left(y\right)}\right)^{\beta} \times q_y \times Q_{yx} \times \left(\frac{\pi_{r, \mathcal{A}\left(x\right)}}{\pi_{r, \mathcal{A}\left(y\right)}}\right)^{\beta} \\
&=& q_x \times Q_{xy} - q_y \times Q_{yx} \\
&=& 0\end{split}\]
where the last line follows from Equation (6).
The fact that \(P_{r,xy}\) defines a reversible Markov process with stationary state \(p_{r,x}\) means that it is possible to define a symmetric matrix \(\mathbf{S_r}\) such that
(19)\[\mathbf{S_r} \rm{diag}\left(\ldots, p_{r,x}, \ldots\right) = \mathbf{P_r}\]
where \(\rm{diag}\left(\ldots, p_{r,x}, \ldots\right)\) is the diagonal matrix with \(p_{r,x}\) along its diagonal. Noting \(\mathbf{S_r} = \mathbf{P_r} \rm{diag}\left(\ldots, \frac{1}{p_{r,x}}, \ldots\right)\), we have
(20)\[\begin{split}S_{r,xy} =
\begin{cases}
\frac{P_{r,xy}}{p_{r,y}} = 0 & \mbox{if $x$ and $y$ differ by more than one nucleotide mutation,} \\
\frac{P_{r,xy}}{p_{r,y}} = \left( \sum\limits_z \left(\pi_{r,\mathcal{A}\left(z\right)}\right)^{\beta} \times q_z \right) \frac{Q_{xy}}{q_y} \frac{F_{r,xy}}{\left(\pi_{r,\mathcal{A}\left(y\right)}\right)^{\beta}} & \mbox{if $x$ and $y$ differ by one nucleotide,} \\
\frac{P_{r,xx}}{p_{r,x}} & \mbox{otherwise.} \\
\end{cases}\end{split}\]
This matrix is symmetric since \(S_{r,xy} = S_{r,yx}\) as can be verified from the fact that \(\frac{Q_{xy}}{q_y} = \frac{Q_{yx}}{q_x}\) and \(\frac{F_{r,xy}}{\left(\pi_{r,\mathcal{A}\left(y\right)}\right)^{\beta}} = \frac{F_{r,yx}}{\left(\pi_{r,\mathcal{A}\left(x\right)}\right)^{\beta}}\) as is guaranteed by the reversibility conditions in Equations (6) and (14).