mapmuts_inferdifferentialpreferences.py

This program is no longer the state-of-the-art: You are suggested to instead
use the dms_tools program dms_inferdiffprefs.

This script is designed to infer the differential preferences for each amino acid at each site when comparing two or more selection pressures. It does this using a Bayesian approach implemented via MCMC (Markov chain Monte Carlo).

Because this script uses MCMC, it takes a fairly long time to run – perhaps as long as several days depending on your computer and the specific data set.

This script takes as input the codoncounts.txt files created by mapmuts_parsecounts.py, and so is designed to be run after you have completed the analysis with that script.

Dependencies

This script requires pymc for the Bayesian inference, which in turn requires numpy. If you want the script to make plots, you also need matplotlib. The use of pymc version 2.3 is recommended. This script may not work with versions 3.*, and may work suboptimally with earlier versions.

When would you use this script?

If you just want to infer the site-specific amino-acid preferences by comparing a pre-selection mutant library to a post-selected sample, then you should probably instead use mapmuts_inferpreferences.py, which implements the algorithm described in An experimentally determined evolutionary model dramatically improves phylogenetic fit (in the Methods section entitled Inference of the Amino Acid Preferences).

This script is tailored to the case where you have several samples that have been selected in slightly different conditions, and you want to identify sites and amino acids that are selected differently in the different conditions. Most commonly, you would have already selected for basic functionality, and are now looking for differential functionality under various selections. This script identifies the differential preferences in these different conditions. The algorithm here is still similar in many respects to that described in An experimentally determined evolutionary model dramatically improves phylogenetic fit, but now is elaborated to infer the differential preferences.

The samples that compose the data

We assume that we are sequencing the following samples:

  • error_control: This sample is used to measure the error rate associated with sequencing and sample processing. It will generally be the unmutated gene or a wildtype virus grown from unmutated gene, whichever better captures the errors that afflict the samples. For influenza, the errors (or unaccounted for mutations) generally come from mutations during viral replication, errors during reverse transcription, errors during PCR when preparing the sample, and sequencing errors. So here, the appropriate sample is generally wildtype virus grown from an unmutated gene.
  • starting_sample: This is the sample that was the starting point for the selections. It could be your mutant library, but more typically it will be a mutant virus pool that you have already grown from your mutant library and are now subjecting to further selection. An implicit assumption of the way that the error correction with error_control is done is that at all sites this sample are still mostly wildtype (the mutations are relatively rare).
  • control_selection: This is starting_sample passaged in your control condition. For instance, this might be virus that is passaged a second time in a control cell line or animal. An implicit assumption of the way that the error correction with error_control is done is that all sites in this sample are still mostly wildtype (the mutations are relatively rare).
  • selection_1: This is starting_sample passaged in the first selection condition. For instance, this might be virus that is passaged a second time under immune selection. An implicit assumption of the way that the error correction with error_control is done is that all sites in this sample are still mostly wildtype (the mutations are relatively rare).
  • selection_2, ..., selection_i: These are an arbitrary number of additional samples like selection_1 that are passaged in other selection conditions.

Let \(n^{\rm{error}}_{r,x}\), \(n^{\rm{start}}_{r,x}\), \(n^{\rm{control}}_{r,x}\), and \(n^{s_i}_{r,x}\) be the observed number of counts of codon \(x\) at site \(r\) in the sequencing of error_control, starting_sample, control_selection, and selection_i samples, respectively. Define \(\overrightarrow{n}^{\rm{error}}_{r}\) as the vector of \(n^{\rm{error}}_{r,x}\) values for all codons \(x\), and make similar definitions for \(\overrightarrow{n}^{\rm{start}}_{r}\), \(\overrightarrow{n}^{\rm{control}}_{r}\), and \(\overrightarrow{n}^{s_i}_{r}\). Define \(N^{\rm{error}}_{r} = \sum_x n^{\rm{error}}_{r,x}\) be the total number of counts of any codon at site \(r\) in the error_control sample, and make similar definitions for \(N^{\rm{start}}_{r}\), \(N^{\rm{control}}_{r}\) and \(N^{s_i}_{r}\).

The likelihoods in terms of unknown parameters

The error_control which is just sequencing of the wildtype gene. Let \(\rm{wt}_r\) denote the wildtype codon at site \(r\). Let \(\epsilon_{r,x}\) be the rate that site \(r\) is read to be some mutant codon \(x \ne \rm{wt}_r\), and let \(\epsilon_{r,\rm{wt}_r} = 1 - \sum_{y \ne \rm{wt}_r} \epsilon_{r,x}\). Define \(\overrightarrow{\epsilon}_r\) as the vector of \(\epsilon_{r,x}\) values for all \(x\). The likelihood of observing \(\overrightarrow{n}_{\rm{error},r}\) is

\[\Pr\left(\overrightarrow{n}^{\rm{error}}_{r} \mid N^{\rm{error}}_{r}, \overrightarrow{\epsilon}_r\right) = \operatorname{Multi}\left(\overrightarrow{n}^{\rm{error}}_{r}; N^{\rm{error}}_{r}, \overrightarrow{\epsilon}_r\right)\]

where \(\operatorname{Multi}\) denotes the multinomial distribution.

The starting_sample gives the starting frequencies for the various variants before the selections. We assume that all sites are mostly the wildtype codon, and so account for errors from the wildtype codon to mutant codons (as quantified by the error_control sample), but do not account for errors from mutant codons to other codons. Let \(f_{r,x}\) be the frequency of codon \(x\) at site \(r\) with \(\sum_x f_{r,x} = 1\), and let \(\overrightarrow{f}_r\) be the vector of \(f_{r,x}\) values for all codons \(x\). The likelihood of observing \(\overrightarrow{n}^{\rm{start}}_{r}\) is

\[\Pr\left(\overrightarrow{n}^{\rm{start}}_{r} \mid N^{\rm{start}}_{r}, \overrightarrow{\epsilon}_r, \overrightarrow{f}_r\right) = \operatorname{Multi}\left(\overrightarrow{n}^{\rm{start}}_{r}; N^{\rm{start}}_{r}, \overrightarrow{\epsilon}_r + \overrightarrow{f}_r - \overrightarrow{\delta}_r\right)\]

where \(\overrightarrow{\delta}_r\) is the vector with elements \(\delta_{x, \rm{wt}_r}\) that has one in the element corresponding to the wildtype codon and zero for all othere elements.

The control_selection gives the frequencies after the starting_sample has been passaged in the control condition. These frequencies may change substantially depending on the selection pressures that generated starting_sample versus those acting on the control_selection. We assume that the selection acts entirely on the amino-acid sequence, such that all synonymous codons that encode the same amino-acid are equivalent. Let \(\mathcal{A}\left(x\right)\) denote the amino acid encoded by codon \(x\). Let \(\pi_{r,a}\) be the preference for amino-acid \(a\) at site r in the control_selection, with the constraint that \(\sum_a \pi_{r,a} = 1\). Let \(\overrightarrow{\pi}_r\) be the vector of \(\pi_{r,a}\) values for all amino acids \(a\). Define the vector-valued function \(\overrightarrow{\mathcal{C}}\) as a mapping for the value of each codon \(x\) to the corresponding value for its amino acid \(\mathcal{A}\left(x\right)\), such that \(\overrightarrow{\mathcal{C}}\left(\overrightarrow{\pi}_{r}\right) = \left(\ldots, \pi_{r,\mathcal{A}\left(x\right)}, \ldots\right)\) for all codons \(x\). The expected frequency of codon \(x\) at site \(r\) in the control sample is defined in terms of the preferences as \(\frac{f_{r,x} \times \pi_{r,\mathcal{A}\left(x\right)}}{\sum_x f_{r,x} \times \pi_{r,\mathcal{A}\left(x\right)}}\). Therefore, if we again assume that all sites in the control_selection sample are mostly wildtype so that we only account for errors from the wildtype codon to some other codon, then the likelihood of observing \(\overrightarrow{n}^{\rm{control}}_{r}\) is

\[\Pr\left(\overrightarrow{n}^{\rm{control}}_{r} \mid N^{\rm{control}}_{r}, \overrightarrow{\epsilon}_r, \overrightarrow{f}_r, \overrightarrow{\pi}_r\right) = \operatorname{Multi}\left(\overrightarrow{n}^{\rm{control}}_{r}; N^{\rm{control}}_{r}, \overrightarrow{\epsilon}_r + \frac{\overrightarrow{f}_r \circ \overrightarrow{\mathcal{C}}\left(\overrightarrow{\pi}_r\right)}{\overrightarrow{f}_r \cdot \overrightarrow{\mathcal{C}}\left(\overrightarrow{\pi}_r\right)} - \overrightarrow{\delta}_r \right)\]

where \(\cdot\) indicates the vector dot product and \(\circ\) indicates the Hadamard (entry-wise) vector product.

We are most interested in how the other selections differ from the control selection – in particular whether there is increased or decreased selection for some amino acids at certain sites. Let \(\Delta\pi^{s_i}_{r,a}\) denote the difference in the preference for amino-acid \(a\) at site \(r\) for selection_i (denoted by \(s_i\)). If selection_i is identical to the control selection, then we expect these differences in preferences to all be zero. Even when they are not zero, we have the constraint \(\sum_a \Delta\pi_{s_i}^{r,a} = 0\), since any increase in preference for one amino acid must be compensated by a decrease in preference for other amino acids. Let \(\overrightarrow{\Delta\pi}^{s_i}_{r}\) be the vector of \(\Delta\pi^{s_i}_{r,a}\) values for all amino acids \(a\). If we again assume that all sites in the control_selection sample are mostly wildtype so that we only account for errors from the wildtype codon to some other codon, then the likelihood of observing \(\overrightarrow{n}^{s_i}_{r}\) is

\[\Pr\left(\overrightarrow{n}^{s_i}_{r} \mid N^{s_i}_{r}, \overrightarrow{\epsilon}_r, \overrightarrow{f}_r, \overrightarrow{\pi}_r, \overrightarrow{\Delta\pi}^{s_i}_{r}\right) = \operatorname{Multi}\left(\overrightarrow{n}^{s_i}_{r}; N^{s_i}_{r}, \overrightarrow{\epsilon}_r + \frac{\overrightarrow{f}_r \circ \overrightarrow{\mathcal{C}}\left(\overrightarrow{\pi}_r + \overrightarrow{\Delta\pi}^{s_i}_{r}\right)} {\overrightarrow{f}_r \cdot \overrightarrow{\mathcal{C}}\left(\overrightarrow{\pi}_r + \overrightarrow{\Delta\pi}^{s_i}_{r}\right)} - \overrightarrow{\delta}_r \right)\]

Priors over the unknown parameters

We need to place priors over our unknown parameters, which are: \(\overrightarrow{\epsilon}_r\), \(\overrightarrow{f}_r\), \(\overrightarrow{\pi}_r\), and \(\overrightarrow{\Delta\pi}_{s_i,r}\) for all selections \(s_i\).

The prior over \(\overrightarrow{\epsilon}_r\) is defined exactly as in the 13th equation of An experimentally determined evolutionary model dramatically improves phylogenetic fit. Briefly, the prior is

\[\Pr\left(\overrightarrow{\epsilon}_r\right) = \operatorname{Dir}\left(\overrightarrow{\epsilon}_r; n_{\rm{codons}} \cdot \sigma_{\epsilon} \cdot \overrightarrow{\alpha}_{\epsilon,r}\right)\]

where \(\operatorname{Dir}\) is the Dirichlet distribution, \(n_{\rm{codons}}\) is the number of different codon identities (typically either 64 or 61, depending on whether stop codons are considered), \(\sigma_{\epsilon}\) is the scalar concentration parameter, and \(\overrightarrow{\alpha}_{\epsilon,r}\) is the vector giving the average mutation rate for all codon mutations with that number of mutations over all sites in the error_control sample, exactly as defined in the description of the 13th equation of An experimentally determined evolutionary model dramatically improves phylogenetic fit.

The prior over \(\overrightarrow{f}_r\) is defined similarly to the prior over \(\overrightarrow{\mu}_r\) defined in the 12th equation of An experimentally determined evolutionary model dramatically improves phylogenetic fit. Specifically, the prior is

\[\Pr\left(\overrightarrow{f}_r\right) = \operatorname{Dir}\left(\overrightarrow{f}_r; n_{\rm{codons}} \cdot \sigma_{f} \cdot \overrightarrow{\alpha}_{f,r}\right)\]

where \(\sigma_{f}\) is the scalar concentration parameter and \(\overrightarrow{\alpha}_{f,r}\) is the vector with elements \(\alpha_{f,r,x} = \frac{\overline{f}}{n_{\rm{codons}} - 1} + \delta_{x,\rm{wt}_r}\left(1 - \overline{f}\times\frac{n_{\rm{codons}}}{n_{\rm{codons}} -1}\right)\) where \(\overline{f}\) is the fraction of all reads in start_sample that do not give the wildtype codon \(\rm{wt}_r\) averaged over all sites, and \(\delta_{x,\rm{wt_r}}\) is the Kronecker-delta function which equal to one if codon \(x\) is the wildtype codon \(\rm{wt}_r\) and zero otherwise. In other words, the prior estimate is that all non-wildtype amino acids (mutations) are present at the average rate that any mutation is found over the entire gene. This is a reasonable uninformative prior.

The prior over \(\overrightarrow{\pi}_{r}\) is defined exactly as in the 15th equation of An experimentally determined evolutionary model dramatically improves phylogenetic fit. Specifically, the prior is

\[\Pr\left(\overrightarrow{\pi}_r\right) = \operatorname{Dir}\left(\overrightarrow{\pi}_r; \sigma_{\pi} \cdot \overrightarrow{1}\right)\]

where \(\sigma_{\pi}\) is the scalar concentration parameter and \(\overrightarrow{1}\) is a vector that is all ones. In other words, the prior is a symmetric Dirichlet distribution, so the a prior estimate is that all amino acids are equally preferred at each site.

We also need to define priors over the differential preferences \(\overrightarrow{\Delta\pi}^{s_i}_r\) under the alternative selections. We would like these priors to be such that the mean prior estimate for \(\Delta\pi^{s_i}_{r,a}\) for all sites \(r\) and amino acids \(a\); in other words, we would like the priors to not favor any tendency to estimate non-zero differential preferences. We also need the priors to enforce the conditions that \(\sum_a \Delta\pi^{s_i}_{r,a} = 0\) and that \(0 \le \pi^{s_i}_{r,a} + \Delta\pi^{s_i}_{r,a} \le 1\) for all \(a\). These conditions therefore require that the priors over \(\overrightarrow{\Delta\pi}^{s_i}_{r}\) be defined in terms of \(\overrightarrow{\pi}_r\). Specifically, we define

\[\Pr\left(\overrightarrow{\Delta\pi}^{s_i}_r \mid \overrightarrow{\pi}_r\right) = \operatorname{Dir}\left(\overrightarrow{\Delta\pi}^{s_i}_r + \overrightarrow{\pi}_r; \sigma_{\Delta\pi} \cdot n_{\rm{aa}} \cdot \overrightarrow{\pi}_r \right)\]

where \(\sigma_{\Delta\pi}\) is the concentration parameter (assumed to be the same for all selections \(s_i\)) and \(n_{\rm{aa}}\) is the number of amino acids.

Inference with MCMC

The goal of the inference is to determine the differential preferences \(\overrightarrow{\Delta\pi}^{s_i}_r\) for all selection pressures \(s_i\). In some cases we might also be interested in the preferences \(\overrightarrow{\pi}_r\) for the control_selection.

The other parameters \(\overrightarrow{\epsilon_r}\) and \(\overrightarrow{f}_r\) are nuisance parameters, with values that must be calculated but are not of direct interest for the analyses performed by this script.

Finally, the concentration parameter used for each priors (\(\sigma_{\epsilon}\), \(\sigma_{f}\), \(\sigma_{\pi}\), and \(\sigma_{\Delta\pi}\) are variables that must be specified a priori by the user of the script.

The posterior distributions over the parameters of interest are inferred by MCMC (Markov Chain Monte Carlo). Typically the values will be summarized by their posterior means. The sampling is done over the products of the likelihoods and priors specified in the sections above. Specifically, we sample from the following posterior:

\[\begin{split}\rm{posterior} \propto& \Pr\left(\overrightarrow{n}^{\rm{error}}_r \mid N_r^{\rm{error}}, \overrightarrow{\epsilon}_r\right) \times \\ & \Pr\left(\overrightarrow{n}^{\rm{start}}_{r} \mid N^{\rm{start}}_{r}, \overrightarrow{\epsilon}_r, \overrightarrow{f}_r\right) \times \\ & \Pr\left(\overrightarrow{n}^{\rm{control}}_{r} \mid N^{\rm{control}}_{r}, \overrightarrow{\epsilon}_r, \overrightarrow{f}_r, \overrightarrow{\pi}_r\right) \times \\ & \left(\prod_{s_i} \Pr\left(\overrightarrow{n}^{s_i}_{r} \mid N^{s_i}_{r}, \overrightarrow{\epsilon}_r, \overrightarrow{f}_r, \overrightarrow{\pi}_r, \overrightarrow{\Delta\pi}^{s_i}_{r}\right)\right) \times \\ & \Pr\left(\overrightarrow{\epsilon}_r\right) \times \\ & \Pr\left(\overrightarrow{f}_r\right) \times \\ & \Pr\left(\overrightarrow{\pi}_r\right) \times \\ & \left(\prod_{s_i} \Pr\left(\overrightarrow{\Delta\pi}^{s_i}_r \mid \overrightarrow{\pi}_r\right) \right)\end{split}\]

Implementation of the MCMC

The posterior distributions for the differential preferences and preferences are estimated by MCMC using the pymc package. The script automatically tests for convergence using the convergence and nruns options. It is suggested that you use nruns > 1 and stepincrease > 1 so that you can try to ensure convergence. Substantial effort has been invested to make sure that by default the run will burn-in appropriately and set reasonable step sizes for the variables. The details of this implementation are documented in the source code.

The log file created by this script will include information about whether the inference for each site converged according the criterium set by convergence. If it consistently fails to converge, you might consider increasing nsteps.

When examining convergence, the difference between the differential preferences and preferences inferred from different runs is quantified as the Shannon-Jensen divergence. The logarithms are taken to the base 2, so this divergence can range from zero (identical inferred preferences) to one (completely independent inferred preferences). For the control_selection, the Shannon-Jensen divergence is computed using the preferences \(\pi_{r,a}\). For the other selections, the Shannon-Jensen divergence is computed using the sum of the preferences and differential preferences, \(\pi_{r,a} + \Delta\pi^{s_i}_{r,a}\).

Running the script

To run this script from the prompt, first create a text Input file of the format described below. Then simply run the script with the single text file as the argument, as in:

mapmuts_inferdifferentialpreferences.py infile.txt

Input file

The input file is a text file with a series of key / value pairs. The required keys are indicated below. The values should not include spaces.

Lines beginning with # and empty lines are ignored.

Keys for the input file:

  • error_control : This key should specify a file giving the counts of each codon at each position for the error_control sample; these are the \(\overrightarrow{n}^{\rm{error}}_r\) values. This file should be in the format of a *_codoncounts.txt created by mapmuts_parsecounts.py. Typically, this sample would be from sequencing either an unmutated gene or a wildtype virus to estimate the baseline error rate.
  • starting_sample : This key specifies a file giving the counts that are the \(\overrightarrow{n}^{\rm{start}}_r\) values. This file is in the same format as for error_control. Typically, this sample would be the pool of functional mutant variants or viruses that were created from the mutant library and are now being subjected to the subsequent selections.
  • control_selection : This key specifies a file giving the counts that are the \(\overrightarrow{n}^{\rm{control}}_r\) values. This file is in the same format as for error_control. Typically, this sample would be the pool in starting_sample that has been subjected to the control growth condition, such as passage of a virus in control cells without any additional selection.
  • selection_??? : These keys specify the selection_i conditions used to look for differential selection relative to control_selection. There must be at least one of these keys, but there can be arbitrarily more. The actual key should take the form selection_heat or selection_antibody or selection_ followed by any non-whitespace-containing string giving the name assigned to the additional selection. The name given to the selection condition in the results is whatever is the suffix that follows the underscore in the selection_??? string. Typically, these samples would be starting_sample passaged in a condition that is being compared to control_selection. The values should be codon counts files in the same format as for error_control.
  • pi_concentration : the concentration parameter of the symmetric Dirichlet distribution used as the prior over the preferences \(\pi_{r,a}\) values. This is the parameter denoted as \(\sigma_{\pi}\) above. Note that the concentration parameter here is taken to represent the value of each of the individual elements in the parameter vector. Choosing a value of pi_concentration = 1 corresponds to a uniform prior over all of the possible vectors of pi_concentration values, and is probably what you want to choose unless you have a good reason to do otherwise. A value of pi_concentration > 1 favors vectors where all pi_concentration values are similar, and a value of pi_concentration < 1 favors vectors where some \(\pi_{r,a}\) values are large and others are small.
  • epsilon_concentration : the concentration parameter of the (non-symmetric) Dirichlet distribution used as a prior over the error rate observed in the error_control library. This is the parameter denoted as \(\sigma_{\epsilon}\) above. The suggested value is one.
  • f_concentration : the concentration parameter of the (non-symmetric) Dirichlet distribution used as a prior over the frequencies in the starting_sample. This is the parameter denoted as \(\sigma_{f}\) above. The suggested value is one.
  • deltapi_concentration : the concentration parameter of the (non-symmetric) Dirichlet distribution used as a prior over differential preferences. This is the parameter denoted as \(\sigma_{\Delta\pi}\) above. The suggested value is one.
  • minvalue : a number specifying the minimum value for the prior estimates for the error rates (\(\overrightarrow{\epsilon}_r\)) determined from the overall library means. Having such a minimum value keeps any of the prior estimates from being set to zero or negative values. A reasonable value is 1e-7.
  • seed : integer seed for the random number generator. Runs with the same seed should generate exactly the same output. Otherwise the output may differ for different seeds as the MCMC uses random numbers.
  • nruns : the number of independent MCMC runs. Must be at least one, but it is recommended that you use nruns equal to at least 2 (and probably 3), as this allows checks for MCMC convergence. When multiple runs are performed, the reported equilibrium preferences come from all runs.
  • nsteps : the number of steps for each MCMC run. A reasonable starting value is probably 200000, although this may need to be increased if convergence is poor. Note that the number of burn-in steps are not explicitly specified in this script, but the program implements internal methods that should ensure adequate burn-in that is not included in the output. You will have to look at the source code if you want to understand that in detail.
  • thin : the posterior is only recorded every thin steps. If you want to record every value, then set thin to one. However, consecutive steps are usually correlated, so it is often preferable to set thin to a number greater than one. A suggested value is 200. nsteps must be a multiple of thin.
  • convergence : the test applied to the MCMC runs to see if they converged. This option is only meaningful if nruns > 1 (as is recommended). For each pair of MCMC runs, we calculate the Shannon-Jensen divergence (logarithm base 2) between the preferences and differential preferences for that value of r. If the runs converge to identical values, this difference will be zero, whereas its maximum value for highly diverged runs is one. We test whether the actual divergence is less than the value specified by convergence. If the divergence is greater than convergence for any of the preferences or differential preferences, then what is done next depends on the values of stepincrease. A recommended value for convergence is 0.02. The log file contains information about whether the inference converged for each site.
  • stepincrease : If the chain fails to converge for a given inference (according to the criterion of convergence), then the default action is to try to increase the number of steps to be equal to stepincrease * nsteps. We then test again for convergence for these longer MCMC runs. Typically increasing the step number might improve convergence. So stepincrease must be a number >= 1. If it is one, that is equivalent to not doing any stepincrease. If stepincrease is greater than one, we then try running the MCMC for stepincrease * nsteps steps for each run. If the chain still does not converge, nothing further is done, but we do use the inferences from the longer chain. A suggested value of stepincrease is 4.
  • MCMC_traces is the directory to which we write plots showing the traces of the preferences and differential preferences as a function of the number of MCMC steps after burn-in and thinning. If you set this to the string None, then no such plots are created. Using this option requires matplotlib. If this directory does not already exist, it is created.
  • preference_plots is the directory to which we write plots showing the preferences and differential preferences for all amino acids at a site. Shown are the posterior mean and the median-centered 95% credible interval of the posterior. If you set this to the string None, then no such plots are created. Using this option requires matplotlib. If this directory does not already exist, it is created.
  • outfileprefix is a string giving the prefix for the Output files described below. If you do not want a prefix on the Output files, then set this option to None.
  • ncpus is an option that allows you to run the analyses using multiple CPUs. This will usually be helpful, because these runs can take a long time. If you set ncpus to 1, then we just use one CPU. But if you are working on a multi-threaded processor, then setting to a larger number will take advantage of the additional CPUs. Note however that if you set ncpus to a number larger than the number of spare CPUs, you will overload the processor – so if you only have one CPU, set this to just 1.

Example input file

Here is an example input file:

# Input file for mapmuts_inferdifferentialpreferences.py
error_control virus_codoncounts.txt
starting_sample mutvirus_codoncounts.txt
control_selection no_serum_codoncounts.txt
selection_vaccinated_serum vaccinated_serum_codoncounts.txt
selection_infected_serum infected_serum_codoncounts.txt
pi_concentration 1.0
epsilon_concentration 1.0
f_concentration 1.0
deltapi_concentration 1.0
minvalue 1e-7
seed 1
nruns 3
nsteps 200000
thin 200
convergence 0.02
stepincrease 4
MCMC_traces MCMC_traces/
preference_plots preference_plots/
outfileprefix None
ncpus 4

Output files

Output files are created with the prefix specified by outfileprefix followed by suffixes shown below. These files have the following suffixes:

  • inferdifferentialpreferences_log.txt : a text file logging the progress of this script.

  • preferences_control_selection.txt : a text file giving the MCMC-inferred preferences \(\pi_{r,a}\) for each amino acid \(a\) at each site \(r\) for the control_selection. Stop codons (denoted by the * character) are included as possible identities. The file also gives the site entropy \(h_r = \sum_a \pi_{r,a} \log_2 \pi_{r,a}\) in bits for each site.

    The first line is a title line that begins with a # character and then lists the column headers separated by tabs. The remaining lines contain the data.

    The columns contain tab-separated values as follows:

    • The first column gives the residue number in 1, 2, ... numbering. The header for this column is SITE.

    • The second column gives the wildtype amino acid at the site (such as N or R). The header for this column is WT_AA.

    • The third column lists the site entropy \(h_r\). The header for this column is SITE_ENTROPY.

    • The remaining 21 columns list the preference \(\pi_{r,a}\) values for each amino acid \(a\) at this site \(r\). The amino acids are listed in alphabetical order according to their one letter codes, and then the final column gives the equilibrium preference for a stop codon (denoted by a * character). The headers for these columns are PI_A, PI_C, PI_D, ..., PI_W, PI_Y, PI_*. Here is an example of a few lines:

      #SITE   WT_AA   SITE_ENTROPY    PI_A    PI_C    PI_D    PI_E    PI_F    PI_G    PI_H    PI_I    PI_K    PI_L    PI_M    PI_N    PI_P    PI_Q    PI_R    PI_S    PI_T    PI_V    PI_W    PI_Y    PI_*
      1   M   0.885694    0.000680047 0.00185968  8.48048e-06 0.00128217  0.0289946   0.000654589 0.0212144   0.0205306   0.0021597   0.00056726  0.876551    0.00130602  0.000789781 0.00144568  0.000414457 0.00071164  0.00126055  0.00167084  0.0352283   0.00176486  0.000905383
      2   A   1.1671  0.709575    0.00177747  3.79503e-05 0.00298854  0.00390524  0.00131246  0.00379599  0.000913006 0.00121497  0.000535209 0.00276335  0.00140061  0.000694694 0.0032134   0.00464972  0.000463924 0.000908368 0.254241    0.00335749  0.00137437  0.000876985
      3   N   3.2593  0.129992    0.000540239 4.2514e-06  0.00171887  0.0212062   0.0006086   0.102743    0.0431489   0.00524637  0.0771062   0.021714    0.0641921   0.00338082  0.0149898   0.0106522   0.13886 0.0247841   0.00607319  0.0500692   0.282389    0.000581555
      
  • preferences_control_selection_credibleintervals_95.txt : a text file giving the median-centered 95% credible intervals on the \(\pi_{r,a}\) values in preferences_control_selection.txt. The first column is SITE and the remaining columns give the upper and lower bounds on the credible intervals for each of the \(\pi_{r,a}\) values as comma-separated numbers (no spaces) in the same order that they are listed in preferences_control_selection.txt. Here is an example:

    #SITE   PI_A_95cred PI_C_95cred PI_D_95cred PI_E_95cred PI_F_95cred PI_G_95cred PI_H_95cred PI_I_95cred PI_K_95cred PI_L_95cred PI_M_95cred PI_N_95cred PI_P_95cred PI_Q_95cred PI_R_95cred PI_S_95cred PI_T_95cred PI_V_95cred PI_W_95cred PI_Y_95cred PI_*_95cred
    1   1.49582e-05,0.00269598  7.2227e-05,0.00604361   4.37378e-06,1.17951e-05 3.26837e-05,0.00433571  0.0114465,0.0491834 1.82621e-05,0.00208322  0.00989126,0.0318659    0.00135281,0.0554989    4.87099e-05,0.00754094  1.41237e-05,0.00202465  0.821886,0.925528   3.69704e-05,0.00458556  1.94864e-05,0.0027967   3.89356e-05,0.00557027  8.44362e-06,0.00134572  1.76935e-05,0.0025792   3.51256e-05,0.00460279  3.68112e-05,0.00664131  0.0270458,0.0403275 5.19957e-05,0.00543017  2.3493e-05,0.00336828
    2   0.683813,0.729505   5.48957e-05,0.00581796  3.3971e-05,4.26404e-05  6.74445e-05,0.0109755   4.8097e-05,0.0145215    5.40342e-05,0.00379731  0.000377183,0.00853322  2.13206e-05,0.00337318  3.52239e-05,0.00407263  1.62767e-05,0.00193362  7.56385e-05,0.00950661  4.20842e-05,0.00481382  1.91134e-05,0.00253192  0.000126937,0.0090789   0.0036588,0.00587762    1.0427e-05,0.0016747    2.57371e-05,0.00350532  0.231977,0.267635   9.01372e-05,0.0116063   3.61905e-05,0.00479547  2.30435e-05,0.0032731
    3   0.122949,0.137744   1.51152e-05,0.00177613  1.65846e-07,1.00925e-05 0.000558957,0.0032513   0.015292,0.0269782  4.06388e-05,0.00157236  0.0925438,0.111993  0.0384322,0.0481046 0.00322823,0.00788021   0.0722691,0.0822557 0.0159843,0.0271257 0.0569958,0.0722337 0.000939467,0.00723918  0.0115211,0.0187761 0.00937067,0.0117235    0.135176,0.142903   0.0211448,0.0287984 0.00433432,0.00801638   0.0457743,0.0548462 0.2689,0.29917  2.91378e-05,0.00163938
    
  • Files with suffixes of the form differentialpreferences_selection_???.txt where the ??? indicates the different selections. For instance, if the Input file specifies:

    selection_vaccinated_serum vaccinated_serum_codoncounts.txt
    selection_infected_serum infected_serum_codoncounts.txt
    

    as the alternative selections, then the following two files are created: differentialpreferences_selection_vaccinated_serum.txt and differentialpreferences_selection_infected_serum.txt. These files give the differential preferences \(\Delta\pi^{s_i}_{r,a}\) for each amino acid \(a\) at each site \(r\) for each selection \(s_i\). The files also give the root mean square (RMS) differential preference at each site for each selection. These are defined as \(\Delta\pi^{s_i}_{\rm{RMS},r} = \sqrt{\sum_a \left(\Delta\pi^{s_i}_{r,a}\right)^2}\). The minimal value of the RMS differential preference is zero, and the maximal value is 2.

    The first line is a title line that begins with a # character and then lists the column headers separated by tabs. The remaining lines contain the data.

    The columns contain tab-separated values as follows:

    • The first column gives the residue number in 1, 2, ... numbering. The header for this column is SITE.

    • The second column gives the wildtype amino acid at the site (such as N or R). The header for this column is WT_AA.

    • The third column lists the RMS differential preference for that site, \(\Delta\pi^{s_i}_{\rm{RMS},r}\). The header for this column is RMS_dPI.

    • The remaining 21 columns list the differentila preference \(\Delta\pi^{s_i}_{r,a}\) values for each amino acid \(a\) at this site \(r\) for the selection pressure \(s_i\) described in this file. The amino acids are listed in alphabetical order according to their one letter codes, and then the final column gives the equilibrium preference for a stop codon (denoted by a * character). The headers for these columns are dPI_A, dPI_C, dPI_D, ..., dPI_W, dPI_Y, dPI_*. Here is the header:

      #SITE   WT_AA   RMS_dPI    dPI_A    dPI_C    dPI_D    dPI_E    dPI_F    dPI_G    dPI_H    dPI_I    dPI_K    dPI_L    dPI_M    dPI_N    dPI_P    dPI_Q    dPI_R    dPI_S    dPI_T    dPI_V    dPI_W    dPI_Y    dPI_*
      
  • For each selection pressure, there is also a differentialpreferences_selection_???_credibleintervals_95.txt. These are text files giving the median-centered 95% credible intervals on the \(\Delta\pi^{s_i}_{r,a}\) values in the differentialpreferences_selection_???.txt files. The first column is SITE and the remaining columns give the upper and lower bounds on the credible intervals for each of the \(\Delta\pi^{s_i}_{r,a}\) values as comma-separated numbers (no spaces). Here is the header:

    #SITE   dPI_A_95cred dPI_C_95cred dPI_D_95cred dPI_E_95cred dPI_F_95cred dPI_G_95cred dPI_H_95cred dPI_I_95cred dPI_K_95cred dPI_L_95cred dPI_M_95cred dPI_N_95cred dPI_P_95cred dPI_Q_95cred dPI_R_95cred dPI_S_95cred dPI_T_95cred dPI_V_95cred dPI_W_95cred dPI_Y_95cred dPI_*_95cred
    
  • If the preference_plots option is being used, then plots are created showing the preferences \(\pi_{r,a}\) for the control_sample and the differential preferences \(\Delta\pi_{r,a}^{s_i}\) for each selection \(s_i\). These plots are in the subdirectory specified by preference_plots. Within this subdirectory, they have names beginning with outfileprefix and then having the suffix preferences_control_selection_1.pdf, preferences_control_selection_2.pdf, ... for residues 1, 2, ... for the control_sample preferences. For the differentail preferences, the names are differentialpreferences_selection_???_1.pdf, differentialpreferences_selection_???_2.pdf, ... where ??? gives the name of the selection pressure. These plots show the posterior mean and the median-centered 95% credible intervals.

  • If the MCMC_traces option is being used, then plots are created showing the traces of the control_sample preferences and the differential preferences during the MCMC. These plots are in the subdirectory specified by MCMC_traces, and have names beginning with the prefix given by outfileprefix followed by the suffixes preferences_control_selection_1.pdf, preferences_control_selection_2.pdf`, ... for the control_sample preferences, and followed by the suffix differentialpreferences_selection_???_1.pdf, differentialpreferences_selection_???_2.pdf for the other selection pressures where ??? indicates the name of the selection pressure. These plots show traces of the values during the MCMC.