phyloExpCM_optimizeHyphyTree.py

This script takes a tree of known topology and uses HYPHY to optimize the branch lengths and any substitution model parameters. It can use several common codon models as well as the experimentally based codon models with known mutation rates. However, for experimentally based evolutionary models it will often be more convenient to instead use the newer script phyloExpCM_ExpModelOptimizeHyphyTree.py – and use of this latter script is necessary if the mutation rates are unknown.

This script has been tested with HYPHY versions 2.112*. This script uses several HYPHY include batch files (*.ibf files) that are distributed with the phyloExpCM package in the ./src/data/ subdirectory, and should be installed in a location that the script can find if you have installed this package as instructed. If HYPHY fails with messages about being unable to find included files, maybe this didn’t work.

To run the script, create an Input file as described below. Then run the script from the command line using the input file as the single argument, as in:

phyloExpCM_optimizeHyphyTree.py infile.txt

Note that running this script may take quite a while if HYPHY takes a long time to run. HYPHY can also start to consume a lot of memory for relatively big input trees – so you may need to use a computer with a large amount of RAM.

Note that if you are running multiple versions of this script then you should place them in separate subdirectories. This is important because otherwise the scripts may create temporary files with the same names that interfere with / overwrite each other.

Input file

The input file specifies key / value pairs on individual lines in the format:

key1 value1
key2 value2

Blank lines or lines that begin with # are ignored (i.e. as comment lines).

The input file should contain the following keys:

  • hyphypath should contain a command that can be used to run HYPHY in batch mode. If you have installed HYPHY in the current search path, this might be something like HYPHYMP or HYPHYMP CPU=4 (the latter command specifying how many CPUs to use for this multi-thread version) or HYPHYDEBUG. If HYPHY is not installed in the current search path, you will need to specify a full path to the executable.

  • hyphycmdfile is the name of the created HYPHY batch mode command file that is used to run HYPHY by this script.

  • hyphyoutfile is the name of the created output file that gives the results of the HYPHY analysis.

  • hyphytreefile is the name of the created output file containing the HYPHY optimized tree.

  • hyphydistancesfile is the name of a created output file file that shows the pairwise distances between all pairs of sequences (those in fastafile). Note that these are not the distances from the created HYPHY tree in hyphytreefile. Rather, the substitution model has all parameters frozen to the maximum-likelihood values that are determined by HYPHY during the optimization of hyphytreefile. With these optimized substitution model parameters fixed, HYPHY is then used to estimate the maximum likelihood pairwise distance between each pair of sequences (essentially, the branch length between them in a two-sequence tree). These pairwise distances are then written hyphydistancesfile in the format described in the Output files section below. If you do not want to estimate pairwise distances, then set hyphydistancesfile to None.

    Also, note that pairwise distances can ony be calculated for substitution models in which all parameters are global to all branches (the reason is that the model parameters cannot all be fixed if they are branch local). So if you are using a substitution model with local branch lengths (for instance GY94_CF3x4_omega-branchlocal-one_rates-gamma6), then no pairwise distances can be calculated. In this case, no hyphydistancesfile will be created regardless of whether you specify None or no file name.

    It is also possible to use hyphydistancesfile to calculate pairwise distances to only certain sequences in fastafile. This can be particularly advantageous when using one of the experimental codon models (specified by the models parameter). The reason is that HYPHY can take a long time to calculate pairwise distances for experimental codon models. If you only want to calculate pairwise distances to certain sequences, then include the name of these sequences following they hyphydistancesfile key and the file to be created, as in:

    hyphydistancesfile hyphy_distances.txt A/Brevig_Mission/1918_(H1N1) A/Brisbane/10/2007_(H3N2)
    

    The aforegoing line would cause the script to create the file hyphy_distances.txt, but that file would only include pairwise distances to the two indicated sequences from all other sequences in the file (so if there are N sequences there would be 2N - 1 distances). Note that the names of the sequences that are specified must exactly match full headers in fastafile or an exception is raised.

    Note also that neither the name of the file specified by hyphydistancesfile or any of the specified sequences can contain spaces.

  • fastafile is the name of a FASTA file containing the sequences that comprise the tip nodes. These should be nucleotide coding sequences aligned at the codon level (not at the nucleotide level). Stop codons should be removed, and the aligned sequence length should be a multiple of three. Ambiguous nucleotide identities are not supported. The headers for the sequences should match the names of the tip nodes in treefile.

  • treefile is a Newick-format phylogenetic tree giving the relationship among the sequences in fastafile. The names of the tip nodes must match those in fastafile. Branch lengths can be specified, but they will be optimized by HYPHY. Branch supports can be present, but they are ignored and removed in the output tree.

  • siteslist is the name of a file that specifies all of the codon sites that are included in the analysis in 1, 2, ... numbering. The first line of this file should be a header beginning with a # character, and is skipped. The remaining lines should begin with an integer >= 1 that gives a site number, followed by whitespace – anything following this whitespace is ignored. All of the numbers in this first column are taken to be the numbers of the codon sites included. You can use the *_equilibriumpreferences.txt file generated by the mapmuts script mapmuts_inferpreferences.py and used as input to phyloExpCM_buildHyphyExpCM.py as the value for this argument, since it satisfies these formatting requirements.

  • model specifies the codon substitution model used by HYPHY. Valid values are:

    • experimental hyphyExpCMs.ibf specifies that we use an experimentally determined codon model. The first part of this value (the string experimental) indicates that we are using an experimentally determined model. The second part of this value should be the name of a file (no spaces) that specifies the codon model in HYPHY batch format for each site in sites. Typically, this would have been generated using the phyloExpCM_buildHyphyExpCM.py script. This file either must exist in the current directory or you must provide a full path name, as the file will be included by HYPHY in its analysis.

    • experimental_randomize1 hyphyExpCMs.ibf specifies that we use a experimentally determined codon model, but that we randomize the assignment of experimentally determined site models to different sites. This is done by randomly shuffling the position indices in the alignment. As a result, each site model will be assigned to some random site – not necessarily the one that is supposed to match. You would use this is a control. If the experimental models are good, we might expect them to perform well without randomization but poorly with randomization. For this option, the second argument again specifies the codon model in HYPHY batch file format for each site in sites as generated by phyloExpCM_buildHyphyExpCM.py. The number after randomize in the first argument is the integer seed used to seed the random number generator – different seeds will give different results due to different randomization orders.

    • A number of variants of the Goldman and Yang 1994 (GY94) codon model as delineated below.

      The equilibrium codon frequencies are determined empirically from the frequencies of nucleotides at the three nucleotide positions using either the CF3x4 or the F3x4 model. In general, the CF3x4 model should be preferred unless you have a reason to use F3x4, since the former model corrects numerical inconsistencies with the latter.

      The transition/transversion ratio kappa has one global value applied to all sites and branches which is estimated by maximum likelihood.

      The non-synonymous / synonymous substitution rate ratio omega can be set to either be global so that it is shared among all branches, or can be set to be branchlocal so that each branch has its own value of omega. Taking the branch local approach will obviously greatly increase the number of parameters. In addition, for both global and branchlocal, it can be set to one for one ratio, or gamma with an integer indicating the number of rate categories (such as gamma6) drawn from a discrete gamma distribution as originally described in Yang 1994.

      In addition, there can be rate variation across sites. Currently, it is only supported to allow rate variation where the rates are drawn with equal probability from a discrete gamma distribution as originally described in Yang 1994. The gamma distribution alpha parameter is estimated by maximum likelihood. This is specified by the word rates followed by one if there is only one rate (no rate variation). Otherwise rates is followed by gamma with an integer indicating the number of rate categories, such as gamma6 (probably a good value).

      Based on these descriptions, here are some valid model specification strings (other combination also are allowed following this format):

      • GY94_CF3x4_omega-global-one_rates-one is CF3x4 model for equilibrium codon frequencies, one global omega (dN/dS) ratio, and one rate for all sites.
      • GY94_CF3x4_omega-global-one_rates-gamma6 is CF3x4 model of equilibrium codon frequencies, one global omega (dN/dS ratio), and rate variation across sites modeled by six categories drawn from a gamma distribution with the shape parameter estimated by maximum likelihood.
      • GY94_CF3x4_omega-global-gamma6_rates-one is CF3x4 model of equilibrium codon frequencies, an omega (dN/dS) ratio that is drawn from six gamma-distributed categories with both the mean and the shape estimated by maximum likelihood, and one rate for all sites.
      • GY94_CF3x4_omega-global-gamma6_rates-gamma6 is CF3x4 model of equilibrium codon frequencies, an omega (dN/dS) ratio that is drawn from six gamma-distributed categories with both the mean and the shape estimated by maximum likelihood, and rate variation across sites drawn from six gamma-distributed categories with the shape prameter estimated by maximum likelihood.
      • GY94_CF3x4_omega-branchlocal-one_rates-gamma6 is like GY94_CF3x4_omega-global-one_rates-gamma6 except now each branch gets its own single dN/dS ratio omega that is estimated by maximum likelihood.
      • GY94_CF3x4_M1a_rates-one is the M1a model described by Yang et al 2005.
      • GY94_CF3x4_M2a_rates-one is the M2a model described by Yang et al 2005.
      • GY94_CF3x4_M7_rates-one is the M7 model described by Yang et al 2005 with 10 equal probability categories in the beta distribution.
      • GY94_CF3x4_M8_rates-one is the M8 model described by Yang et al 2005 with 10 equal probability categories in the beta distribution plus one additional discrete category.

      Note that it is NOT allowed to combined branchlocal with gamma for the omega as this introduces too many parameters. So the following would NOT be valid: GY94_CF3x4_omega-branchlocal-gamma6_rates-one.

    • A number of variants of the KOSI07 empirical codon model described in Kosiol et al, 2007. That reference delineates a large list of model variants. The ones implemented here are all variants of what is described as the ECM+F+omega+1kappa(tv) in the original reference. Specifically, the codon exchangeabilities are those given by Kosiol et al, 2007, and allow multi as well as single-nucleotide codon changes. The equilibrium codon frequencies are empirically estimated as the frequencies in the data (the +F option). There is a parameter omega scaling non-synonymous substitution rates relative to synonymous rates. Codon changes with ntv transversions are scaled to a rate of kappa^ntv relative to mutations without transversions. Similarly to the GY94 models, omega can be global to all branches with one estimated value, global to all branches with gamma distributed values, or local to branches with one estimated value per branch. The substitution rate can have a single value (no parameters in this case since it just involves branch length scaling) or have gamma distributed values with an estimated shape parameter. kappa is always global and estimated from the data. These options are specified with suffixes similar to the GY94 model, such as:

      • KOSI07_F_omega-global-gamma6_rates-gamma6
      • KOSI07_F_omega-branchlocal-one_rates-one
  • persitelikelihoods is an optional argument. If it is None, False, or not specified, the nothing is done. Otherwise, set it to the name of a file that you would like to create that reports the per-site likelihoods. Specifically, the created file lists the per-site likelihood for each site after the mutation rates and branch lengths have been optimized by maximum likelihood. These parameters and branch lengths are not re-optimized for each site – rather, they are set to their global best values and then the likelihood contribution of each site is computed. The per-site likelihoods are listed in tab-delimited columns with each site on a separate line, and the first column giving the site number and second giving the per-site likelihood. The first line in the file is a header.

Example input file

Here is an example input file:

# Example input file for phyloExpCM_optimizeHyphyTree.py
hyphypath HYPHYMP CPU=4
hyphycmdfile hyphy_cmds.bf
hyphyoutfile hyphy_output.txt
hyphytreefile hyphy_tree.newick
hyphydistancesfile hyphy_distances.txt
fastafile Human_NPs.fasta
treefile Human_NPs_codonphyml_tree.newick
siteslist equilibriumpreferences.txt
model experimental hyphyExpCMs_FracTolerated.ibf

Output files

Some summary output is printed to standard output. In addition, the following output files are created. If any of these files already exist, they are overwritten:

  • hyphycmdfile is a HYPHY batch file that contains the commands used to run HYPHY in batch mode.

  • hyphyoutfile is an output file that contains the results of the HYPHY analysis. This file may contain many lines dealing with branch lengths and tree topology that you will probably want to ignore in favor of just looking at hyphytreefile, but you will want to look at the first few lines of this file, which will have the following format:

    Log likelihood: -671.317
    independent parameters (includes branch lengths): 45
    shared parameters: 0
    number of branch lengths: 45
    number of tip nodes: 24
    number of internal branches: 21
    
  • hyphytreefile is an output file that contains the tree with the HYPHY optimized branch lengths in Newick format. The branch lengths are t parameters from HYPHY. For experimentally determined codon models, these are equal to the expected substitutions averaged over all sites. For GY94 models, these are proportional (not equal) to the expected number of substitutions at the site if omega is global but not if it is branchlocal. Any branch supports present in the original tree are removed.

  • hyphydistancesfile is an output file that shows the pairwise distances between all sequences in fastafile. If there are N such sequences, then there are \(\frac{N (N-1)}{2}\) pairwise distances, and so hyphydistancesfile contains this many lines. However, if specific headers are specified following hyphdistancesfile such that we only estimate distances for a few sequences, then there will be fewer lines – for instance, if two sequences are specified, then we have 2N - 1 distances. Each line has three entries:

    1. The name of the first sequence (the full header in fastafile)
    2. The name of the second sequence (the full header in fastafile)
    3. The pairwise distance between the sequences.

    These entries are separated by a tab.

    For example, here are a few possible lines:

    Sequence1   Sequence2    0.800143
    Sequence1   Sequence3    0.92947
    Sequence2   Sequence3    2.13376
    
  • If you use the persitelikelihoods option, then an output file is created that shows the per-site likelihoods. The format is as follows:

    #SITE   SITE_LOG_LIKELIHOODS
    1   -4.792906363107919
    2   -15.89808216031635
    3   -17.19022748421175
    
  • The script also creates several temporary files that re-map the sequences in fastafile and treefile to names that are usable by HYPHY (which only accepts alphanumeric characters and underscores). Specifically, these files begin with the prefix _codenames_ and then are followed by suffixes which are the names of fastafile, treefile, and hyphydistancesfile. If the script terminates normally, these files are then deleted – but they will be present when the script is running, and will be preserved if it halts prematurely to aid in debugging.