phyloExpCM_multiHyphyRuns.py

This script runs phyloExpCM_optimizeHyphyTree.py multiple times on the same tree and sequences, each time with a different substitution model, so that the models can be compared. For details on most of the options, see the documentation for phyloExpCM_optimizeHyphyTree.py.

Each run is performed in its own subdirectory, which is created by this script. All of the HYPHY results for a run are placed in that run's subdirectory. The one exception is summaryfile, which is created in the directory from which the script is run.

To run the script, create an input file of the format described below. Then run the script followed by the input file, as in:

phyloExpCM_multiHyphyRuns.py infile.txt

Note that running this script may take quite a while if HYPHY takes a long time to run. HYPHY can also start to consume a lot of memory for relatively big input trees.

Format of the input file

The input file specifies key / value pairs on individual lines in the format:

key1 value1
key2 value2

Blank lines and lines that begin with # are ignored (the latter serve as comment lines). The entries should appear in the order listed below and shown in the example input file.
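The line format described above can be read with a few lines of Python. The following is a minimal sketch, not the parser used by phyloExpCM_multiHyphyRuns.py itself. The function name parse_key_value_lines is hypothetical, and treating a # preceded by whitespace as a comment is an assumption.

```python
def parse_key_value_lines(text):
    """Return (key, value) pairs in file order.

    Blank lines and lines beginning with # are skipped. The key is the
    first whitespace-delimited token; the value is the rest of the line,
    so values may themselves contain spaces (e.g. "HYPHYMP CPU=2").
    """
    pairs = []
    for line in text.splitlines():
        stripped = line.strip()
        # Assumption: leading whitespace before # still marks a comment.
        if not stripped or stripped.startswith("#"):
            continue
        parts = stripped.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        pairs.append((key, value))
    return pairs
```

A list of pairs is returned rather than a dictionary because the order of the entries matters in this input file.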

The input file should contain the following lines:

  • The first eight lines should specify the keys listed below. These keys have exactly the same meaning as for the phyloExpCM_optimizeHyphyTree.py script. These keys are:

    • hyphypath
    • hyphycmdfile : should be relative, not absolute, file name
    • hyphyoutfile : should be relative, not absolute, file name
    • hyphytreefile : should be relative, not absolute, file name
    • hyphydistancesfile : should be relative, not absolute, file name. Make this None if you do not want to create these files. Note that no file will be created for substitution models with branch local parameters.
    • fastafile
    • treefile
    • siteslist
  • The key summaryfile specifies the name of the output file (CSV format) that summarizes the results from all of the HYPHY runs. The format is described below under Output files.

  • The key nmulti specifies the integer number of processes that are run at a single time. Each process is a separate run of phyloExpCM_optimizeHyphyTree.py. To run the analyses one at a time, set this value to 1. If you set it to more than one, that many processes will be run simultaneously, which may make sense on a multi-processor machine. When choosing the value, note that some HYPHY executables themselves use multiple processes (such as HYPHYMP CPU=2), and that HYPHY runs can consume large amounts of RAM, which is sometimes the limiting factor.

  • The remaining lines should list the different substitution models that are used. Each of these lines should begin with an entry that gives the directory (no spaces) that will be created for the run. It should then be followed by text that provides a valid value for the model parameter for input to phyloExpCM_optimizeHyphyTree.py. This script will go through each of these listed substitution models, create the directory, and then run phyloExpCM_optimizeHyphyTree.py within that directory. For example, the entry:

    GY94 GY94_CF3x4_omega-global-one_rates-one
    

    will lead to the creation of the subdirectory ./GY94/ and then the running of phyloExpCM_optimizeHyphyTree.py within that subdirectory using GY94_CF3x4_omega-global-one_rates-one as the model input parameter. Each of the directory names must be unique. If any of the directories already exists, an exception is raised. If the experimental model option is used, the specified .ibf file giving the experimental HYPHY substitution model should be in the directory from which phyloExpCM_multiHyphyRuns.py is run. The input file created for each run within its own subdirectory is named phyloExpCM_optimizeHyphyTree_infile.txt.
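The rules for the model lines (a directory name without spaces, followed by the model text, with unique directory names that do not already exist) can be checked before any run starts. The following is a minimal sketch under stated assumptions: the function names split_model_line and check_run_directories are hypothetical, and the exception types are illustrative, since the documentation only says that an exception is raised.

```python
import os


def split_model_line(line):
    """Split 'DIRECTORY MODEL...' into (directory, model).

    The directory is the first whitespace-delimited token. The model is
    the rest of the line and may contain spaces, as in
    'experimental hyphyExpCMs_FracTolerated.ibf'.
    """
    parts = line.strip().split(None, 1)
    if len(parts) != 2:
        raise ValueError("expected a directory and a model: %r" % line)
    return parts[0], parts[1].strip()


def check_run_directories(model_lines, basedir="."):
    """Return [(directory, model), ...] after checking the directory names."""
    runs = [split_model_line(line) for line in model_lines]
    names = [directory for directory, _ in runs]
    duplicates = sorted({d for d in names if names.count(d) > 1})
    if duplicates:
        # Exception type is an assumption made for this sketch.
        raise ValueError("duplicate directory names: %s" % ", ".join(duplicates))
    existing = [d for d in names if os.path.exists(os.path.join(basedir, d))]
    if existing:
        # Exception type is an assumption made for this sketch.
        raise IOError("directories already exist: %s" % ", ".join(existing))
    return runs
```

Checking every name up front means that a clash is reported before any HYPHY run has been started.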

Example input file

Here is an example input file:

# Example input file for phyloExpCM_multiHyphyRuns.py
hyphypath HYPHYMP CPU=2
hyphycmdfile hyphy_cmds.bf
hyphyoutfile hyphy_output.txt
hyphytreefile hyphy_tree.newick
hyphydistancesfile hyphy_distances.txt
fastafile Human_NPs.fasta
treefile Human_NPs_codonphyml_tree.newick
siteslist equilibriumpreferences.txt
summaryfile hyphy_summary.csv
nmulti 6
GY94 GY94_CF3x4_omega-global-one_rates-one
GY94_gamma-rates GY94_CF3x4_omega-global-one_rates-gamma5
GY94_gamma-rates_gamma-omega GY94_CF3x4_omega-global-gamma5_rates-gamma5
GY94_gamma-rates_branchlocal-omega GY94_CF3x4_omega-branchlocal-one_rates-gamma5
KOSI07 KOSI07_F_omega-global-one_rates-one
KOSI07_gamma-rates KOSI07_F_omega-global-one_rates-gamma5
KOSI07_gamma-rates_gamma-omega KOSI07_F_omega-global-gamma5_rates-gamma5
KOSI07_gamma-rates_branchlocal-omega KOSI07_F_omega-branchlocal-one_rates-gamma5
experimental_FracTolerated experimental hyphyExpCMs_FracTolerated.ibf
experimental_HalpernBruno experimental hyphyExpCMs_HalpernBruno.ibf
experimental_FracTolerated_random1 experimental_randomize1 hyphyExpCMs_FracTolerated.ibf
experimental_FracTolerated_random2 experimental_randomize2 hyphyExpCMs_FracTolerated.ibf
experimental_HalpernBruno_random1 experimental_randomize1 hyphyExpCMs_HalpernBruno.ibf
experimental_HalpernBruno_random2 experimental_randomize2 hyphyExpCMs_HalpernBruno.ibf
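With nmulti 6, as in the example above, up to six runs proceed at the same time, each in its own subdirectory. The following sketch shows one way to get that kind of bounded scheduling in Python 3. It is an illustration only and is not taken from phyloExpCM_multiHyphyRuns.py. The function names run_in_subdir and run_all are hypothetical, and the command to execute is passed in by the caller. The log and errors file names are the ones given under Output files.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

LOG_NAME = "phyloExpCM_optimizeHyphyTree_log.txt"
ERR_NAME = "phyloExpCM_optimizeHyphyTree_errors.txt"


def run_in_subdir(subdir, command):
    """Run command with subdir as its working directory.

    Standard output goes to the log file and standard error to the
    errors file, both inside subdir. Returns the exit code.
    """
    log_path = os.path.join(subdir, LOG_NAME)
    err_path = os.path.join(subdir, ERR_NAME)
    with open(log_path, "w") as log, open(err_path, "w") as err:
        return subprocess.call(command, cwd=subdir, stdout=log, stderr=err)


def run_all(subdirs, command, nmulti):
    """Run command once per subdirectory, at most nmulti at a time.

    Threads are sufficient here because each worker only waits on an
    external process. Returns a dict mapping subdirectory to exit code.
    """
    with ThreadPoolExecutor(max_workers=nmulti) as pool:
        codes = list(pool.map(lambda d: run_in_subdir(d, command), subdirs))
    return dict(zip(subdirs, codes))
```

Here command would be a list such as ["phyloExpCM_optimizeHyphyTree.py", "phyloExpCM_optimizeHyphyTree_infile.txt"]. That exact invocation is an assumption based on the file names in this documentation.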

Output files

Some summary output is printed to standard output. In addition, the following output is created:

  • For each of the specified models, the specified subdirectory is created with the input file phyloExpCM_optimizeHyphyTree_infile.txt, the log file phyloExpCM_optimizeHyphyTree_log.txt, and the errors file phyloExpCM_optimizeHyphyTree_errors.txt (this last file is empty if nothing is output to standard error). In addition, the subdirectory contains the output files created by running phyloExpCM_optimizeHyphyTree.py.

  • The file summaryfile lists, for each model, the log likelihood, the number of branch lengths, and the number of independently optimized HYPHY parameters (NOT including the branch lengths).

    Note, however, that this parameter count does NOT include parameters that are not explicitly optimized by HYPHY but are still estimated empirically from the data. For example, for the GY94 method with CF3x4, there are 9 such parameters (three nucleotide frequencies at each of the three codon positions). For the KOSI07 method with F, there are 60 such parameters (the frequencies of 60 of the 61 non-stop codons).

    Here is an example of the output in summaryfile:

    # model, log_likelihood, nbranchlengths, nparameters, nsharedparameters
    GY94, -4441.19, 45, 2, 2
    KOSI07_gamma-rates, -4389.25, 45, 3, 3
    KOSI07, -4410.96, 45, 2, 2
    GY94_gamma-rates, -4429.59, 45, 3, 3
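Because the nparameters column leaves out the empirically estimated parameters, they have to be added by hand before models are compared with a criterion such as AIC (2k - 2 ln L, where k is the number of free parameters). The following is a minimal sketch of that bookkeeping. The function names read_summary and aic are hypothetical, counting the branch lengths in k is a choice made here rather than something this documentation prescribes, and the empirical counts must be supplied by the caller (for example 9 for GY94 with CF3x4 and 60 for KOSI07 with F, as noted above).

```python
def read_summary(text):
    """Parse the contents of summaryfile into a list of dicts.

    Lines beginning with # (the header) and blank lines are skipped.
    Only the first four columns are used.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split(",")]
        rows.append({
            "model": fields[0],
            "log_likelihood": float(fields[1]),
            "nbranchlengths": int(fields[2]),
            "nparameters": int(fields[3]),
        })
    return rows


def aic(row, n_empirical=0):
    """Return AIC = 2k - 2 ln L for one summary row.

    k counts branch lengths, HYPHY-optimized parameters, and the
    caller-supplied number of empirically estimated parameters.
    """
    k = row["nbranchlengths"] + row["nparameters"] + n_empirical
    return 2 * k - 2 * row["log_likelihood"]
```

For the GY94 row in the example, aic(row, n_empirical=9) uses k = 45 + 2 + 9 = 56. Lower AIC values indicate better support after penalizing for the number of parameters.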