phyloExpCM_runcodonPhyML.py

This script allows you to estimate a tree topology using codonPhyML. It has been tested with codonPhyML version 1.00 201306.18.

To run the script, create an input file of the format described below. Then run the script followed by the input file, as in:

phyloExpCM_runcodonPhyML.py infile.txt

Format of the input file

The input file specifies key / value pairs on individual lines in the format:

key1 value1
key2 value2

Blank lines or lines that begin with # are ignored (i.e. as comment lines). The input file should contain the following keys:

  • seqfile should specify the name of a FASTA file with the sequences that we want to analyze. Each must have a unique header of less than 100 characters. These characters should not include any spaces or any of the following: :()," – an exception will be raised if these conditions are not met. In addition, the character * has special meaning if it is found at the end of a header (it specifies the sequence is an outgroup), so only include this if you know what you are doing. The sequences are assumed to be aligned coding sequences – an exception will be raised if they are not all of the same length.

  • codonPhyML_path should specify the path to the codonPhyML program. For example, this might just be codonphyml if the program is installed in the search path. Or it could be an absolute path name if the executable is not installed in the search path.

  • seed specifies the integer random number seed passed to codonPhyML. Setting the same seed for identical runs should guarantee identical output.

  • model specifies the codon substitution model to use with codonPhyML. Valid values are:

    • GY94_CF3x4 : The Goldman and Yang 1994 codon model with the equilibrium codon frequencies estimated empirically using the CF3x4 method, with kappa (the transition / transversion ratio) estimated by maximum likelihood, and with omega (the dN/dS ratio) estimated as a single parameter by maximum likelihood.
    • GY94_CF3x4_omega-gamma4 : The Goldman and Yang 1994 codon model with the equilibrium codon frequencies estimated empirically using the CF3x4 method, with kappa (the transition / transversion ratio) estimated by maximum likelihood, and with omega (the dN/dS ratio) drawn from four gamma-distributed categories with the distribution shape parameter and mean estimated by maximum likelihood.
    • KOSI07_F_omega-gamma4 : The Kosiol et al, 2007 empirical codon model with the equilibrium codon frequencies estimated empirically using the F method, with kappa(tv) (the relative decrease in transversions versus transitions) estimated by maximum likelihood, and with omega (the elevation in nonsynonymous over synonymous) drawn from four gamma-distributed categories with the distribution shape parameter and mean estimated by maximum likelihood. This is based on the ECM+F+omega+1kappa(tv) model described by Kosiol et al, 2007.
  • outprefix specifies the prefix for the names of files created by this script. Any existing files with these names are overwritten. The names and contents of theese files are specified in the output files section below.

Example input file

Here is an example input file:

# Example input file for phyloExpCM_runcodonPhyML.py
seqfile Human_NPs.fasta
codonPhyML_path codonphyml
seed 1
outprefix Human_NPs_codonphyml
model GY94_CF3x4_omega-gamma4

Output files

The output is as follows:

  • Basic progress of the script is written to standard output.

  • Files are created with the prefix outprefix and the following suffixes:

    • _config.drw is the darwin format input file used to run codonphyml.
    • _output.txt is the output of codonphyml.
    • _tree.newick is the inferred tree in Newick format. This tree contains branch supports calculated using the SH-aLRT method.
    • _stats.txt contains the statistics from the inference, including all parameter values.
    • _loglikelihood.txt contains a single number representing log likelihood.