mutpath_get_paths.py

Script for parsing mutational paths from .trees files created using BEAST with the Markov Jumps mutation-mapping feature.

The .trees files contain trees with mutations mapped to branches as created by BEAST. With specification of a starting and ending sequence (startseq and endseq), each tree represents a specific mutational path through sequence space from the starting to the ending sequence. For each tree (possibly excluding some initial trees as specified by burnin below), the mutational path is extracted and written to the text file mutpathsfile. In addition, all of the non burnin trees are also written without the branch mutation annotations to the new .trees file mergedtreesfile that can later be analyzed by the TreeAnnotator.

The input .trees files can either be those directly produced by BEAST, or can have been pre-processed by mutpath_compact_trees.py. However, if they have not been pre-processed by mutpath_compact_trees.py then running this script may consume a very large amount of memory.

To run this script, first create a text infile of the format described below in this documentation. If you name this input file infile.txt, then you would run the script with the single input file as an argument:

mutpath_get_paths.py infile.txt

If the script is not executable on your platform, you can use:

python mutpath_get_paths.py infile.txt

Input file format

The input file should contain a series of keys (strings without spaces) that are followed by their values. Lines that begin with # are considered comments and are ignored, as are blank lines. The file should contain the following keys:

  • intreefiles should specify a list of strings giving the names of one or more .trees files. These file names cannot contain spaces. These should be the .trees files created by running BEAST with the Markov Jumps mutation mapping. The .trees files can optionally have been compacted using mutpath_compact_trees.py. Note that all of these .trees files must be for the same input files, and so should be identical up to the beginning of the tree entries.
  • burnin should specify how many trees from each file in intreefiles are removed as burn-in (i.e. equilibration). Note that this is the number of trees that are removed, not the number of MCMC steps. So for example, if you ran an MCMC of 20,000,000 steps sampling at every 10,000 steps, there are 2,000 trees in each file. So t remove the first 10%, you would set burnin to 200.
  • mergedtreesfile should specify the name of the new .trees file we create that contains all of the trees specified in intreefiles after removing the burnin trees from each file, and after removing the branch annotations. This file may be rather large if you have not first run mutpath_compact_trees.py on the intreefiles. If mergedtreefiles already exists, it is overwritten.
  • fastafile should specify a FASTA file with headers equal to the sequence names used to create the BEAST input files used to generate the intreefiles, and then giving the full sequences. This file is used to annotate the sequences in the intreesfiles. In other words, when looking at the intreefiles taxa labels, each taxa label must correspond to a header in fastafile, and then have the sequence provided.
  • startseq should specify the name of the starting sequence for the mutational path. This should be the same sequence name used in BEAST input file, and so should be present in the taxa labels in the .trees file.
  • endseq should specify the name of the ending sequence for the mutational path. Like startseq, it should be the same sequence name used in the taxa labels in the .trees file.
  • seqtype should specify the sequence type: it can be “DNA” or “protein”.
  • mutpathsfile should specify the name of the new .txt file that is created and contains all of the mutational paths. The format is described below. If mutpathsfile already exists, it is overwritten.

Example input file

Here is an example input file:

# Input file to mutpath_get_paths.py
#
# List of six input files
intreefiles prots_BEASTrun_1/prots_compact.trees prots_BEASTrun_2/prots_compact.trees prots_BEASTrun_3/prots_compact.trees prots_BEASTrun_4/prots_compact.trees prots_BEASTrun_5/prots_compact.trees prots_BEASTrun_6/prots_compact.trees
#
# Remove the first 200 trees as burnin
burnin 200
#
# File to write merged trees after removing burnin, branch annotations
mergedtreesfile prot_trees_merged.trees
#
# FASTA file giving sequences for the strain names in intreefiles
fastafile prots.fasta
#
# Sequence type: DNA or protein
seqtype protein
#
# Name of starting and ending sequences for mutational path
startseq A/Aichi/2/1968_1968.50
endseq A/Brisbane/10/2007_2007.10
#
# File to write the mutational paths
mutpathsfile prot_mutpaths.txt

Output files

Running mutpath_get_paths.py creates two output files, mergedtreesfile and mutpathsfile.

  • mergedtreesfile is a new .trees file that contains all of the trees in the intreefiles, after removing the first burnin trees from each file. It also contains all of the preface and suffix information necessary to allow it to be read by TreeAnnotator. The mutation annotations on the branches have been removed, which shrinks the file to a manageable size (particularly if you have already run mutpath_compact_trees.py on the intreefiles to remove other information). You may want to use mergedtreefiles to generate a maximum-clade credibility tree with TreeAnnotator.

  • mutpathsfile is a new .txt file that contains the mutational paths. It lists each of the paths numbered starting at 1. Within each path, the mutations are indicated with numbering starting at 1 for the first position in the sequence. The times for the mutations, the starting and ending strains, and the most recent common ancestor of these two strains, are also indicated. These times are measured in units before the most recent tip node (so the root node would have the largest value of time). Here is an example of a file with two paths:

    MUTPATH 1
    startstrain_name A/Aichi/2/1968_1968.50
    startstrain_seq ATGGCAATGGGCTAA
    startstrain_time 42.5
    endstrain_name A/Brisbane/10/2007_2007.10
    endstrain_seq ATGACGATTGGATAA
    endstrain_time 3.9
    commonancestor_seq ATGGCGATGGGCTAA
    commonancestor_time 43.12713
    startstrain_to_commonancestor_path
    A6G : 42.713
    commonancestor_to_endstrain_path
    G9T : 31.732
    G4A : 25.1343
    C12A : 10.134
    
    MUTPATH 2
    startstrain_name A/Aichi/2/1968_1968.50
    startstrain_seq ATGGCAATGGGCTAA
    startstrain_time 42.5
    endstrain_name A/Brisbane/10/2007_2007.10
    endstrain_seq ATGACGATTGGATAA
    endstrain_time 3.9
    commonancestor_seq ATGGCGATGGGCTAA
    commonancestor_time 44.12713
    startstrain_to_commonancestor_path
    A6G : 42.113
    G9T : 43.124
    commonancestor_to_endstrain_path
    G4A : 21.1343
    C5A : 19.531
    A5C : 19.402
    C12A : 9.134
    

Table Of Contents

Previous topic

mutpath_compact_trees.py

Next topic

mutpath_annotate_tree.py

This Page