Script for parsing mutational paths from .trees files created using BEAST with the Markov Jumps mutation-mapping feature.
The .trees files contain trees with mutations mapped to branches as created by BEAST. With specification of a starting and ending sequence (startseq and endseq), each tree represents a specific mutational path through sequence space from the starting to the ending sequence. For each tree (possibly excluding some initial trees as specified by burnin below), the mutational path is extracted and written to the text file mutpathsfile. In addition, all of the non burnin trees are also written without the branch mutation annotations to the new .trees file mergedtreesfile that can later be analyzed by the TreeAnnotator.
The input .trees files can either be those directly produced by BEAST, or can have been pre-processed by mutpath_compact_trees.py. However, if they have not been pre-processed by mutpath_compact_trees.py then running this script may consume a very large amount of memory.
To run this script, first create a text infile of the format described below in this documentation. If you name this input file infile.txt, then you would run the script with the single input file as an argument:
mutpath_get_paths.py infile.txt
If the script is not executable on your platform, you can use:
python mutpath_get_paths.py infile.txt
The input file should contain a series of keys (strings without spaces) that are followed by their values. Lines that begin with # are considered comments and are ignored, as are blank lines. The file should contain the following keys:
Here is an example input file:
# Input file to mutpath_get_paths.py
#
# List of six input files
intreefiles prots_BEASTrun_1/prots_compact.trees prots_BEASTrun_2/prots_compact.trees prots_BEASTrun_3/prots_compact.trees prots_BEASTrun_4/prots_compact.trees prots_BEASTrun_5/prots_compact.trees prots_BEASTrun_6/prots_compact.trees
#
# Remove the first 200 trees as burnin
burnin 200
#
# File to write merged trees after removing burnin, branch annotations
mergedtreesfile prot_trees_merged.trees
#
# FASTA file giving sequences for the strain names in intreefiles
fastafile prots.fasta
#
# Sequence type: DNA or protein
seqtype protein
#
# Name of starting and ending sequences for mutational path
startseq A/Aichi/2/1968_1968.50
endseq A/Brisbane/10/2007_2007.10
#
# File to write the mutational paths
mutpathsfile prot_mutpaths.txt
Running mutpath_get_paths.py creates two output files, mergedtreesfile and mutpathsfile.
mergedtreesfile is a new .trees file that contains all of the trees in the intreefiles, after removing the first burnin trees from each file. It also contains all of the preface and suffix information necessary to allow it to be read by TreeAnnotator. The mutation annotations on the branches have been removed, which shrinks the file to a manageable size (particularly if you have already run mutpath_compact_trees.py on the intreefiles to remove other information). You may want to use mergedtreefiles to generate a maximum-clade credibility tree with TreeAnnotator.
mutpathsfile is a new .txt file that contains the mutational paths. It lists each of the paths numbered starting at 1. Within each path, the mutations are indicated with numbering starting at 1 for the first position in the sequence. The times for the mutations, the starting and ending strains, and the most recent common ancestor of these two strains, are also indicated. These times are measured in units before the most recent tip node (so the root node would have the largest value of time). Here is an example of a file with two paths:
MUTPATH 1
startstrain_name A/Aichi/2/1968_1968.50
startstrain_seq ATGGCAATGGGCTAA
startstrain_time 42.5
endstrain_name A/Brisbane/10/2007_2007.10
endstrain_seq ATGACGATTGGATAA
endstrain_time 3.9
commonancestor_seq ATGGCGATGGGCTAA
commonancestor_time 43.12713
startstrain_to_commonancestor_path
A6G : 42.713
commonancestor_to_endstrain_path
G9T : 31.732
G4A : 25.1343
C12A : 10.134
MUTPATH 2
startstrain_name A/Aichi/2/1968_1968.50
startstrain_seq ATGGCAATGGGCTAA
startstrain_time 42.5
endstrain_name A/Brisbane/10/2007_2007.10
endstrain_seq ATGACGATTGGATAA
endstrain_time 3.9
commonancestor_seq ATGGCGATGGGCTAA
commonancestor_time 44.12713
startstrain_to_commonancestor_path
A6G : 42.113
G9T : 43.124
commonancestor_to_endstrain_path
G4A : 21.1343
C5A : 19.531
A5C : 19.402
C12A : 9.134