mutpath Python API

mutpath is designed so that you can use it simply by running the main scripts, and do not need to interface directly with the Python code. However, if you want to access the Python modules that constitute the package, here is the API.

mutpath Package

This package implements mutpath.

This package is designed for building mutational paths through sequence space using mutation-mapped trees written by BEAST (using the MarkovJumps feature).

Written by Jesse Bloom.

align Module

This module contains functions to run sequence alignment programs.

It can be used to run PROBCONS or MUSCLE.

Written by Jesse Bloom.

align.AddDots(aligned_headers_seqs)

Adds dots at identities in multiple sequence alignment.

Takes as an argument a list of two or more aligned sequences, as would be returned by Align.

Returns a copy of this list. The first sequence in the list is unchanged. In all remaining sequences, dot characters (”.”) have been used to replace any amino acids that are identical to the amino acid at the same position in the first sequence, except for gap characters. Characters are replaced even if they are not of the same case

>>> AddDots([('seq1', '-TGC'), ('seq2', 'AGGC'), ('seq3', '-tac')])
[('seq1', '-TGC'), ('seq2', 'AG..'), ('seq3', '-.a.')]
align.Align(headers_seqs, progpath, program='MUSCLE')

Performs a multiple sequence alignment of two or more sequences.

By default, the protein sequences are aligned using MUSCLE. This program can be used to align either nucleotide or protein sequences. You can also use PROBCONS to align protein sequences.

headers_seqs is a list specifying the names of the sequences that we want to align. Each entry is a 2-tuple (head, seq) where head is a header giving the sequence name and other information (might be empty) and seq is a string giving the protein sequence. The list must have at least 2 entries.

progpath should specify a directory containing the alignment program executable, either PROBCONS or MUSCLE. The PROBCONS executable is assumed to have the name “probcons” in this directory. The MUSCLE executable is assumed to have the name “muscle” in this directory.

program specifies what program to use for the alignment. By default, it is “MUSCLE”. If you wish to use PROBCONS instead, set it to “PROBCONS”.

This executable is used to perform a multiple sequence alignment of the proteins with the default settings of either PROBCONS or MUSCLE. The returned variable is a new list aligned_headers_seqs. Each entry is a 2-tuple (head, aligned_seq). head has the same meaning as on input (the sequence header) and aligned_seq is the aligned sequence, with gaps inserted as ‘-‘ as appropriate. Therefore, all of the aligned_seq entries in aligned_headers_seqs are of the same length. The entries in aligned_headers_seq are in the same order as in the input list headers_seqs.

align.PairwiseStatistics(aligned_headers_seqs)

Computes the number of gaps and identities in a pairwise alignment.

This method is designed to compute statistics about an alignment of two
sequences.

aligned_headers_seqs is a pair of aligned sequences, as would be returned by calling Align with two sequences (which should be of the same length since they have been aligned). That is, it is a list of 2-tuples:

[(head1, alignedseq1), (head2, alignedseq2)]

The method returns the 2-tuple (identities, gaps). identities is a number between zero and one. It is the fraction of residues in one sequence that are aligned with identical residues in the other sequence, gaps not being included in the tally. This is computed by dividing the number of identities by the total length of the aligned sequence excluding gaps. gaps is the fraction of gaps in the alignment. It is the fraction of the positions in the alignment where either sequence has a gap. So it is computed by dividing the total number of gaps by the length of the aligned sequences. Upper and lower case nucleotides are treated equivalently.

>>> print round(PairwiseStatistics([('seq1', 'TGCAT'), ('seq2', 'AG-AT')])[0], 2)
0.75
>>> print round(PairwiseStatistics([('seq1', 'tgcat'), ('seq2', 'AG-AT')])[1], 2)
0.2
align.RemoveDots(aligned_headers_seqs)

Removes dots at identities in a multiple sequence alignment.

This function effectively undoes what can be done by ‘AddDots’.

Takes as an argument a list of two or more aligned sequences in the form of (header, sequence) tuples, as would be returned by Align.

Returns a copy of this list. Any positions where one of the sequences after the first sequence has a ‘.’ character, the amino acid is changed to that found at the same position in the first sequence.

>>> RemoveDots([('seq1', '-TGC'), ('seq2', 'AG..'), ('seq3', '-.a.')])
[('seq1', '-TGC'), ('seq2', 'AGGC'), ('seq3', '-TaC')]
align.StripGapsToFirstSequence(aligned_headers_seqs)

Strips gaps from a reference sequence, and all corresponding alignments.

On input, aligned_headers_seqs should be a set of two or more aligned sequences, as would be returned by Align.

The first sequence in this alignment is taken to correspond to the reference sequence. The returned variable is a list similar to aligned_headers_seqs, but with all positions corresponding to gaps in this reference sequence stripped away. All gaps (‘-‘) characters are removed from this reference sequence. In addition, in all other aligned sequences in aligned_headers_seqs, every character at the same position as a gap in the reference sequence is removed. Therefore, at the end of this procedure, all of the alignments have the same length as the reference sequence with its gaps stripped away. The headers are unchanged. The order of sequences in this stripped alignment is also unchanged.

>>> StripGapsToFirstSequence([('s1', '-AT-A-GC'), ('s2', 'AAT-TAGC'), ('s3', '--T-A-GC')])
[('s1', 'ATAGC'), ('s2', 'ATTGC'), ('s3', '-TAGC')]
align.StripLeadingTrailingGapsToFirstSequence(aligned_headers_seqs)

Strips leading and trailing gaps from first sequence, and trims corresponding alignments.

On input, aligned_headers_seqs should be a set of two or more aligned sequences, as would be returned by Align.

The first sequence in the alignment is taken to correspond to the reference sequence. The returned variable is a list similar to aligned_headers_seqs, but all leading and trailing gaps have been stripped from the reference sequence. A leading gap (‘-‘) is one that precedes the first non-gap character, and a trailing gap is one that follows the last non-gap character. All other sequences have all positions corresponding the leading/trailing gaps of the reference sequence trimmed as well. The headers and order of sequences are preserved.

>>> StripLeadingTrailingGapsToFirstSequence([('s1', '--ATA-GC-'), ('s2', 'TGATTA-CA')])
[('s1', 'ATA-GC'), ('s2', 'ATTA-C')]

beast Module

Module for functions related to reading / writing BEAST files.

This module is designed to interface with the BEAST phylogenetics package.

Written by Jesse Bloom.

beast.WriteXML(xmlfile, seqs, datatype, chainlength, screenloginterval, fileloginterval, usemarkovjumps, sitemodel)

Writes an XML file that can serve as input to BEAST.

Dates are assigned to sequences based on the values specified in the names in seqs.

The file logs are given the same base prefix as that for xmlfile itself, but the extensions are .log and .trees.

xmlfile a string giving the name of the XML file that we create. It is overwritten if it already exists.

seqs a list of sequences in the form (header, seq). These sequences are assumed to have dates assigned, and are specified following a terminal underscore, such as:

A/Nanchang/933/1995_1995.00

datatype the data type for the sequences, might be ‘nucleotide’ or ‘amino acid’.

chainlength integer giving the chain length for the MCMC.

screenloginterval integer specifying the frequency with which we log MCMC information to the screen.

fileloginterval integer specifying the frequency with which we log MCMC information to the file logs.

usemarkovjumps Boolean switch. If True, we use the Markov Jumps feature to write mutations / sequences for the .trees file. This allows reconstruction of mutational paths, but also increases the .trees file size dramatically.

sitemodel the site substitution model. Currently allows ‘JTT’ for proteins and ‘HKY’ for nucleic acids.

Various other options are currently hardcoded in the method. Options could be set to allow them to be changed. Currently:

  • uses a coalescent-based prior on the trees with constant population size with Jeffrey’s prior.
  • uses a strict clock with uniform rates across branches, gamma prior on rate.

io Module

Module for performing input/output operations.

Functions are:

ParseInfile : reads key / value pairs from an input file.

ParseBoolValue : reads Boolean values from input dictionary.

ParseIntValue : reads integer values from input dictionary.

ParseFloatValue : reads float values from input dictionary.

ParseStringValue : reads string values from input dictionary.

ParseSeqValue : reads sequence from file specified by input dictionary.

ParseFileList : reads file list specified by input dictionary.

Written by Jesse Bloom.

src.io.ParseBoolValue(d, key)

Gets Boolean argument from input dictionary.

d is a dictionary of key/value string pairs as returned by ParseInfile.

key is a string in this dictionary that specifies a value that should be True or False.

Returns the Boolean truth value specified by key. Raises a ValueError if key does not exist in d, and a ValueError if it specifies something other than True or False.

>>> d = {'gzipped':'True', 'applyfilter':'False', 'a1':'a1.txt'}
>>> ParseBoolValue(d, 'gzipped')
True
>>> ParseBoolValue(d, 'applyfilter')
False
>>> ParseBoolValue(d, 'a1')
Traceback (most recent call last):
   ...
ValueError: value a1.txt for a1 is not True/False
>>> ParseBoolValue(d, 'otherkey')
Traceback (most recent call last):
   ...
ValueError: did not find a key for otherkey
src.io.ParseFileList(d, key)

Gets list of files from input dictionary.

d is a dictionary of key/value string pairs as returned by ParseInfile.

key is a string in this dictionary that specifies a value that should give a list of one or more file names.

Returns a list of the file names specified by key. If one or more of these files does not exist, raises and IOError. If key does not exist in d, raises and KeyError.

src.io.ParseFloatValue(d, key)

Gets float argument from input dictionary.

d a dictionary of key/value string pairs as returned by ParseInfile.

key is a string in this dictionary that specifies a value that should be convertible to a float.

Returns the float specified by key. Raises a ValueError if key does not exist in d, and a ValueError if it specifies something other than an integer.

>>> d = {'maxn':'2', 'minq':'20.5', 'string':'hello'}
>>> print "%.1f" % ParseFloatValue(d, 'maxn')
2.0
>>> print "%.1f" % ParseFloatValue(d, 'minq')
20.5
>>> ParseFloatValue(d, 'string')
Traceback (most recent call last):
   ...
ValueError: value hello for string is not float
>>> ParseFloatValue(d, 'otherkey')
Traceback (most recent call last):
   ...
ValueError: did not find a key for otherkey
src.io.ParseInfile(f)

Reads key / value pairs from an input file.

f should be a readable file-like object.

Starting at the current position in f, reads all remaining lines until the end of the file-like object. Any blank line or line with a first character of # is disregarded (# indicates a comment line). All other lines should contain exactly two entries, the first being the key and the second being the value. The key is construed to be the first word, and cannot have any spaces. The value is all of the text that follows the key up to the newline. The key / value pairs are returned in a dictionary, as illustrated in the example below. If there is a duplicate key, and exception is raised.

Example of successfully reading two key/value pairs

>>> f = cStringIO.StringIO()
>>> f.write('# comment line followed by blank line\n\n')
>>> f.write('key1 first_value\n')
>>> f.write('# now another key with two values\nkey2 2 value2')
>>> f.seek(0)
>>> ParseInfile(f) == {'key1':'first_value', 'key2':'2 value2'}
True

Example of duplicate key name

>>> f = cStringIO.StringIO()
>>> f.write('# comment line followed by blank line\n\n')
>>> f.write('key1 first_value\n')
>>> f.write('# now another key\nkey2 2')
>>> f.write('\nkey1 1')
>>> f.seek(0)
>>> ParseInfile(f) == {'key1':'first_value', 'key2':'2'}
Traceback (most recent call last):
    ...
ValueError: duplicate key: key1
src.io.ParseIntValue(d, key)

Gets integer argument from input dictionary.

d is a dictionary of key/value string pairs as returned by ParseInfile.

key is a string in this dictionary that specifies a value that should be an integer.

Returns the integer specified by key. Raises a ValueError if key does not exist in d, and a ValueError if it specifies something other than an integer.

>>> d = {'maxn':'2', 'minq':'20.5'}
>>> ParseIntValue(d, 'maxn')
2
>>> ParseIntValue(d, 'minq')
Traceback (most recent call last):
   ...
ValueError: value 20.5 for minq is not integer
>>> ParseIntValue(d, 'otherkey')
Traceback (most recent call last):
   ...
ValueError: did not find a key for otherkey
src.io.ParseSeqValue(d, key)

Reads sequence from FASTA file specified by input dictionary.

d is a dictionary of key/value string pairs as returned by ParseInfile.

key is a string in this dictionary that specifies a value that is a filename of a readable file that contains exactly one sequence in FASTA format.

Returns a string corresponding to the sequence specified in the FASTA file. Raises a ValueError if the filename does not exist, or does not specify exactly one sequence.

src.io.ParseStringValue(d, key)

Reads string argument specified by input dictionary.

d is a dictionary of the key/value string pairs as returned by ParseInfile.

key is a string in this dictionary that specifies a value that is a string.

Returns string corresponding to key. Raises a ValueError if the key does not exist in d.

>>> d = {'outfileprefix':'test_example', 'samplename':'test example'}
>>> ParseStringValue(d, 'outfileprefix')
'test_example'
>>> ParseStringValue(d, 'samplename')
'test example'
>>> ParseStringValue(d, 'otherkey')
Traceback (most recent call last):
   ...
ValueError: did not find a key for otherkey

plot Module

Module for performing plotting for mutpath package.

This module uses pylab and matplotlib to make plots. These plots will fail if pylab and matplotlib are not available for importation. Before running any function in this module, you can run the PylabAvailable function to determine if pylab and matplotlib are available. Otherwise, calling any other function will raise an Exception if thise modules are not available. The pdf backend is used for matplotlib / pylab. This means that plots must be created as PDF files.

Functions are:

PylabAvailable

CumulativeFractionPlot

‘DatesPlot`

Base10Formatter

SplitLabel

Written by Jesse Bloom.

plot.Base10Formatter(number, exp_cutoff, exp_decimal_digits, decimal_digits)

Converts a number into Latex formatting with scientific notation.

Takes a number and converts it to a string that can be shown in LaTex using math mode. It is converted to scientific notation if the criteria specified by exp_cutoff.

number the number to be formatted, should be a float or integer. Currently only works for numbers >= 0

exp_cutoff convert to scientific notation if abs(math.log10(number)) >= this.

exp_decimal_digits show this many digits after the decimal if number is converted to scientific notation.

decimal_digits show this many digits after the decimal if number is NOT converted to scientific notation.

The returned value is the LaTex’ string. If the number is zero, the returned string is simply ‘0’.

>>> Base10Formatter(103, 3, 1, 1)
'103.0'
>>> Base10Formatter(103.0, 2, 1, 1)
'1.0 \\times 10^{2}'
>>> Base10Formatter(103.0, 2, 2, 1)
'1.03 \\times 10^{2}'
>>> Base10Formatter(2892.3, 3, 1, 1) 
'2.9 \\times 10^{3}'
>>> Base10Formatter(0.0, 3, 1, 1) 
'0'
>>> Base10Formatter(0.012, 2, 1, 1)
'1.2 \\times 10^{-2}'
>>> Base10Formatter(-0.1, 3, 1, 1)
Traceback (most recent call last):
    ...
ValueError: number must be >= 0
plot.CumulativeFractionPlot(datalist, plotfile, title, xlabel)

Creates a cumulative fraction plot.

Takes a list of numeric data. Plots a cumulative fraction plot giving the fraction of the data points that are <= the indicated value.

datalist is a list of numbers giving the data for which we are computing the cumulative fraction plot. Raises an exception if this is an empty list.

plotfile is the name of the output plot file created by this method (such as ‘plot.pdf’). The extension must be ‘.pdf’.

title is a string placed above the plot as a title. Uses LaTex formatting.

xlabel is the label given to the X-axis. Uses LaTex formatting.

This function uses pylab / matplotlib. It will raise an Exception if these modules cannot be imported (if PylabAvailable() is False).

plot.DatesPlot(mutdates, plotfile, interval)

Plots dates of mutations.

Uses pylab / matplotlib to plot the dates and credible intervals for mutations. Will raise an error PylabAvailable() == False. The plot is a PDF.

  • mutdates is a list of the mutations, in the form of the tuples (median, mininterval, maxinterval, mut, fractoca, weight). Mutations are plotted in the order they are listed. In these tuples:
    • median : posterior median date
    • minterval : minimum of credible interval
    • maxinterval : maximum of credible interval
    • mut : string giving name of mutation
    • fractoca : probability mutation is on path from common ancestor to starting sequence
    • weight : fraction of paths containing mutation.
  • plotfile is a string giving the name of the PDF file we create.
  • interval is the range of the credible interval. For example, 0.9 means a 90% credible interval.
plot.PylabAvailable()

Returns True if pylab/matplotlib available, False otherwise.

You should call this function to test for the availability of the pylab/matplotlib plotting modules before using other functions in this module.

plot.SplitLabel(label, splitlen, splitchar)

Splits a string with a return if it exceeds a certain length.

label a string giving the label we might split.

splitlen the maximum length of a label before we attempt to split it.

splitchar the character added when splitting a label.

If len(label) > splitlen, we attempt to split the label in the middle by adding splitchar. The label is split as close to the middle as possible while splitting at a space.

No splitting as label length less than splitlen

>>> SplitLabel('WT virus 1', 10, '\n')
'WT virus 1'

Splitting of this label

>>> SplitLabel('WT plasmid 1', 10, '\n')
'WT\nplasmid 1'

Splitting of this label

>>> SplitLabel('mutated WT plasmid 1', 10, '\n')
'mutated WT\nplasmid 1'

sequtils Module

This file contains utilities for manipulating sequences.

Performs functions such as reading / writing FASTA files, translating sequences, etc.

Written by Jesse Bloom.

sequtils.AAThreeToOne(aathree)

Converts a three letter amino acid code into a one letter code.

The single input argument is the three letter amino acid code. It can be of any case.

This function returns, in upper case, the one letter amino acid code. It raises an exception if aathree is not a valid one letter code. ‘Xaa’ is converted to ‘X’.

>>> AAThreeToOne('Ala')
'A'
>>> AAThreeToOne('cys')
'C'
>>> AAThreeToOne('Xaa')
'X'
>>> AAThreeToOne('hi')
Traceback (most recent call last):
   ...
ValueError: Invalid amino acid code of hi.
sequtils.AmbiguousNTCodes(nt)

Returns all possible nucleotides corresponding to an ambiguous code.

This method takes as input a single nucleotide character nt, which is assumed to represent a nucleotide as one of the accepted codes for an ambiguous character. Returns a list giving all possible codes for which a nucleotide might stand. Raises an exception if nt is not a valid nucleotide code.

>>> AmbiguousNTCodes('N')
['A', 'T', 'G', 'C']
>>> AmbiguousNTCodes('R')
['A', 'G']
>>> AmbiguousNTCodes('A')
['A']
>>> AmbiguousNTCodes('-')
['-']
>>> AmbiguousNTCodes('F')
Traceback (most recent call last):
   ...
ValueError: Invalid nt code of "F"
sequtils.CondenseSeqs(seqs, maxdiffs, exclude_positions)

Removes nearly identical protein sequences.

seqs is a list of sequences as (head, seq) 2-tuples. The sequences are assumed to be aligned.

maxdiffs specifies the maximum number of differences that a sequence can have from another sequence in order to be removed.

exclude_positions is a list of integers specifying positions at which sequence can NOT differ and still be removed. These integers are for numbering the sequences as 1, 2, ...

The method proceeds as follows:

  1. For the first sequence iterates through the rest of the sequences, and removes any that differ at <= ndiff sites, and do not differ at the sites specified by ‘exclude_position’.
  2. Repeats this process for each of the remaining sequences.

The returned variable is seqs with the removed sequences gone.

>>> seqs = [('s1', 'ATGC'), ('s2', 'ATGA'), ('s3', 'ATC-'), ('s4', 'ATGC')]
>>> CondenseSeqs(seqs, 0, [])
[('s1', 'ATGC'), ('s2', 'ATGA'), ('s3', 'ATC-')]
>>> CondenseSeqs(seqs, 1, [])
[('s1', 'ATGC'), ('s3', 'ATC-')]
>>> CondenseSeqs(seqs, 1, [4])
[('s1', 'ATGC'), ('s2', 'ATGA'), ('s3', 'ATC-')]
sequtils.DateToOrdinal(datestring, refyear)

Converts a date string to an ordinal date.

datestring is a date given by a string such as ‘2007/2/13’ (for Feb-13-2007), or ‘2007/2//’ if no day is specified, or ‘2007//’ if no day or month is specified. The ‘/’ characters can also be ‘-‘.

refdate is an integer year from the approximate timeframe we are examining which is used to anchor the datestring date on the assumption that each year has 365.25 days.

The returned value is a number (decimal) giving the date. If no day is specified, the 15th (halfway through the month) is chosen. If no month or day is specified, July 1 (halfway through the year) is chosen.

>>> print "%.2f" % DateToOrdinal('2007/4/27', 1968)
2007.32
>>> print "%.2f" % DateToOrdinal('2007/4/', 1968)
2007.29
>>> print "%.2f" % DateToOrdinal('2007//', 1968)
2007.50
>>> print "%.2f" % DateToOrdinal('2007-4-27', 1968)
2007.32
sequtils.FindMotifs(seq, motif)

Finds occurrences of a specific motif in a nucleotide sequence.

seq is a string giving a nucleotide sequence.

motif is a string giving the motif that we are looking for. It should be a string of valid nucleotide characters: * A Adenine * G Guanine * C Cytosine * T Thymine * U Uracil * R Purine (A or G) * Y Pyrimidine (C or T) * N Any nucleotide * W Weak (A or T) * S Strong (G or C) * M Amino (A or C) * K Keto (G or T) * B Not A (G or C or T) * H Not G (A or C or T) * D Not C (A or G or T) * V Not T (A or G or C)

The returned variable is a list motif_indices of the indices that each occurrence of motif in seq begins with. For example, if there is a motif beginning at seq[7], then 7 will be present in motif_indices. So the number of occurrences of the motif will be equal to len(motif_indices).

This function is not case sensitive, so nucleotides can be either upper or lower case. In addition, T (thymine) and U (uracil) nucleotides are treated identically, so the function can handle either DNA or RNA sequences.

>>> FindMotifs('ATCGAA', 'WCGW')
[1]
sequtils.GetEntries(namelist, fastafile, allow_substring=False)

Gets selected entries from a (potentially very large) FASTA file.

This method is designed to extract sequences from a FASTA file. It will work even if the FASTA file is very large, since it avoids reading the entire file into memory at once.

namelist specifies the “names” of the sequences that we want to extract from the FASTA file. The “name” of a sequence is the string immediately following the “>” in the FASTA file header for a sequence, terminated by a space character (space, tab, or return). So for example, the header:

>E_coli_thioredoxin: the thioredoxin protein from E. coli

would correspond to a name of “E_coli_thioredoxin”. ‘namelist’ specifies a list of these names.

fastafile is the name of a FASTA file that contains the sequences we are searching for. For this method to be guaranteed to work properly, each sequence in the FASTA file must contain a unique name, where a “name” is as defined above. Note that this uniqueness of names is not rigorously checked for, so if there are not unique names, the function may raise an exception, or it may continue along and give no hint of the problem.

allow_substring is an optional Boolean switch that specifies that the name given in namelist need only be a substring of the first entry in the fastafile header.

The function expects to find exactly one entry in fastafile for each name listed in namelist. If it does not, it will raise an exception. The returned variable is a list composed of 2-tuples. Element i of this list corresponds to the name given by namelist[i]. Each 2-tuple has the form ‘(header, sequence)’ where ‘header’ is the full FASTA header, but with the leading “>” character and any trailing linebreaks/spaces removed. ‘sequence’ is a string giving the sequence, again with the trailing linebreak removed.

sequtils.GetSequence(header, headers_sequences)

Gets a particular sequence based on its header name.

header specifies the name of a sequence’s FASTA header.

headers_sequences is a list of tuples ‘(head, seq)’ as would be returned by Read.

This function searches through headers_sequences and returns the sequence corresponding to the first header found that matches the calling argument header. If no such header is found, raises an exception.

sequtils.PurgeAmbiguousDNA(headers_sequences)

Removes all sequences with ambiguous positions from nucleotide sequences.

This function takes a single calling argument headers_sequences, which is a list of tuples (header, seq) as would be returned by Read. These sequences should specify nucleotide sequences. It returns a new list which is a copy of headers_sequences, except that all sequences that contain ambiguous nucleotide entries (i.e. characters that are ‘A’, ‘T’, ‘C’, ‘G’, ‘a’, ‘t’, ‘c’, or ‘g’) have been removed.

sequtils.PurgeDuplicates(headers_sequences)

Removes all duplicate sequences and those that are substrings.

This function takes a single calling argument headers_sequences, which is a list of tuples (header, seq) as would be returned by Read. It returns a new list in which each sequence appears exactly once. If a sequence appears more than once in the original calling list, the sequence that is kept is the first one encountered in the list. Sequences that are substrings of others are also removed.

The ordering of sequences is preserved except for the removal of duplicates. Sequences are returned as all upper case.

>>> PurgeDuplicates([('seq1', 'atgc'), ('seq2', 'GGCA'), ('seq3', 'ATGC')])
[('seq1', 'ATGC'), ('seq2', 'GGCA')]
>>> PurgeDuplicates([('seq1', 'atgcca'), ('seq2', 'TGCC')])
[('seq1', 'ATGCCA')]
sequtils.Read(fastafile)

Reads sequences from a FASTA file.

fastafile should specify the name of a FASTA file.

This function reads all sequences from the FASTA file. It returns the list headers_seqs. This list is composed of a 2-tuple ‘(header, seq)’ for every sequence entry in FASTA file. ‘header’ is the header for a sequence, with the leading “>” and any trailing spaces removed. ‘seq’ is the corresponding sequence.

sequtils.ReverseComplement(heads_seqs)

Converts nucleotide sequences to their reverse complements.

The single input argument heads_seqs is a list of sequences in the format of tuples (head, seq) where head is the header and seq is the sequence. The sequences should all be nucleotide sequences composed exclusively of A, T, C, or G (or their lowercase equivalents). Ambiguous nucleotide codes are currently not accepted.

The returned variable is a copy of heads_seqs in which the headers are unchanged but the sequences are converted to reverse complements.

>>> ReverseComplement([('seq1', 'ATGCAA'), ('seq2', 'atgGCA')])
[('seq1', 'TTGCAT'), ('seq2', 'TGCcat')]
>>> ReverseComplement([('seq1', 'ATGNAA')])
Traceback (most recent call last):
    ...
ValueError: Invalid nucleotide code.
sequtils.Translate(headers_sequences, readthrough_n=False, readthrough_stop=False, truncate_incomplete=False, translate_gaps=False)

Translates a set of nucleotide sequences to amino acid sequences.

This function takes as input a single calling argument header_sequences, which is a list of tuples (header, seq) as would be returned by Read. The sequences should all specify valid coding nucleotide sequences. The returned variable is a new list in which all of the nucleotide sequences have been translated to their corresponding protein sequences, given by one letter codes. Stop codons are translated to ‘*’. If any of the nucleotide sequences do not translate to valid protein sequences, an exception is raised.

The optional argument readthrough_n specifies that if any nucleotides in the sequence are equal to to an ambiguous nt code and cannot therefore be unambiguously translated into an amino acid, we simply translate through these nucleotides by making the corresponding amino acid equal to “X”. By default, this option is False. Note that even when this option is False, certain ambiguous nucleotides may still be translatable if they all lead to the same amino acid.

The optional argument readthrough_stop specifies that if we encounter any stop codons, we simply translation them to ‘*’. By default, this option is False, meaning that we instead raise an error of an incomplete stop codon.

The optional argument truncate_incomplete specifies that if the sequence length is not a multiple of three, we simply truncate off the one or two final nucleotides to make the length a multiple of three prior to translation. By default, this option is False, meaning that no such truncation is done.

The optional argument translate_gaps specifies that a codon containing ‘-‘ is translated to ‘-‘.

>>> Translate([('seq1', 'ATGTAA'), ('seq2', 'gggtgc')])
[('seq1', 'M*'), ('seq2', 'GC')]
>>> Translate([('seq2', 'GGNTGC')])
[('seq2', 'GC')]
>>> Translate([('seq2', 'NGGTGC')])
Traceback (most recent call last):
   ...
ValueError: Cannot translate codon NGG
>>> Translate([('seq2', 'NGGTGC')], readthrough_n=True)
[('seq2', 'XC')]
>>> Translate([('seq2', 'TAATGC')])
Traceback (most recent call last):
   ...
ValueError: Premature stop codon
>>> Translate([('seq2', 'TAATGC')], readthrough_stop=True)
[('seq2', '*C')]
>>> Translate([('seq2', 'TGCA')])
Traceback (most recent call last):
   ...
ValueError: Sequence length is not a multiple of three
>>> Translate([('seq2', 'TGCA')], truncate_incomplete=True)
[('seq2', 'C')]
>>> Translate([('seq2', 'TGC---')])
Traceback (most recent call last):
   ...
ValueError: Cannot translate gap.
>>> Translate([('seq2', 'TGC---')], translate_gaps=True)
[('seq2', 'C-')]
sequtils.UnknownsToGaps(headers_sequences)

Converts all unknown ambiguous amino acids to gap characters.

This function converts all of the common codes for unknown amino acids to gaps. That is, “B” (Asp or Asn), “Z” (Glu or Gln), and “X” (any amino acid) are all converted to the character “-”, which is usually taken to denote a gap. You would use this method if you are subsequently processing the sequences with a program that recognizes the gap character but not these three other unknown characters. Note, of course, that you are changing your sequence, so don’t do this unless you need to.

This method takes as a single calling variable the list header_sequences as would be returned by Read. It returns a copy of this list, but with all unknown amino acids replaced by “-”.

sequtils.Write(headers_seqs, filename, writable_file=False)

Writes sequences to a FASTA file.

headers_seqs is a list of 2-tuples specifying sequences and their corresponding headers. Each entry is the 2-tuple (header, seq) where header is a string giving the header (without the leading “>”), and seq is the corresponding sequence.

filename is a string that specifies the name of the file to which the headers and sequences should be written. If this file already exists, it is overwritten.

writable_file is a Boolean switch specifying that rather than filename giving a string specifying the name of a file to which the sequences should be written, it instead specifies a writable file object to which the sequences should be written.

The sequences are written to the file in the same order that they are specified in headers_seqs.

sequtils.WriteNEXUS(seqs, outfile, seqtype)

Write sequences to a NEXUS file.

seqs is a list of 2-tuples specifying sequences as (name, sequence). The sequences should all be aligned so that they are of the same length.

outfile is the name of the output file we are writing, such as ‘file.nex’

seqtype is the sequence type, such as ‘PROTEIN’ or ‘DNA’

tree Module

Module establishing classes and functions for representing phylogenetic trees.

Written by Jesse Bloom.

Functions defined in this module

ApplyToNodes : Applies a function recursively to all descendents of a node.

GetBalancedIndex : Finds the index of balancing parentheses character.

GetPath : Returns the path along a tree between two nodes.

GetInfo : Gets information from a BEAST format information bracket.

TallySiteChangesAlongTree : Counts times a site has changed along a tree.

SplitNewick : Splits a bifurcating tree into its next two branches.

RecursivelyAssignDescendents : Assigns descendents to a node of a tree.

RecursivelySetNumbers : Sets the numbers for nodes of a tree.

RecursivelySetTipAndInternalNumbers : Sets numbers for tip and internal nodes.

GetNode : Gets a specific node from a tree.

ListsOfNodes : Returns lists of all internal and tip nodes in a tree.

QueryTree : Gets information about a particular node of a tree.

AssignTimes : Assigns times before the most recent tip to all nodes of a tree.

RecursivelyAssignTimes : Assigns times to all descendents of a node.

PrintAllTreeInformation : Prints information about nodes, names, sequences.

Classes defined in this module

Tree : A class for representing a phylogenetic tree (composed of Node objects).

Node : A class for representing the node objects that compose a Tree.

C-extensions for this module

Some of the functions in this module are implemented in faster C code in the mutpath.ctree module. If this module is available, by default these faster implementations are used. The C implementations are automatically called when calling the Python function, so there is no need to call functions in mutpath.ctree directly.

The following functions can utilize fast C extensions:

  • GetBalancedIndex

Details of functions and classes

Detailed descriptions of the functions and classes are given in their individual docstrings below.

tree.ApplyToNodes(node, F)

Applies a function recursively to all descendents of a node.

node is a Node object.

F is a function that takes as input a node. This function applies F to node and to all of its descendents.

Typically, you will call this function with node as the root node of a tree, and then the function will be applied to the whole tree.

tree.AssignTimes(t)

Assigns times to all nodes of a tree.

t is a tree.Tree object for which we are assigning node times.

At the conclusion of this function, all nodes of t have times assigned in their node.info attributes under the key ‘time’. The value for this key is a number that gives the time for that node. Times are measured backwards from the most recent tip node. So the most recent tip node has a time of 0.0, and the root node has the largest value of time, which is the total height of the tree.

In addition, all nodes will also have a key in their node.info attribute of ‘time_since_root’ which gives the time of the node since the root. The value of this key is 0.0 for the root node, and is the total height of the tree for the most recent tip node.

tree.GetBalancedIndex(s, i, use_ctree=True)

Finds the index of the balancing parentheses character.

s is a string.

i is an integer index (0 <= i < len(s)). s[i] must be one of the characters “(”, ”)”, “[”, “]”, “{”, or “}” – otherwise an exception is raised.

use_ctree specifies that we use the faster implementations in the ctree C extension if available. Is True by default; if you set to False will use the pure Python code.

If s[i] is start bracket character such as “(”, then moves forward in the string to find and return the index of the balancing bracket character. If s[i] is an end bracket character, then moves backward in the string to find and return the index of the balancing bracket character. Raises an exception of the character cannot be balanced.

Examples:

>>> GetBalancedIndex('hi (what did you say)!', 3)
20
>>> GetBalancedIndex('hi (what did you say)!', 20)
3
>>> GetBalancedIndex('(507:12.32[&h="A,(B)"]),(234)', 0)
22
>>> GetBalancedIndex('(507:12.32[&h="A,(B)"]),(234)', 10)
21
tree.GetInfo(s)

Gets information from a information bracket.

s is a string that defines a valid information bracket in the BEAST annotation format. That is, it must start with ‘[&’ and end with ‘]’. These two brackets must represent balanced parentheses. There can then be an arbitrary number of pieces of information defined, in the form key=value where value is either a number or a string.

Returns a dictionary keyed by the keys as strings, and with the values as strings. If the value represents an integer, is converted into one. If the value represents a float, is converted into one. Otherwise, is kept as a string. If the string is enclosed in brackets, returns everything in the brackets.

Currently will not handle keys or values that contain the “=” character, and will not handle values with unbalanced parentheses. Raises a ValueError if the string cannot be parsed in this fashion. Also will raise a ValueError if it finds duplicate keys.

Example:

>>> s = '[&history_1={A:G:2.33},seq="ATGC",d=10.23,i=9]'
>>> GetInfo(s) == {'history_1':'{A:G:2.33}', 'seq':'ATGC', 'd':10.23, 'i':9}
True
tree.GetNode(t, key, value)

Gets a specific node from a tree.

t is a tree.Tree object.

key and value are used to specify how we locate the node. For nodes in the tree t, we look through the node.info attribute for each node for a key that matches key and specifies a value equal to value. This function returns the first node it finds where node.info[key] exists and specifies value. If no such node is found, returns None.

tree.GetPath(startnode, endnode, markpath=None)

Returns the path along the tree between two nodes.

startnode and endnode are two nodes from the same Tree object. This function traces the path from startnode to endnode.

The returned variable is 2-tuple of the form (commonancestor, pathlist) where commonancestor’ is the last common ancestor node, and *pathlist is a list of the form [startnode, node1, node2, ..., endnode] where node1, node2, etc represent intermediate nodes between startnode and node1. Initially, each node in the list will be the immediate ancestor of its predecessor in the list, until we have traced back to the last common ancstor. After that, each node in the list will be the descendent of its predecessor in the list.

markpath gives us the option to mark the nodes and branches by writing in their information dictionaries to indicate they are on the path. If markpath is set to a non-null string, then for each node and branch on the path, an entry is created in the dictionary keyed by that string, and with the integer value of one.

tree.ListsOfNodes(node, tip_list, internal_list)

Function that returns lists of all tip and internal nodes in a tree.

Typically you will call this function with node as a Node object giving the root node of a tree, and tip_list and internal_list as set to empty lists.

On return, tip_list is a list of all tip nodes in the tree with node as its root node. internal_list is a list of all internal nodes in the tree with node as its root.

If the method is called with node equal to something other than the root node, then the lists are of all nodes in the subtree with root node.

If the method is called with tip_list and internal_list as non-empty lists, then the tip and internal nodes are merely appended to the existing entries of these lists.

This function works by recursively calling itself.

class tree.Node(root=False, tip=False, name=None, ancestor=None, leftdescendent=None, rightdescendent=None, ancestorbranch=None, leftbranch=None, rightbranch=None, number=None, info={}, ancestorbranchinfo={}, rightbranchinfo={}, leftbranchinfo={})

Bases: object

Class for representing nodes of a phylogenetic tree.

Currently only competent for representing nodes on a bifurcating tree.

The calling arguments for creation of a Node object are specified in the __init__ method.

A node object n has the following public attributes:

  • n.tip : True if the node is a tip, False otherwise.
  • n.root : True if the node is a root node, False otherwise.
  • n.number : an integer that can be assigned to the node as an identifier. It is None of no integer has been set. Typically, this value is used to uniquely identify nodes in a tree, and is set during construction of the tree.
  • n.name : a string giving the name of a node, or None if no name exists.
  • n.info : a dictionary keyed by arbitrary strings, with values giving the corresponding value for that node property.
  • n.ancestor : the node that is the ancestor of this node, or None if the node has no ancestor. Typically there will be no ancestor if n.root is True.
  • n.rightdescendent : node that is the right descendent of this node, or None if there is no right descendent. Typically there will be no right descendent if n.tip is True.
  • n.leftdescendent : like n.rightdescendent, but for the left descendent.
  • n.ancestorbranch : the length of the branch to the ancestor, or None if no such length is set.
  • n.rightbranch : the length to the branch to the right descendent, or None if no such length is set.
  • n.leftbranch : like n.rightbranch, but for the left ancestor.
  • n.ancestorbranchinfo : a dictionary keyed by arbitrary strings, with values giving the corresponding value for the ancestral branch.
  • n.rightbranchinfo : like n.ancestorbranchinfo, but for the right branch.
  • n.leftbranchinfo : like n.ancestorbranchinfo, but for the left branch.
tree.PrintAllTreeInformation(tree, f=<open file '<stdout>', mode 'w' at 0x10a690150>)

Prints information about all nodes, names, and sequences in a tree.

tree is a Tree object specifying a phylogenetic tree.

f is a writeable filelike object. By default, it is set to sys.stdout. You can also set it to some other writeable file.

This function writes to f a FASTA file containing an entry for every node in tree. The headers begin with the node’s location in the tree, then whether this is an interior or tip node, and finally the node’s assigned name. The sequence is the value in node.info[‘sequence’] for each node. Each node is assumed to have an entry with this key of ‘sequence’ in its node.info dictionary. The nodes location is written as “ROOT” if the node is a root node, and otherwise as a string of comma delimited “LEFT” and “RIGHT” characters indicating how the node is reached from the root node. If no name is assigned for the node, the name is printed as “None”. For example, the function might write:

>ROOT INTERIOR None 
None
>LEFT INTERIOR None
None
>LEFT,LEFT TIP A/WSN/33 (H1N1)
MTAGCKL
>LEFT,RIGHT TIP A/PR/8/34 (H1N1)
MTAGNL
>RIGHT TIP A/Aichi/1968 (H3N2)
MSGGNL
tree.QueryTree(tree, nodelocation)

Gets information about a particular node of a tree.

tree is a Tree object specifying a phylogenetic tree.

nodelocation is a string giving the location of the node that we want to query from the tree. It is specified in terms of tracing the tree forward from its root node. A node location of ‘’ (empty string) refers to the root node. A node location of ‘right’ refers to the right descendent of the root node. A node location of ‘right, left, left’ refers to the node reached by beginning at the root node, taking the right descendent node, then this node’s left descendent, then that node’s left descendent.

If nodelocation fails to specify an existing node of the tree, then an exception is raised.

If nodelocation specifies an existing node, the function returns the 3-tuple: (tip, name, sequence) where:

  • tip is True if the node is a tip node, and False otherwise.
  • name is the string giving the name of the node, or None if no name is assigned to the node.
  • sequence is a string giving the sequence assigned to the node, or None if no sequence is assigned to this node.
tree.RecursivelyAssignDescendents(node, newick_tree, scalebranchlength, tipnames_dict, make_lengths_non_negative=False)

Recursively assigns the descendents to a node in a phylogenetic tree.

This method assigns the descendent nodes for node, and does the same for these descendent nodes, until the tips of the tree have been reached. The tips of the trees have names assigned. Branch lengths are assigned to all nodes.

CALLING VARIABLES:

node is a Node that is not a tip and has no descendents assigned.

newick_tree is a string giving the Newick tree that has node as its root. The same restrictions apply to newick_tree as those mentioned in the __init__ method of Tree.

scalebranchlength is an optional argument that allows us to rescale all branch lengths. If it has a value of None, then no rescaling is done. If it has another value, then it should be a number greater than zero. In this case, all branch lengths specified by newick_tree are multiplied by scalebranchlength.

tipnames_dict has a similar meaning as described in the __init__ method of Tree. It is used to assign properties to the tip nodes. It can still be None, meaning that no sequences are assigned. Otherwise it is a dictionary, with the keys being the nodes names and the values themselves being dictionaries giving key / value pairs to add to the node.info dictionaries. There must be a key for every tip node.

makes_lengths_non_negative is a Boolean switch specifying that we set any branch lengths that are less than zero to instead be zero. This is done if it is True.

RESULT OF CALLING THIS FUNCTION:

At the conclusion of this function, node will have its descendents assigned as specified by newick_tree. node will therefore represent the root of a tree.

tree.RecursivelyAssignTimes(node, origin_time, time_label)

For all descendents of a node, assigns the time since this node.

node is a Node object. Assigns this node an entry in node.info of with the key time_label (a string) and the value origin_time (a number). Then recursively assigns each descendent node a time equal to origin_time plus the length of the branch to that node.

Typically, you might call this method with node equal to the root node of a tree, origin_time equal to 0.0, and time_label equal to “time_since_root”. Then:

node.info["time_since_root"] = 0.0

and all descendent nodes will have the time since the root specified. In addition, the function returns a single number representing the largest value of the time for any node in the subtree. This number therefore represents the total height of the subtree rooted at node (if origin_time is zero on the call).

tree.RecursivelySetNumbers(node, number)

Function for setting the numbers for nodes of a tree.

Calling this function on the root node of a tree will set unique numbers for all nodes of the tree.

A node’s number is the value of node.number.

Calling this function with node as the root node and number set to zero will return an integer giving the number of nodes in a tree (tip and internal nodes combined). Typically, that is how you will use the function:

RecursivelySetNumbers(tree.GetRoot(), 0)

On call, node should be a Node object.

number should be an integer that is the number that is set for node.

Sets a number for node, and also recursively sets numbers for any of its descendents. The returned value is the next number (number + 1) that is not set for any node.

tree.RecursivelySetTipAndInternalNumbers(node, tip_number, internal_number)

Function for setting numbers for tip and internal nodes.

The node numbers are the values accessed by node.number. Any existing values for these numbers are cleared.

Typically this function will initially be called with node as the root node of a tree, and tip_number and internal_number both equal to zero. In this case, all of the tip nodes in the tree will be assigned integer numbers going from 0, 1, ... All of the inernal nodes in the tree will also be assigned integer numbers going from 0, 1, ... Crucially, the numbers for the internal nodes are assigned so that if nodej is a descendent of nodei, then nodej.number < nodei.number. This leads to the implication that the root node will have the smallest internal node number. Any existing numbers for the nodes are cleared. Note that it is NOT the case that all nodes in the tree have unique numbers; all tip nodes have unique numbers, and all internal nodes have unique numbers.

The returned value is the 2-tuple (n_tips, n_internal) where n_tips is the number of tip nodes and n_internal is the number of internal nodes.

In its implementation, this function works by recursively calling itself. In these recursive calls, the values of tip_number and internal_number are incremented as nodes are assigned numbers – the value of each of these corresponds to the next number that will be assigned to the next encountered tip or internal node.

tree.SplitNewick(newick_tree)

Splits a bifurcating phylogenetic tree into its next two branches.

newick_tree is a string specifying a sub-tree of a Newick tree. If the string specifies the full tree, the trailing semicolon and the parentheses surrounding the root must already be removed. In other words, the tree:

(((One:0.2,Two:0.3):0.3,(Three:0.5,Four:0.3):0.2):0.3,Five:0.7):0.0;

is not acceptable, but the tree:

((One:0.2,Two:0.3):0.3,(Three:0.5,Four:0.3):0.2):0.3,Five:0.7

is acceptable. The same limitations to the Newick tree apply as are listed in the documentation string for the __init__ method of Tree.

Returns the 2-tuple (left_branch, right_branch). The elements of this tuple specify the left and right branches of the tree, respectively. Each branch is itself a 5-tuple of the form (branch, length, tip, node_info, branch_info), where:

  • branch is a string giving the subtree if this branch is a subtree, or the name of the node if it is a tip.
  • length is a number giving the length of the branch leading to this tip or subtree.
  • tip is True if the branch is a tip or False otherwise.
  • node_info gives any information specified for the node.
  • branch_info gives any information specified for the branch.

Example:

>>> newick_tree = '((One:0.2,Two:0.3):0.3,(Three:0.5,Four:0.3):0.2):0.3,Five:0.7'
>>> (left_branch, right_branch) = SplitNewick(newick_tree)
>>> right_branch == ("Five", 0.7, True, {}, {})
True
>>> left_branch == ("(One:0.2,Two:0.3):0.3,(Three:0.5,Four:0.3):0.2", 0.3, False, {}, {})
True
tree.TallySiteChangesAlongTree(t)

Counts the number of times a site has changed along a reconstructed tree.

Takes as input a single argument t which shuld represent a Tree object. This tree must be rooted, and must have sequence assigned to both all tip nodes and all interior nodes. That is, the interior sequences must already be reconstructed by some method. These sequences should all be the same length (i.e., they should be aligned).

This method traverses along the tree starting at the root. For each site, it counts the number of times that the identity of that site has changed, and what these changes have been. It returns a dictionary tally keyed by sequence numbers (ranging from 1 to the length of the sequence). Entry tally[i] is the 2-tuple (n, changes). n is the total number of times site i is inferred to have changed identities. changes is itself a dictinary. For each transition from x to y at site i, changes[(x, y)] counts the number of times that transition occurred.

class tree.Tree(newick_tree, scalebranchlength=None, tipnames_dict=None, make_lengths_non_negative=False)

Bases: object

Class for representing a phylogenetic tree.

Currently only works for representing a rooted bifurcating tree.

The tree is composed of Node objects.

Calling arguments are documented in detail in the __init__ method.

GetNewickTree()

Returns the Newick string representing the original input tree.

GetRoot()

Returns the Node object that is the root of the tree.

WriteNewick(nodeinfo_keys={}, nodeinfo_res={}, branchinfo_keys={}, branchinfo_res={})

Writes the Newick string corresponding a tree.

This is NOT the original Newick string used to create the tree (for for that use the GetNewickTree method). Rather, nodes for this tree are labeled according to the arguments nodeinfo_keys and nodeinfo_res and branches for this tree are formatted according to branchinfo_keys and branchinfo_res.

These four arguments are all dictionaries. For any node or branch that has comments (specified in the node.info dictionary) with a key that matches a key in nodeinfo_keys or branchinfo_keys, a comment is printed in the tree. In addition, any comment with a key that matches a regular expression that is a key in nodinfo_res or branchinfo_res is printed. The values corresponding to the keys can be None, in which case the method tries to print the correct form of the value (will work for strings, integers, and floats). It can also be a function, in which case the method prints the string, integer, or float returned by that function which takes as input the node or branch information dictionary in question. If the function returns None, then nothing is printed for that value.

parse_tree Module

Module for various functions for parsing phylogenetic trees.

This method is designed to take as input the text representations of phylogenetic trees produced by programs such as BEAST, and convert them to a form represented by the Tree and Node objects defined in the tree module.

Written by Jesse Bloom.

List of functions defined in this module

TraceMutationPath : Finds the mutational path between two sequences on a tree.

AssignMutations : Assigns mutations to node branches.

BuildSeqsFromMutsAndTips : Builds node sequences from tip and mutation information.

RecursivelyAssignSeqsFromMuts : assigns sequences using parental sequence and mutations. GetTreeString : Gets the newick tree from a BEAST .tree file line.

ReadTreeTranslateBlock : Reads name translation block from .tree file.

Detailed documentation for functions

Detailed documentation for the functions is provided in their individual docstrings below.

parse_tree.AssignMutations(node)

Assigns mutations to node branches.

CALLING VARIABLES

  • nodes is a *tree.Node object. Typically, you would call this function with node as the root node of a tree. The mutations should be specified in the rightbranchinfo and leftbranchinfo attributes of the nodes using the key ‘history_all’ as detailed below.

RESULT OF THIS FUNCTION

This function recursively proceeds down the tree descended from node and assigns mutations to the branches.

The rightbranchinfo and leftbranchinfo for each node are assumed to hold the information about the mutations. Specifically, if these dictionaries have an entry keyed by the string ‘history_all’, then this history should list all mutations (the absence of such a key indicates no mutations). These should be listed in forms such as the following:

node.rightbranchinfo['history_all'] = '{{286,42.2151,A,S},{353,40.2638,S,C}}'

where the first number is the position (residue or nucleotide number) in 1, 2, ... numbering; the second number is the time of the mutation given in units such that the most current tip node has a value of 0 and the tree root has a value equal to the tree height, and the next numbers are the wildtype and then the mutant amino acids or nucleotides. This function creates an entry in the branchinfo dictionary that gives the following list, with the entries sorted from largest to smallest, which means that the list gives the mutations in chronological order since the largest times correspond to those in the most distant past:

node.rightbranchinfo['mutations'] = [(42.2151, 'A', 286, 'S'), (40.2638, 'S', 353, 'C')]

where specifically each entry in the list is the time since the root node, the wildtype identity, the position, and the mutant identity. If there are no mutations, an empty list is added for ‘mutations’.

parse_tree.BuildSeqsFromMutsAndTips(t, allowambiguousmismatches)

Builds node sequences from tip and mutation information.

This function takes a single argument t specifying a tree.Tree object. In order for this method to work, all nodes must have mutations assigned to their branchinfo dictionaries under the key ‘mutations’ in the format given by AssignMutations. In other words, you should have already called:

AssignMutations(t.GetRoot())

In addition, all of the tip nodes of the tree should have their sequences (does not matter if these are protein or nucleic acid) specified in their node.info dictionaries under the key ‘sequences’.

This method uses the tip sequences and the mutation information to determine the sequences for all nodes of the tree, and places these in the node.info dictionary under the key ‘sequence’. Furthermore, it makes sure that the tip sequences and mutations specified in the tree already are consistent, and raises an exception if they are not.

First, it chooses a tip node and uses that to reconstruct the root node by tracing back up the tree to the root. Then it starts from this root node and recursively reconstruct all interior nodes and to check all tip nodes.

allowambiguousmismatches is a list or set specifying any ambiguous characters that we might allow in tip nodes. In general, if a tip node contains an ambiguous character, then during the sequence building a mismatch gaps as ambiguous characters, then during the sequence building a mismatch can occur. Any character that is present in allowambiguousmismatches does not count as a mismatch. For protein sequences, you might want::

allowambiguousmismatches = ['-', 'X']

to allow ambiguous amino acids and gaps. For nucleotide sequences, you might want:

allowambiguousmismatches = ['-', 'N']

to allow ambiguous nucleotides and gaps.

parse_tree.GetTreeString(line, replace_with_null=[])

Gets the newick tree from a line from a BEAST .tree file.

line is a string giving a line representing a tree from a BEAST tree file.

replace_with_null is a list of regular expression patterns that we wish to remove from the tree string. The replacements are done in the order that the items are specified. They are replaced with ‘’ (nothing).

Returns the newick tree portion of line only, with all patterns matching those listed in replace_with_null having been removed.

parse_tree.ReadTreeTranslateBlock(treefile)

Reads name translation block from .tree file.

treefile is a string giving the name of a BEAST .trees file.

A .trees file (such as that created by BEAST) contains a block called ‘Translate’ under the ‘Begin trees’ heading, such as:

Begin trees;
    Translate
        1 'A/Aichi/2/1968_1968.50',
        2 'A/Akita/1/94_1994.50',
        3 'A/nt/60/1968_1968.50'
        ;

This block is assigning the numbers used in the Newick strings to the full strain names.

This function returns a dictionary keyed by the integer strain codes and with the values being the strings giving the full name. The integer strain codes are given as strings, not numbers. For example the above block would return:

{'1':'A/Aichi/2/1968_1968.50', '2':'A/Akita/1/94_1994.50', '3':'A/nt/60/1968_1968.50'}
parse_tree.RecursivelyAssignSeqsFromMuts(node, allowambiguousmismatches)

Assigns sequences using parental sequence and specified mutations.

node is a tree.Node object. node must have the following keys in its attribute dictionaries:

  • node.info must have the key ‘sequence’ and a value specifying a nucleotide or protein sequence for the node.
  • Unless node is a tip node (which is the case if node.tip == True), then node.rightbranchinfo and node.leftbranchinfo must both have the key ‘mutations’ specifying mutations in the format written by AssignMutations. These mutations should be numbered in 1, 2, ... numbering of the sequence specified under ‘sequence’ in the node.info attribute.

If node is a tip node (node.tip == True), this function does nothing. Otherwise, it uses the sequence specified in node.info and the mutations specified in node.rightbranchinfo and node.leftbranchinfo to build the sequences for the right and left descendents of node, and places these sequences under the key ‘sequences’ in the info attributes of these descendents. If a ‘sequences’ key already specifies a sequence for these descendents, then checks to make sure the sequences built from the ancestor and mutations matches this sequence – if they do not, raises an exception. This function then recursively calls itself on the descendent nodes.

If you call this function with node as the root node of a tree, this method will build the sequences for all nodes of the tree.

allowambiguousmismatches has the same meaning as in BuildSeqsFromMutsAndTips.

parse_tree.TraceMutationPath(t, startstrain, endstrain)

Finds the mutational path between two sequences along a phylogenetic tree.

Given a phylogenetic tree t with mutations assigned to branches and sequences assigned to nodes, this function traces the path along the tree from the tip node startstrain to the tip node endstrain and reconstructs the sequences along the way. This mutational path through sequence space is written to pathfile.

CALLING VARIABLES:

  • t is a phylogenetic tree as a tree.Tree object. All nodes should have mutations assigned to their branches in their node.ancestorbranchinfo, node.rightdescendentinfo, node.leftdescendentinfo attributes under the key ‘mutations’. In addition, all nodes should have sequences assigned in their node.info under the key ‘sequence’. In addition, all nodes should have their time preceding the most recent tip assigned in their node.info under the key ‘time’. In practice, you can ensure this is true by performing the following two function calls before calling this function:

    parse_tree.AssignMutations(t.GetRoot())
    parse_tree.BuildSeqsFroMutsAndTips(t, allowambiguousmismatches)
    tree.AssignTimes(t)
    

    In addition, the tip nodes should have names assigned in their node.info dictionaries under the key ‘strain’. These names are used to locate the tip nodes corresponding to startstrain and endstrain.

  • startstrain and endstrain are strings giving the names of the tip nodes that form the starting and ending sequences for the mutational path. There must be a unique tip node that has a node.info key ‘strain’ giving startstrain as its value – this is the starting sequences. The same is true for endstrain.

RESULT OF CALLING THIS FUNCTION

This function returns a string specifying the mutational path and the timing of mutations. All times are measured in units before the most recent tip node (so the root node would have the largest value of time). Here is an example of a path that might be produced:

startstrain_name A/Aichi/2/1968_1968.50
startstrain_seq ATGGCAATGGGCTAA 
startstrain_time 42.5
endstrain_name A/Brisbane/10/2007_2007.10
endstrain_seq ATGACGATTGGATAA
endstrain_time 3.9
commonancestor_seq ATGGCGATGGGCTAA
commonancestor_time 43.12713
startstrain_to_commonancestor_path
A6G : 42.713     
commonancestor_to_endstrain_path
G9T : 31.732
G4A : 25.1343
C12A : 10.134

In this output string, the first three entries give the name, sequence, and time for startstrain. The next three do the same for endstrain. The next two entries give the sequence and time of the most recent common ancestor of startstrain and endstrain. The mutational path is then specified, first tracing from startstrain to the common ancestor (so backwards along the tree), and then tracing from the common ancestor the endstrain. In each of these sections of the path, each line specifies a different mutation in 1, 2, ... numbering, followed by a colon and the time of the mutation (again measured backwards from the most recent tip).

trajectory Module

Module for representing mutational trajectories as directed graphs.

Represents mutational trajectories through sequence space, which is the space in which each node is a unique sequence and edges are directional connections between nodes corresponding to mutations.

These digraphs can be used to visualize mutational trajectories through sequence space. They are designed to be visualized using the GraphViz program.

Written by Jesse Bloom.

Functions defined in this module

WriteGraphVizTrajectory - Writes a GraphViz visualization of a Trajectory object.

WriteMutationDates - Writes files giving mutations dates and credible intervals.

WriteNodePersistence - Writes times that nodes persist (time before next mutation).

DistanceAlongPath - Distance along a mutational path.

HeuristicTraceback - Tries to find the last high weight predecessor of a node.

IteratePaths - Iterates of paths in a mutational path file.

Classes defined in this module

Trajectory - A class for representing a mutational trajectory through sequence space.

Detailed documentation for functions

Provided in the individual function documentation strings below.

trajectory.DistanceAlongPath(startseq, endseq, s)

Returns the distance along the mutational path.

This distance of a sequence s along the path is defined as the Hamming Distance between startseq and s minus the Hamming Distance between endseq and s plus the Hamming distance between startseq and endseq.

Sequences are not case sensitive.

trajectory.HeuristicTraceback(t, node, cutoff)

Traces back to find last high weight precessor of a node.

t is a Trajectory object.

node is the string name of a node in t.

cutoff is a number > 0 and <= 1.

This function starts at node, and traces back along the trajectory to return the first predecessor of node with a weight >= cutoff. It does this by tracing back from node along its highest weight incoming edge to that predecessor, and then from that predecessor along its highest weight edge, etc until we find a predecessor with weight >= cutoff. This approach is not absolutely guaranteed to find the first predecessor with weight > cutoff, but it should work for reasonable trajectories. But beware, hence the word ‘Heuristic’ in the name of this function.

The return value is the string name of the first high weight predecessor.

This function recursively calls itself.

trajectory.IteratePaths(pathfile)

Iterates over paths in a mutational path file.

pathfile should be a string giving a name of an input file specifying one or more mutational paths. These files are of the format created by mutpath_get_paths.py. The required format is detailed below.

This function will iterate over all paths in pathfile. For each path, it will return the tuple (startseq, starttime, endseq, endtime, caseq, catime, tocamuts, fromcamuts). The entries of these tuples are as follows. All sequences are converted to upper case, as are all letters in the mutation notations. The times are measured in units before the most recent tip of the tree. Tuple entries:

  • startseq is the starting sequence specified by startstrain_seq
  • starttime is the time of startseq specified by startstrain_time
  • endseq is the ending sequence specified by endstrain_seq
  • endtime is the time of endseq specified by endstrain_time
  • caseq is the common ancestor sequence specified by commonancestor_seq
  • catime is the time of caseq specified by commonancestor_time
  • tocamuts is a list of the mutations going from startseq to caseq, specified in the order they are listed in the file (should be along the path) as the 2-tuples of the form (‘A6G’, 42.713) where the entries are the mutation and then the time.
  • fromcamuts is like tocamuts, but for mutations going from caseq to endseq.

The format of pathfile is as follows. This file should list mutational paths as:

MUTPATH 1
startstrain_name A/Aichi/2/1968_1968.50
startstrain_seq ATGGCAATGGGCTAA 
startstrain_time 42.5
endstrain_name A/Brisbane/10/2007_2007.10
endstrain_seq ATGACGATTGGATAA
endstrain_time 3.9
commonancestor_seq ATGGCGATGGGCTAA
commonancestor_time 43.12713
startstrain_to_commonancestor_path
A6G : 42.713     
commonancestor_to_endstrain_path
G9T : 31.732
G4A : 25.1343
C12A : 10.134

MUTPATH 2
startstrain_name A/Aichi/2/1968_1968.50
startstrain_seq ATGGCAATGGGCTAA 
startstrain_time 42.5
endstrain_name A/Brisbane/10/2007_2007.10
endstrain_seq ATGACGATTGGATAA
endstrain_time 3.9
commonancestor_seq ATGGCGATGGGCTAA
commonancestor_time 44.12713
startstrain_to_commonancestor_path
A6G : 42.113     
G9T : 43.124
commonancestor_to_endstrain_path
G4A : 21.1343
C5A : 19.531
A5C : 19.402
C12A : 9.134

The file lists each of the paths numbered starting at 1. Within each path, the mutations are indicated with numbering starting at 1 for the first position in the sequence. The times for the mutations, the starting and ending strains, and the most recent common ancestor of these two strains, are also indicated. These times are measured in units before the most recent tip node (so the root node would have the largest value of time). The mutations must move from the starting to the ending sequence, and if multiple paths are specified, then they all must have the same starting and ending sequences.

class trajectory.Trajectory(pathfile, translateseqs=False, printprogress=False)

Bases: object

Class for representing a mutational trajectory through sequence space.

This class represents a mutational trajectory in sequence space. The trajectory is a directed graph consisting of nodes (sequences) and edges (mutations connecting nodes). The trajectory moves from one known sequence to another known sequence, passing through some number of uncertain intermediates (nodes). The trajectory is created by passing it a set of possible mutational paths from the starting to ending sequence. In the trajectory, the weight of each node corresponds to the fraction of paths that contain that sequence, while the weight of each edge corresponds to the fraction of paths that contain that edge. Note that if a path contains a node or edge more than once (which can happen if there are mutational cycles), the node or edge is still considered to have occurred once in that path for the purposes of assigning the weights.

Each Trajectory object t has the following attributes:

  • t.npaths : the number of individual paths used to construct the overall trajectory.
  • t.startseq : a string giving the starting sequence for the trajectory.
  • t.endseq : a string giving the ending sequence for the trajectory.
  • t.nodes : a dictionary keyed by strings representing the sequences for each node found at least once in the trajectory, and with values equal to the weight of that node (fraction of paths containing the node).
  • t.edges : a dictionary keyed by 2-tuples of strings (s1, s2) and values giving the weight of the directed edges from sequence s1 to s2.
  • t.mutations : a dictionary keyed by mutation strings of the form ‘G5A’ specifying mutations where the numbering is in 1, 2, ... For each mutation that occurs in at least one of the paths passed to this trajectory, there will be a key. The values are lists giving the times of occurrence for all occurrences of that mutation in the paths used to create this trajectory. If a mutation occurs more than once in a path, only the time for its first occurrence is listed. So the fraction of paths that contain some mutation m is t.mutations[m] / float(t.npaths). Note that if translateseqs is True, then the mutations specified here are only the protein mutations, not the nucleotide ones in the underlying nucleotide sequences.
  • t.mutationstoca : a dictionary keyed by mutation strings just as for t.mutations. Each mutation that is added to the lists in t.mutations can arise on the branch from the starting sequence to the common ancestor, or on the branch from the common ancestor to the ending sequence. The value of t.mutationstoca[mut] is the number of times that the first occurrence of mut is on the route from starting sequence to the common ancestor. So if mut is always on the path from the starting sequence to the common ancestor, then t.mutationstoca[mut] == len(t.mutations[mut]). If it is always on the path from the starting sequence to the ending sequence, then t.mutations[mut] == 0.
  • t.persistence : a dictionary keyed by the node sequences and with the values being a list of numbers. If a node occurs on a path, then the time for which the node sequence persisted before another mutation is listed (for the first occurrence of the node if it occurs multiple times on the same path). Note that it translateseqs is True, then these are the persistence times to the first non-synonymous mutation, as only those mutations change the node sequences. The total length of the list for each node will be equal to the number of paths that contained that node.
  • t.timetofirstmut : if translateseqs is False, then this is just equal to t.persistence. But if translateseqs is True, then the entries give the time after the occurrence of a node to the first mutation of any time – synonymous or nonsynonymous. In this case, entries in t.timetofirstmut will often be less than those in t.persistence, since the first mutation to a node will often by synonymous, which will change the nucleotide sequence but not the actual protein sequence node identity.

To create a Trajectory object t, use the command:

t = Trajectory(pathfile, translateseqs=False, printprogress=False)

pathfile should be a string giving a name of an input file specifying one or more mutational paths. These files are of the format created by mutpath_get_paths.py. They must be readable by the IteratePaths function.

translateseqs is an optional argument that is False by default. If it is set to True, then the sequences contained within mutpathfile are taken to represent coding nucleotide sequences, but the trajectory is built through protein sequence space. In other words, the nucleotide sequences in the paths are translated, and the trajectory is built from these translated sequences. All of the nodes and edges will therefore connect protein sequences. Note that no checking is made to ensure that the sequences translate properly: any stop codons are simply translated to ‘*’, codons containing gaps are translated to ‘-‘, and sequences that do not have lengths that are multiples of three have the extra one or two nucleotides truncated.

printprogress is a switch specifying that we print progress as we process paths. You may want to use this if you are processing a large number of paths and want to output the progress. By default it is False, meaning that nothing is printed. If you set it to an integer, it will then print to sys.stdout after processing every printprogress paths.

trajectory.WriteGraphVizTrajectory(t, graphvizfile, minweight, labelcutoff, nodenames=None, nodesize=0.9, ranksep=0.1, nodesep=0.2, penwidth=10, fontsize=40, arrowsize=1.4, fontname='Helvetica-Bold', rankdir='TB', startendasdiamonds=True)

Writes a GraphViz visualization of a Trajectory object.

This function creates a file graphvizfile that can be used to visualize the directed graph represented by a Trajectory object t. Graphviz is a freely available software package (http://www.graphviz.org/) for visualizing graphs. The trajectory is written in the DOT language (http://www.graphviz.org/doc/info/lang.html).

The areas of nodes and the widths of edges are proportional to their weights.

The color saturations of nodes and edges are linearly proportional to their weights.

The rank of nodes (for example, their vertical position when rankdir is ‘LR’) is ordered according to their distance along the path from the starting to ending sequence. This distance is defined as the Hamming Distance from the starting node minus the Hamming Distance from the ending node plus the Hamming distance between the starting and ending nodes.

CALLING VARIABLES:

  • t is the Trajectory object that contains the trajectory that we want to visualize.
  • graphvizfile is a string giving the name of the GraphViz input file that we want to create. It will typically end with the extension .dot. If this file already exists, it is overwritten. You should be able to open this file directly with Graphviz. The file is written in the DOT language (http://www.graphviz.org/doc/info/lang.html).
  • minweight is a number specifying the minimum weight that a node or edge must possess in order to be shown on the graph. Nodes or edges with weights < minweight are not included. Note that this creates a possibility for orphan nodes if a node has a weight >= minweight but all of its incoming and outgoing nodes have weights < minweight. To show all nodes and edges regardless of weight, set minweight to zero. However, this can sometimes lead to a very large graphvizfile since there can be a huge number of very low weight nodes / edges.
  • labelcutoff is the minimum weight that an edge must possess in order to be labeled on the graph. In addition, all nodes with weight >= labelcutoff have an incoming edge that is labeled. If there is not such an edge, then traces back to find the first predecessor node with weight labelcutoff and then draws a different colored edge spanning multiple mutations to connect these nodes. Generally, you would want labelcutoff > 0.5.
  • nodenames is an optional argument that allows you to specify names for nodes. It is None by default. If you set it to another value, it should be a dictionary. It is keyed by node sequences (which are the identifiers for the nodes in t.nodes) and the values are strings giving names that are used to label the nodes in the trajectory. These names are written on the nodes only if the weight for that node is >= labelcutoff.

OPTIONAL CALLING VARIABLES SPECIFYING FORMATTING DETAILS:

  • nodesize is the height of a node with weight.
  • ranksep is the separation between ranks, as fraction of nodesize.
  • nodesep is the minimum separation between nodes of the same rank, as fraction of nodesize.
  • penwidth is the pen width of an edge.
  • fontsize is the font size.
  • arrowsize is the size of the arrows.
  • fontname is the font style.
  • rankdir specifies the direction the ranks move. If set to ‘TB’ then the graph moves from top to bottom. If set to ‘LR’ then the graph moves from left to right.
  • startendasdiamonds is a Boolean switch. If True, we show the starting and ending nodes as diamonds rather than circles. We also make these starting and ending nodes larger in size to fit their full labels. If False, we make them circles with size proportional to weights like all other nodes.
trajectory.WriteMutationDates(t, labelcutoff, interval, datesfile, datesplot, lasttipdate)

Creates text file and plot showing dates of mutations.

For each mutation that occurs in at least labelcutoff fraction of the paths that form the trajectory t, this function writes the posterior median and a Bayesian credible interval for the date of first occurrence of that mutation. The posterior is taken over all paths that contain that mutation.

The output also provides information about whether the mutations are on the branch from the starting sequence to the common ancestor, or from the common ancestor to the starting sequence.

CALLING VARIABLES:

  • t is a Trajectory object that contains the mutations data.
  • labelcutoff is a number > 0 and <= 1. We write the dates of all mutations in t that have weights >= labelcutoff (occur in at least this fraction of the paths).
  • interval specifies the range of the Bayesian credible interval, for example a value of 0.9 means that we print the 90% credible interval.
  • datesfile is the name of the text file that we create which contains the dates and intervals. It is overwritten if it does not already exist.
  • datesplot is the name of the plot file that we create using matplotlib. This plot can only be created if matplotlib is available. So first check on this (for example using Plot.PylabAvailable(). If matplotlib is not available, or if you don’t want to make the plot, make this argument None. Otherwise make it the name of the PDF plot file that you want to create.
  • lasttipdate specifies the absolute units for the dates. Dates in t will be specified in units of time before the most recent tip. Here provide a number giving the date of the most recent tip, and the dates shown are then this number minus the time for each mutation.
trajectory.WriteNodePersistence(t, nodestowrite, interval, persistencefile, cutoff)

Writes times for which nodes persist before the next mutation.

The trajectory t specifies a set of nodes. For each node specified by nodestowrite and with weight >= cutoff, reports the time until that node experiences the next mutation that moves it to a new node. If t is a trajectory through protein sequence space that was created from nucleotide sequences (this will be the case if t was created with translateseqs = True), the next mutation that moves it to a new node is a nonsynonymous mutation. This method then also records the time until the first mutation of any type (synonymous or nonsynonymous) after the trajectory moves to nodes. The persistence is written as the posterior median from all paths containing plus the Bayesian credible interval specified by interval. The persistence times are written to the text file persistencefile.

CALLING VARIABLES:

  • t is a Trajectory object that contains the persistence data.
  • nodestowrite is a dictionary specifying for which nodes we write persistence times, and the names used when writing these nodes. It is keyed by node sequences (which are the identifiers for the nodes in t.nodes) and the values are strings giving names that are used to label the nodes in the output. However, there does not actually have to be a node with persistence data for each key in nodestowrite – if there is not persistence data for a node key, nothing is written for it.
  • interval specifies the range of the Bayesian credible interval, for example a value of 0.9 means that we print the 90% credible interval.
  • persistencefile is a string giving the name of the text file that we create which contains the persistence times. It has headers explaning its format. If this file already exists, it is overwritten. The order in which nodes is written is arbitrary.
  • cutoff is a weight cutoff (fraction of paths containing this node). We only write persistence times for nodes that are both in nodestowrite and have weights >= cutoff. This keeps us from writing persistence times for nodes that only occur rarely.

stats Module

Module for performing statistics operations operations.

List of functions

MedianCredibleInterval : Returns posterior median and centered credible interval.

Details of functions

Details of functions are listed below in their individual docstrings.

stats.MedianCredibleInterval(data, interval)

Returns posterior median and centered Bayesian credible interval.

data is a list of numbers (must contain at least one number). These are assumed to represent numbers sampled from a posterior distribution.

interval is a number with 0 < interval <= 1.

This function returns the following tuple (median, mininterval, maxinterval). In this tuple:

  • median is the median value in data. This represents the posterior median.
  • mininterval and maxinterval are numbers with maxinterval >= mininterval such that a fraction interval of that data in data are >= minterval and <= maxinterval. Furthermore, this interval is centered such that the fraction of the entries in data that are < minterval is the same as the fraction of entries that are > maxinterval. In other words, minterval and maxinterval specify the centered Bayesian credible interval.
>>> data = [0.1, 2, 1, 1.1, 2, 5, -2.1, 7.9, 2, 1.3, 8.1]
>>> (median, mininterval, maxinterval) = MedianCredibleInterval(data, 0.8)
>>> median == 2
True
>>> mininterval == 0.1
True
>>> maxinterval == 7.9
True