phylodistances Module

Module for analyzing pairwise distances generated by phyloExpCM_optimizeHyphyTree.py.

Written by Jesse Bloom.

Functions defined in this module

  • ParsePairwiseDistances : parses pairwise distances
  • YearVersusDistance : parses year versus distance

phylodistances.ParsePairwiseDistances(distancesfile, equaltol=0.001)

Parses pairwise distances between sequences.

distancesfile should give the name of an existing file containing pairwise distances between sequences. Typically this would be the type of file produced by phyloExpCM_optimizeHyphyTree.py. Each line should contain three entries delimited by tabs. The first entry is the name of sequence 1, the second entry is the name of sequence 2, and the third entry is the numerical pairwise distance between them. Here are a few example lines:

1983.14_STRAIN_A/Kentucky/UR06-0372/2007_HOST_Human_SUBTYPE_H1N1_COUNTRY_USA_n1 1957.50_STRAIN_A/Baylor/4052/1981_HOST_Human_SUBTYPE_H1N1_COUNTRY_USA_n1    2.109830328352041
1983.14_STRAIN_A/Kentucky/UR06-0372/2007_HOST_Human_SUBTYPE_H1N1_COUNTRY_USA_n1 1972.50_STRAIN_A/Nanchang/25/1996_HOST_Human_SUBTYPE_H1N1_COUNTRY_China_n1  0.8163390477352853

This function returns a dictionary keyed by 2-tuples of sequence names. For each sequence pair, two keys are generated, (seq1, seq2) and (seq2, seq1), both with the pairwise distance as the value.

Error checking is performed to make sure that any duplicate entries have the same distance within a tolerance of equaltol.
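The parsing rules above can be sketched as follows. This is an illustrative re-implementation, not the module's actual code: the name parse_pairwise_distances_sketch is hypothetical, the sketch takes an iterable of lines rather than a file name, and skipping blank lines and raising ValueError on inconsistent duplicates are assumptions.

```python
def parse_pairwise_distances_sketch(lines, equaltol=0.001):
    """Hypothetical sketch: build a symmetric distance dictionary from tab-delimited lines."""
    distances = {}
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines are ignored
        seq1, seq2, dist = line.strip().split("\t")
        dist = float(dist)
        # Store the distance under both orderings of the pair.
        for key in [(seq1, seq2), (seq2, seq1)]:
            if key in distances and abs(distances[key] - dist) > equaltol:
                # assumption: inconsistent duplicates raise ValueError
                raise ValueError("Inconsistent distances for %s and %s" % key)
            distances.setdefault(key, dist)
    return distances
```

With this sketch, a pair listed twice with distances differing by less than equaltol is accepted and the first value is kept, while a larger discrepancy raises an error.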

phylodistances.YearVersusDistance(distancesfile, startseq, include_matches, exclude_matches)

Gets a list of years of separation versus distances.

  • distancesfile should give the name of an existing file containing pairwise distances between sequences. Typically this would be the type of file produced by phyloExpCM_optimizeHyphyTree.py. Each line should contain three entries delimited by tabs. The first entry is the name of sequence 1, the second entry is the name of sequence 2, and the third entry is the numerical pairwise distance between them. Here are a few example lines:

    1983.14_STRAIN_A/Kentucky/UR06-0372/2007_HOST_Human_SUBTYPE_H1N1_COUNTRY_USA_n1 1957.50_STRAIN_A/Baylor/4052/1981_HOST_Human_SUBTYPE_H1N1_COUNTRY_USA_n1    2.109830328352041
    1983.14_STRAIN_A/Kentucky/UR06-0372/2007_HOST_Human_SUBTYPE_H1N1_COUNTRY_USA_n1 1972.50_STRAIN_A/Nanchang/25/1996_HOST_Human_SUBTYPE_H1N1_COUNTRY_China_n1  0.8163390477352853
    

    Each sequence name should begin with a number giving the year of isolation, followed by an underscore, as in the example above. This number is the year assigned to the sequence.

  • startseq should be the name of a sequence listed in distancesfile. Typically this would be a sequence for which you have computed many pairwise distances to other sequences. For instance, for the example above you might make startseq = "1983.14_STRAIN_A/Kentucky/UR06-0372/2007_HOST_Human_SUBTYPE_H1N1_COUNTRY_USA_n1". An exception is raised if startseq is not found at least once in distancesfile.

  • include_matches is a list of re regular expression objects. We consider distances from startseq to all other sequences whose names match at least one of these objects.

  • exclude_matches is a list of re regular expression objects. We do not consider distances from startseq to any other sequence whose name matches at least one of these objects. Being present in exclude_matches overrides being present in include_matches.
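The interaction between include_matches and exclude_matches might look like the sketch below. The helper name is_considered is hypothetical, and the use of the search method (rather than match) on each compiled pattern is an assumption, since the documentation does not say which is used.

```python
import re

def is_considered(name, include_matches, exclude_matches):
    """Hypothetical helper: True if name passes the include/exclude rules."""
    # Exclusion takes priority over inclusion.
    if any(m.search(name) for m in exclude_matches):
        return False
    return any(m.search(name) for m in include_matches)

# Example patterns: keep human isolates, but drop those from China.
include_matches = [re.compile("HOST_Human")]
exclude_matches = [re.compile("COUNTRY_China")]

usa = "1957.50_STRAIN_A/Baylor/4052/1981_HOST_Human_SUBTYPE_H1N1_COUNTRY_USA_n1"
china = "1972.50_STRAIN_A/Nanchang/25/1996_HOST_Human_SUBTYPE_H1N1_COUNTRY_China_n1"
usa_kept = is_considered(usa, include_matches, exclude_matches)
china_kept = is_considered(china, include_matches, exclude_matches)
```

Here the second name matches both lists, so the exclusion wins and it is not considered.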

The return value is a list of tuples of the form (yearseparation, distance, startseq, otherseq). There is an entry for every sequence in distancesfile whose name matches at least one object in include_matches and no object in exclude_matches. There is also always an entry for otherseq equal to startseq. yearseparation is a number giving the separation of the sequence dates in years (year of otherseq minus year of startseq), and distance is the distance specified in distancesfile.
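As a rough illustration of how yearseparation follows from the sequence names, the sketch below reads the year from the text before the first underscore and builds one tuple of the form described above. The helper names year_of and make_entry are hypothetical, and this is not the module's actual implementation.

```python
def year_of(name):
    """Hypothetical helper: the year is the number before the first underscore."""
    return float(name.split("_", 1)[0])

def make_entry(startseq, otherseq, distance):
    """Hypothetical helper: build one (yearseparation, distance, startseq, otherseq) tuple."""
    yearseparation = year_of(otherseq) - year_of(startseq)
    return (yearseparation, distance, startseq, otherseq)

startseq = "1983.14_STRAIN_A/Kentucky/UR06-0372/2007_HOST_Human_SUBTYPE_H1N1_COUNTRY_USA_n1"
otherseq = "1957.50_STRAIN_A/Baylor/4052/1981_HOST_Human_SUBTYPE_H1N1_COUNTRY_USA_n1"
entry = make_entry(startseq, otherseq, 2.109830328352041)
```

For these two example names the separation is negative (about -25.64 years), because otherseq is assigned an earlier year than startseq.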