epitope_mapper.sequtils Module

This file contains utilities for manipulating sequences.

Performs functions such as reading / writing FASTA files, translating sequences, etc.

Written by Jesse Bloom.


sequtils.AAThreeToOne(aathree)

Converts a three letter amino acid code into a one letter code.

The single input argument is the three letter amino acid code. It can be of any case.

This function returns, in upper case, the one letter amino acid code. It raises an exception if aathree is not a valid three letter code. ‘Xaa’ is converted to ‘X’.

>>> AAThreeToOne('Ala')
'A'
>>> AAThreeToOne('cys')
'C'
>>> AAThreeToOne('Xaa')
'X'
>>> AAThreeToOne('hi')
Traceback (most recent call last):
ValueError: Invalid amino acid code of hi.
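
The behaviour described above can be sketched with a plain dictionary lookup. This is only an illustration: the name aa_three_to_one and the table _THREE_TO_ONE are hypothetical, and the set of codes the module itself accepts may be larger than the 20 standard residues plus ‘Xaa’ assumed here.

```python
# Hypothetical lookup table (assumption): 20 standard residues plus 'Xaa'.
_THREE_TO_ONE = {
    'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C',
    'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
    'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P',
    'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V',
    'XAA': 'X',
}


def aa_three_to_one(aathree):
    """Return the upper-case one letter code for a three letter code."""
    try:
        # Upper-casing the key makes the lookup case-insensitive.
        return _THREE_TO_ONE[aathree.upper()]
    except KeyError:
        raise ValueError("Invalid amino acid code of %s." % aathree)
```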

sequtils.AmbiguousNTCodes(nt)

Returns all possible nucleotides corresponding to an ambiguous code.

This method takes as input a single character nt, which is assumed to be one of the accepted nucleotide codes, possibly an ambiguous one. It returns a list of all the nucleotides that the code can stand for. It raises an exception if nt is not a valid nucleotide code.

>>> AmbiguousNTCodes('N')
['A', 'T', 'G', 'C']
>>> AmbiguousNTCodes('R')
['A', 'G']
>>> AmbiguousNTCodes('A')
>>> AmbiguousNTCodes('-')
>>> AmbiguousNTCodes('F')
Traceback (most recent call last):
ValueError: Invalid nt code of "F"

sequtils.ClassifyBySeason(genomes, startyear, endyear, subsample, season)

sequtils.CondenseSeqs(seqs, maxdiffs, exclude_positions)

Removes nearly identical protein sequences.

seqs is a list of sequences as (head, seq) 2-tuples. The sequences are assumed to be aligned.

maxdiffs specifies the maximum number of differences that a sequence can have from another sequence in order to be removed.

exclude_positions is a list of integers specifying positions at which a sequence must NOT differ from another sequence if it is to be removed. These integers number the sequence positions as 1, 2, ...

The method proceeds as follows:

  1. For the first sequence, iterates through the rest of the sequences and removes any that differ at <= maxdiffs sites and do not differ at any of the sites specified by exclude_positions.
  2. Repeats this process for each of the remaining sequences.

The returned variable is seqs with the removed sequences gone.

>>> seqs = [('s1', 'ATGC'), ('s2', 'ATGA'), ('s3', 'ATC-'), ('s4', 'ATGC')]
>>> CondenseSeqs(seqs, 0, [])
[('s1', 'ATGC'), ('s2', 'ATGA'), ('s3', 'ATC-')]
>>> CondenseSeqs(seqs, 1, [])
[('s1', 'ATGC'), ('s3', 'ATC-')]
>>> CondenseSeqs(seqs, 1, [4])
[('s1', 'ATGC'), ('s2', 'ATGA'), ('s3', 'ATC-')]
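
The two steps listed above can be sketched as a greedy loop. This is a minimal illustration rather than the module's code: the name condense_seqs is hypothetical, and it assumes that a gap character counts as a difference like any other mismatch, which is consistent with the doctest output above.

```python
def condense_seqs(seqs, maxdiffs, exclude_positions):
    """Greedy removal of nearly identical aligned sequences (illustrative)."""
    # Convert the 1-based excluded positions to 0-based string indices.
    excluded = set(i - 1 for i in exclude_positions)
    remaining = list(seqs)
    kept = []
    while remaining:
        # The next surviving sequence becomes the reference and is kept.
        ref_head, ref_seq = remaining.pop(0)
        kept.append((ref_head, ref_seq))
        survivors = []
        for head, seq in remaining:
            diffs = [i for i, (a, b) in enumerate(zip(ref_seq, seq)) if a != b]
            removable = (len(diffs) <= maxdiffs
                         and not any(i in excluded for i in diffs))
            if not removable:
                survivors.append((head, seq))
        remaining = survivors
    return kept
```

Because each reference is compared only against sequences that come after it, the result depends on the input order: an earlier sequence always wins over a later near-duplicate.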

sequtils.DateToOrdinal(datestring, refyear)

Converts a date string to an ordinal date.

datestring is a date given by a string such as ‘2007/2/13’ (for Feb-13-2007), or ‘2007/2/’ if no day is specified, or ‘2007//’ if no day or month is specified. The ‘/’ characters can also be ‘-’.

refyear is an integer year from the approximate timeframe we are examining. It is used to anchor the datestring date, on the assumption that each year has 365.25 days.

The returned value is a number (decimal) giving the date. If no day is specified, the 15th (halfway through the month) is chosen. If no month or day is specified, July 1 (halfway through the year) is chosen.

>>> print "%.2f" % DateToOrdinal('2007/4/27', 1968)
>>> print "%.2f" % DateToOrdinal('2007/4/', 1968)
>>> print "%.2f" % DateToOrdinal('2007//', 1968)
>>> print "%.2f" % DateToOrdinal('2007-4-27', 1968)
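
The parsing rules and defaults described above can be sketched as follows. The helper name parse_date_with_defaults is hypothetical, and the sketch deliberately stops at the (year, month, day) triple: the exact arithmetic used to turn that triple into an ordinal relative to refyear is not spelled out here, so it is not reproduced.

```python
import re


def parse_date_with_defaults(datestring):
    """Split 'YYYY/M/D' (or '-' separated) and apply the documented defaults."""
    # Either '/' or '-' may separate the three fields.
    year, month, day = re.split(r'[/-]', datestring.strip())
    year = int(year)
    if not month:
        # No month or day given: use July 1, halfway through the year.
        return (year, 7, 1)
    if not day:
        # Month but no day given: use the 15th, halfway through the month.
        return (year, int(month), 15)
    return (year, int(month), int(day))
```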

sequtils.FindMotifs(seq, motif)

Finds occurrences of a specific motif in a nucleotide sequence.

seq is a string giving a nucleotide sequence.

motif is a string giving the motif that we are looking for. It should be a string of valid nucleotide characters:

  • A Adenine
  • G Guanine
  • C Cytosine
  • T Thymine
  • U Uracil
  • R Purine (A or G)
  • Y Pyrimidine (C or T)
  • N Any nucleotide
  • W Weak (A or T)
  • S Strong (G or C)
  • M Amino (A or C)
  • K Keto (G or T)
  • B Not A (G or C or T)
  • H Not G (A or C or T)
  • D Not C (A or G or T)
  • V Not T (A or G or C)

The returned variable is a list, motif_indices, giving the index at which each occurrence of motif in seq begins. For example, if there is a motif beginning at seq[7], then 7 will be present in motif_indices. The number of occurrences of the motif is therefore len(motif_indices).

This function is not case sensitive, so nucleotides can be either upper or lower case. In addition, T (thymine) and U (uracil) nucleotides are treated identically, so the function can handle either DNA or RNA sequences.

>>> FindMotifs('ATCGAA', 'WCGW')
[1]
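
One way to implement this kind of search is to translate the motif into a regular expression, one character class per ambiguity code. The sketch below is an assumption about approach, not the module's code: the names find_motifs and _IUPAC are hypothetical, seq is assumed to contain only A, C, G, T or U, and overlapping occurrences are reported, which the description above does not confirm.

```python
import re

# Ambiguity codes mapped to regular-expression character classes.
# U is treated as T so that DNA and RNA are handled identically.
_IUPAC = {
    'A': 'A', 'G': 'G', 'C': 'C', 'T': 'T', 'U': 'T',
    'R': '[AG]', 'Y': '[CT]', 'N': '[ACGT]', 'W': '[AT]', 'S': '[GC]',
    'M': '[AC]', 'K': '[GT]', 'B': '[CGT]', 'H': '[ACT]',
    'D': '[AGT]', 'V': '[ACG]',
}


def find_motifs(seq, motif):
    """Return the start index of every occurrence of motif in seq."""
    seq = seq.upper().replace('U', 'T')
    pattern = ''.join(_IUPAC[nt] for nt in motif.upper())
    # Wrapping the pattern in a lookahead reports overlapping matches too.
    return [m.start() for m in re.finditer('(?=%s)' % pattern, seq)]
```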

sequtils.GetEntries(namelist, fastafile, allow_substring=False)

Gets selected entries from a (potentially very large) FASTA file.

This method is designed to extract sequences from a FASTA file. It will work even if the FASTA file is very large, since it avoids reading the entire file into memory at once.

namelist specifies the “names” of the sequences that we want to extract from the FASTA file. The “name” of a sequence is the string immediately following the “>” in the FASTA file header for a sequence, terminated by a whitespace character (space, tab, or newline). So for example, the header:

>E_coli_thioredoxin: the thioredoxin protein from E. coli

would correspond to a name of “E_coli_thioredoxin”. ‘namelist’ specifies a list of these names.

fastafile is the name of a FASTA file that contains the sequences we are searching for. For this method to be guaranteed to work properly, each sequence in the FASTA file must have a unique name, where a “name” is as defined above. Uniqueness is not rigorously checked, so if names are repeated the function may raise an exception, or it may continue and give no hint of the problem.

allow_substring is an optional Boolean switch specifying that each name given in namelist need only be a substring of the first entry in the fastafile header, rather than match it exactly. By default, this option is False.

The function expects to find exactly one entry in fastafile for each name listed in namelist. If it does not, it will raise an exception. The returned variable is a list composed of 2-tuples. Element i of this list corresponds to the name given by namelist[i]. Each 2-tuple has the form ‘(header, sequence)’ where ‘header’ is the full FASTA header, but with the leading “>” character and any trailing linebreaks/spaces removed. ‘sequence’ is a string giving the sequence, again with the trailing linebreak removed.
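
The memory-saving idea can be sketched with a generator that yields one entry at a time, so only the requested sequences are ever held. This is an illustration written for Python 3, not the module's implementation: iter_fasta and get_entries are hypothetical names, the sketch takes an open file handle instead of a file name so that it can be run on an in-memory example, and allow_substring is not modelled.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) pairs one entry at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def get_entries(namelist, handle):
    """Return [(header, sequence), ...] in the order given by namelist."""
    wanted = set(namelist)
    found = {}
    for header, seq in iter_fasta(handle):
        # The name is the first whitespace-delimited token of the header.
        name = header.split()[0] if header else ''
        if name in wanted:
            if name in found:
                raise ValueError("Duplicate entry for %s" % name)
            found[name] = (header, seq)
    missing = [name for name in namelist if name not in found]
    if missing:
        raise ValueError("No entry found for: %s" % ', '.join(missing))
    return [found[name] for name in namelist]


# Usage on a small in-memory file with made-up entries.
demo = io.StringIO(">s1 first\nATG\nCAA\n>s2 second\nGGC\n")
entries = get_entries(['s2', 's1'], demo)
```

Unlike the behaviour described above, this sketch does check for repeated names among the requested entries.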

sequtils.GetSequence(header, headers_sequences)

Gets a particular sequence based on its header name.

header specifies the name of a sequence’s FASTA header.

headers_sequences is a list of tuples ‘(head, seq)’ as would be returned by Read.

This function searches through headers_sequences and returns the sequence corresponding to the first header found that matches the calling argument header. If no such header is found, raises an exception.

sequtils.ParseInfluenzaGenomes(infile, genes)

Parses influenza genomes from a FASTA file.

This function is designed to read a FASTA file that contains each of the proteins or genes for an influenza genome, and then return all strains with full genomes.


  • infile is a string giving the name of the FASTA file. The headers should be of the following format:

    >AAX11456 A/New York/61A/2003(H3N2) 2003/12/20 M1
    >AAB06984 A/Louisiana/4/93(H3N2) 1993// NA
    >CAD29965 A/Panama/2007/1999 1999// NA

    These headers give the sequence number, the strain name (the subtype is optional; see the third example header), the year/month/day of isolation, and the gene.

  • genes is a list of all required gene segments, as listed in the header. For example, you might commonly have genes = [‘PB2’, ‘PB1’, ‘PA’, ‘HA’, ‘NP’, ‘NA’, ‘M1’, ‘M2’, ‘NS1’, ‘NS2’].


This function returns the dictionary genomes. This dictionary is keyed by the 4-tuples (strain, year, month, day). If no month is specified then month is None, otherwise it is the numeric month. Likewise for day. So the keys for the examples above would be:

  • (‘A/New York/61A/2003(H3N2)’, 2003, 12, 20)
  • (‘A/Louisiana/4/93(H3N2)’, 1993, None, None)
  • (‘A/Panama/2007/1999’, 1999, None, None)

The value for each key is another dictionary. It is keyed by each string gene in genes, and the values are the sequences for those genes. The sequences are all in upper case.

genomes only contains those strains for which each gene in genes has a sequence specified. If there are multiple sequences provided for a gene, takes the first one encountered. If there are multiple full genome entries for the same strain name with some providing more detailed month/day information, takes the one with more detailed month/day information.
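
The mapping from a header to its dictionary key can be sketched as below. The helper name parse_header is hypothetical, and the sketch assumes the header layout shown in the examples above: the first field is the sequence number, the last is the gene, the one before it is the date, and everything in between is the strain name, which may contain spaces.

```python
def parse_header(header):
    """Split a header into ((strain, year, month, day), gene)."""
    fields = header.lstrip('>').split()
    gene, date = fields[-1], fields[-2]
    # Rejoin the middle fields, since strain names such as
    # 'A/New York/61A/2003(H3N2)' contain spaces.
    strain = ' '.join(fields[1:-2])
    year, month, day = date.split('/')
    key = (strain, int(year),
           int(month) if month else None,
           int(day) if day else None)
    return key, gene
```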


Removes all sequences with ambiguous positions from nucleotide sequences.

This function takes a single calling argument headers_sequences, which is a list of tuples (header, seq) as would be returned by Read. These sequences should specify nucleotide sequences. It returns a new list which is a copy of headers_sequences, except that all sequences that contain ambiguous nucleotide entries (i.e. characters that are not ‘A’, ‘T’, ‘C’, ‘G’, ‘a’, ‘t’, ‘c’, or ‘g’) have been removed.
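
A minimal sketch of such a filter, keeping only sequences made up entirely of A, T, C and G in either case. The name purge_ambiguous is hypothetical and this is not the module's code.

```python
def purge_ambiguous(headers_sequences):
    """Keep only sequences composed entirely of unambiguous nucleotides."""
    allowed = set('ATCGatcg')
    # set(seq) <= allowed is True when every character is an allowed one.
    return [(head, seq) for head, seq in headers_sequences
            if set(seq) <= allowed]
```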


sequtils.PurgeDuplicates(headers_sequences)

Removes all duplicate sequences and those that are substrings.

This function takes a single calling argument headers_sequences, which is a list of tuples (header, seq) as would be returned by Read. It returns a new list in which each sequence appears exactly once. If a sequence appears more than once in the original calling list, the sequence that is kept is the first one encountered in the list. Sequences that are substrings of others are also removed.

The ordering of sequences is preserved except for the removal of duplicates. Sequences are returned as all upper case.

>>> PurgeDuplicates([('seq1', 'atgc'), ('seq2', 'GGCA'), ('seq3', 'ATGC')])
[('seq1', 'ATGC'), ('seq2', 'GGCA')]
>>> PurgeDuplicates([('seq1', 'atgcca'), ('seq2', 'TGCC')])
[('seq1', 'ATGCCA')]
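
The rules above can be sketched as follows. The name purge_duplicates is hypothetical, and the pairwise comparison is a simple quadratic approach chosen for clarity, not necessarily how the module does it.

```python
def purge_duplicates(headers_sequences):
    """Drop repeated sequences and sequences contained in another one."""
    upper = [(head, seq.upper()) for head, seq in headers_sequences]
    kept = []
    for i, (head, seq) in enumerate(upper):
        # Skip exact repeats of a sequence seen earlier in the list.
        if any(seq == other for _, other in upper[:i]):
            continue
        # Skip sequences strictly contained in a different sequence.
        if any(seq != other and seq in other for _, other in upper):
            continue
        kept.append((head, seq))
    return kept
```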

sequtils.Read(fastafile)

Reads sequences from a FASTA file.

fastafile should specify the name of a FASTA file.

This function reads all sequences from the FASTA file. It returns the list headers_seqs. This list is composed of a 2-tuple ‘(header, seq)’ for every sequence entry in the FASTA file. ‘header’ is the header for a sequence, with the leading “>” and any trailing spaces removed. ‘seq’ is the corresponding sequence.


sequtils.ReverseComplement(heads_seqs)

Converts nucleotide sequences to their reverse complements.

The single input argument heads_seqs is a list of sequences in the format of tuples (head, seq) where head is the header and seq is the sequence. The sequences should all be nucleotide sequences composed exclusively of A, T, C, or G (or their lowercase equivalents). Ambiguous nucleotide codes are currently not accepted.

The returned variable is a copy of heads_seqs in which the headers are unchanged but the sequences are converted to reverse complements.

>>> ReverseComplement([('seq1', 'ATGCAA'), ('seq2', 'atgGCA')])
[('seq1', 'TTGCAT'), ('seq2', 'TGCcat')]
>>> ReverseComplement([('seq1', 'ATGNAA')])
Traceback (most recent call last):
ValueError: Invalid nucleotide code.
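
A minimal sketch of the same behaviour, assuming a case-preserving complement table. The names reverse_complement and _COMPLEMENT are hypothetical, not the module's own.

```python
# Complement table that preserves the case of each nucleotide.
_COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C',
               'a': 't', 't': 'a', 'c': 'g', 'g': 'c'}


def reverse_complement(heads_seqs):
    """Return a copy of heads_seqs with each sequence reverse complemented."""
    result = []
    for head, seq in heads_seqs:
        try:
            # Walk the sequence backwards, complementing each nucleotide.
            rc = ''.join(_COMPLEMENT[nt] for nt in reversed(seq))
        except KeyError:
            raise ValueError("Invalid nucleotide code.")
        result.append((head, rc))
    return result
```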

sequtils.Translate(headers_sequences, readthrough_n=False, readthrough_stop=False, truncate_incomplete=False, translate_gaps=False)

Translates a set of nucleotide sequences to amino acid sequences.

This function takes as input a single calling argument headers_sequences, which is a list of tuples (header, seq) as would be returned by Read. The sequences should all specify valid coding nucleotide sequences. The returned variable is a new list in which all of the nucleotide sequences have been translated to their corresponding protein sequences, given by one letter codes. Stop codons are translated to ‘*’. If any of the nucleotide sequences do not translate to valid protein sequences, an exception is raised.

The optional argument readthrough_n specifies that if any nucleotides in the sequence are equal to an ambiguous nt code and therefore cannot be unambiguously translated into an amino acid, we simply translate through these nucleotides by making the corresponding amino acid equal to “X”. By default, this option is False. Note that even when this option is False, certain ambiguous nucleotides may still be translatable if all of their possible codons lead to the same amino acid.

The optional argument readthrough_stop specifies that if we encounter any premature stop codons, we simply translate them to ‘*’. By default, this option is False, meaning that we instead raise an error reporting a premature stop codon.

The optional argument truncate_incomplete specifies that if the sequence length is not a multiple of three, we simply truncate off the one or two final nucleotides to make the length a multiple of three prior to translation. By default, this option is False, meaning that no such truncation is done.

The optional argument translate_gaps specifies that a codon containing ‘-’ is translated to ‘-’. By default, this option is False, meaning that a gap in the sequence raises an error.

>>> Translate([('seq1', 'ATGTAA'), ('seq2', 'gggtgc')])
[('seq1', 'M*'), ('seq2', 'GC')]
>>> Translate([('seq2', 'GGNTGC')])
[('seq2', 'GC')]
>>> Translate([('seq2', 'NGGTGC')])
Traceback (most recent call last):
ValueError: Cannot translate codon NGG
>>> Translate([('seq2', 'NGGTGC')], readthrough_n=True)
[('seq2', 'XC')]
>>> Translate([('seq2', 'TAATGC')])
Traceback (most recent call last):
ValueError: Premature stop codon
>>> Translate([('seq2', 'TAATGC')], readthrough_stop=True)
[('seq2', '*C')]
>>> Translate([('seq2', 'TGCA')])
Traceback (most recent call last):
ValueError: Sequence length is not a multiple of three
>>> Translate([('seq2', 'TGCA')], truncate_incomplete=True)
[('seq2', 'C')]
>>> Translate([('seq2', 'TGC---')])
Traceback (most recent call last):
ValueError: Cannot translate gap.
>>> Translate([('seq2', 'TGC---')], translate_gaps=True)
[('seq2', 'C-')]

Converts all unknown ambiguous amino acids to gap characters.

This function converts all of the common codes for unknown amino acids to gaps. That is, “B” (Asp or Asn), “Z” (Glu or Gln), and “X” (any amino acid) are all converted to the character “-”, which is usually taken to denote a gap. You would use this method if you are subsequently processing the sequences with a program that recognizes the gap character but not these three other unknown characters. Note, of course, that you are changing your sequence, so don’t do this unless you need to.

This method takes as a single calling variable the list header_sequences as would be returned by Read. It returns a copy of this list, but with all unknown amino acids replaced by “-”.
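
A minimal sketch of the replacement, assuming only the upper-case codes “B”, “Z” and “X” are converted. The name unknowns_to_gaps is hypothetical.

```python
def unknowns_to_gaps(headers_sequences):
    """Return a copy of the list with B, Z and X replaced by '-'."""
    converted = []
    for head, seq in headers_sequences:
        for code in 'BZX':
            seq = seq.replace(code, '-')
        converted.append((head, seq))
    return converted
```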

sequtils.Write(headers_seqs, filename, writable_file=False)

Writes sequences to a FASTA file.

headers_seqs is a list of 2-tuples specifying sequences and their corresponding headers. Each entry is the 2-tuple (header, seq) where header is a string giving the header (without the leading “>”), and seq is the corresponding sequence.

filename is a string that specifies the name of the file to which the headers and sequences should be written. If this file already exists, it is overwritten.

writable_file is a Boolean switch specifying that filename is not a string giving the name of a file, but is instead a writable file object to which the sequences should be written.

The sequences are written to the file in the same order that they are specified in headers_seqs.
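
The effect of the writable_file switch can be sketched as follows, written for Python 3. The name write_fasta is hypothetical, and the output layout (each sequence on a single line, with no wrapping) is an assumption rather than something stated above.

```python
import io


def write_fasta(headers_seqs, filename, writable_file=False):
    """Write (header, seq) pairs to a file name or an open file object."""
    # With writable_file=True, 'filename' is already a writable object.
    handle = filename if writable_file else open(filename, 'w')
    try:
        for header, seq in headers_seqs:
            handle.write('>%s\n%s\n' % (header, seq))
    finally:
        # Only close handles that this function opened itself.
        if not writable_file:
            handle.close()


# Usage: write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_fasta([('s1', 'ATGC'), ('s2', 'GGCA')], buffer, writable_file=True)
```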

sequtils.WriteNEXUS(seqs, outfile, seqtype)

Writes sequences to a NEXUS file.

seqs is a list of 2-tuples specifying sequences as (name, sequence). The sequences should all be aligned so that they are of the same length.

outfile is the name of the output file we are writing, such as ‘file.nex’.

seqtype is the sequence type, such as ‘PROTEIN’ or ‘DNA’.
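
For orientation, a minimal NEXUS DATA block for an alignment can be generated as sketched below. The name format_nexus is hypothetical, and the exact layout that WriteNEXUS produces is not described here, so this shows only one valid form. Sequence names are assumed to contain no spaces or punctuation that would need quoting.

```python
def format_nexus(seqs, seqtype):
    """Return the text of a minimal NEXUS DATA block (illustrative only)."""
    lengths = set(len(seq) for _, seq in seqs)
    if len(lengths) != 1:
        raise ValueError("Sequences must all be the same length.")
    lines = [
        '#NEXUS',
        'BEGIN DATA;',
        'DIMENSIONS NTAX=%d NCHAR=%d;' % (len(seqs), lengths.pop()),
        'FORMAT DATATYPE=%s GAP=-;' % seqtype,
        'MATRIX',
    ]
    # One row per sequence: the name, a space, then the aligned sequence.
    lines.extend('%s %s' % (name, seq) for name, seq in seqs)
    lines.extend([';', 'END;'])
    return '\n'.join(lines) + '\n'
```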
