epitopefinder.align Module

This module contains functions to run sequence alignment programs.

It can be used to run PROBCONS or MUSCLE.

Written by Jesse Bloom.

align.AddDots(aligned_headers_seqs)

Adds dots at identities in multiple sequence alignment.

Takes as an argument a list of two or more aligned sequences in the form of (header, sequence) tuples, as would be returned by Align.

Returns a copy of this list. The first sequence in the list is unchanged. In all remaining sequences, any character that is identical to the character at the same position in the first sequence is replaced by a dot (“.”). Gap characters are never replaced. Characters are replaced even if they differ in case.

>>> AddDots([('seq1', '-TGC'), ('seq2', 'AGGC'), ('seq3', '-tac')])
[('seq1', '-TGC'), ('seq2', 'AG..'), ('seq3', '-.a.')]
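
The replacement rule described above can be sketched in a few lines of standalone Python. This is a minimal illustration of the documented behavior, not the module's own implementation; the name add_dots_sketch is hypothetical.

```python
def add_dots_sketch(aligned_headers_seqs):
    """Hypothetical sketch: dot out characters that match the first sequence."""
    ref = aligned_headers_seqs[0][1]
    dotted = [aligned_headers_seqs[0]]  # the first sequence is kept unchanged
    for head, seq in aligned_headers_seqs[1:]:
        chars = []
        for ref_char, char in zip(ref, seq):
            # Compare case-insensitively and never replace a gap character.
            if char != '-' and char.upper() == ref_char.upper():
                chars.append('.')
            else:
                chars.append(char)
        dotted.append((head, ''.join(chars)))
    return dotted
```

Applied to the doctest input above, this sketch gives the same result as the documented output.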

align.Align(headers_seqs, progpath, program='MUSCLE')

Performs a multiple sequence alignment of two or more sequences.

By default, the sequences are aligned using MUSCLE, which can align either nucleotide or protein sequences. PROBCONS can be used instead to align protein sequences.

headers_seqs is a list specifying the sequences that we want to align. Each entry is a 2-tuple (head, seq) where head is a header giving the sequence name and other information (it may be empty) and seq is a string giving the sequence. The list must have at least 2 entries.

progpath specifies a directory containing the alignment program executable, either PROBCONS or MUSCLE. The PROBCONS executable is assumed to have the name “probcons” in this directory. The MUSCLE executable is assumed to have the name “muscle” in this directory.

program specifies what program to use for the alignment. By default, it is “MUSCLE”. If you wish to use PROBCONS instead, set it to “PROBCONS”.

This executable is used to perform a multiple sequence alignment with the default settings of PROBCONS or MUSCLE. The returned variable is a new list aligned_headers_seqs. Each entry is a 2-tuple (head, aligned_seq). head has the same meaning as on input (the sequence header) and aligned_seq is the aligned sequence, with gaps inserted as ‘-’ as appropriate. Therefore, all of the aligned_seq entries in aligned_headers_seqs are the same length. Entries in aligned_headers_seqs are in the same order as in the input list headers_seqs.
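
The input and output shapes described above can be illustrated without running an aligner. The sketch below assumes that sequences are exchanged with the executable as FASTA text and that headers are unique; neither point is stated in this documentation, and the helper names (to_fasta, parse_fasta, restore_input_order) are hypothetical rather than part of the module.

```python
def to_fasta(headers_seqs):
    """Hypothetical helper: format (head, seq) tuples as FASTA text."""
    if len(headers_seqs) < 2:
        raise ValueError("need at least two sequences to align")
    return ''.join('>%s\n%s\n' % (head, seq) for head, seq in headers_seqs)


def parse_fasta(text):
    """Hypothetical helper: parse FASTA text into (head, seq) tuples."""
    records = []
    head, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if head is not None:
                records.append((head, ''.join(chunks)))
            head, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may be wrapped over several lines
    if head is not None:
        records.append((head, ''.join(chunks)))
    return records


def restore_input_order(headers_seqs, aligned):
    """Return aligned records in the order of the input list.

    Assumes headers are unique (an assumption of this sketch).
    """
    by_head = dict(aligned)
    return [(head, by_head[head]) for head, _ in headers_seqs]
```

The reordering step matters because an aligner may write its output in a different order than its input, whereas the returned list is documented to follow the input order.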

align.GetEpitopeAlignments(epitope, targetseqs, maxmismatches, musclepath)

Finds alignments of an epitope to one or more target sequences.

epitope is a string giving the sequence of an epitope.

targetseqs is a list of one or more (header, sequence) tuples. These tuples give the target protein sequences to which we attempt to align epitope. If there are multiple sequences, they must be aligned (i.e. all of the same length).

maxmismatches is the maximum number of mismatches that are allowed for an epitope to still be considered a match.

musclepath is the path to a directory containing the MUSCLE alignment program.

This function uses MUSCLE to align epitope to each of the sequences in targetseqs. If it aligns with <= maxmismatches mismatches and with no gaps, then this is considered a valid alignment. The possible outcomes are:

  • A return value of False if epitope does not align to any of the sequences in targetseqs with at most maxmismatches mismatches and no gaps.
  • A 3-tuple of the form (alignstart, alignend, heads) if epitope aligns to one or more of the sequences in targetseqs. In this case, epitope must align to the same position in all of the target sequences to which it aligns. alignstart is the index of the first position in the target sequence at which this alignment starts (in 1, 2, ... numbering of the sequence as it appears in targetseqs), alignend is the index of the last position in the target sequence (1, 2, ... numbering), and heads is a list of the headers for all sequences in targetseqs to which a successful alignment at these indices was found.
  • An Exception is raised if epitope aligns to multiple sequences in targetseqs but with different indices.
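
The three outcomes listed above can be sketched as follows. This is a simplified stand-in, not the module's method: instead of running MUSCLE, it slides the epitope along each target and counts mismatches, and it assumes the targets contain no gap characters. The function names and the choice of ValueError as the exception type are assumptions of this sketch.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_epitope_sketch(epitope, targetseqs, maxmismatches):
    """Hypothetical gapless matcher mirroring the documented outcomes."""
    hits = {}  # (alignstart, alignend) -> list of headers
    n = len(epitope)
    for head, seq in targetseqs:
        best = None  # (mismatches, 0-based start) of the best window so far
        for i in range(len(seq) - n + 1):
            mismatches = count_mismatches(epitope, seq[i:i + n])
            if mismatches <= maxmismatches and (best is None or mismatches < best[0]):
                best = (mismatches, i)
        if best is not None:
            # Convert to 1, 2, ... numbering with an inclusive end position.
            hits.setdefault((best[1] + 1, best[1] + n), []).append(head)
    if not hits:
        return False
    if len(hits) > 1:
        raise ValueError("epitope aligns at different indices in different targets")
    (alignstart, alignend), heads = next(iter(hits.items()))
    return (alignstart, alignend, heads)
```

For example, find_epitope_sketch('GILG', [('a', 'MKGILGFV'), ('b', 'MKGALGFV')], 1) matches both targets at positions 3 to 6, since the second target differs from the epitope at a single position.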

align.PairwiseStatistics(aligned_headers_seqs)

Computes the number of gaps and identities in a pairwise alignment.

This method is designed to compute statistics about an alignment of two sequences.

aligned_headers_seqs is a pair of aligned sequences, as would be returned by calling Align with two sequences (which should be of the same length since they have been aligned). That is, it is a list of 2-tuples:

[(head1, alignedseq1), (head2, alignedseq2)]

The method returns the 2-tuple (identities, gaps).

identities is a number between zero and one. It is the fraction of aligned residue pairs that are identical, with gapped positions excluded from the tally. It is computed by dividing the number of identities by the number of alignment positions at which neither sequence has a gap.

gaps is the fraction of positions in the alignment at which either sequence has a gap. It is computed by dividing the number of gapped positions by the length of the aligned sequences.

Upper and lower case characters are treated equivalently.

>>> print round(PairwiseStatistics([('seq1', 'TGCAT'), ('seq2', 'AG-AT')])[0], 2)
0.75
>>> print round(PairwiseStatistics([('seq1', 'tgcat'), ('seq2', 'AG-AT')])[1], 2)
0.2
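
The two fractions can be computed as in the sketch below, which follows the description above. It is an illustration rather than the module's implementation: the function name is hypothetical, and returning 0.0 when there is nothing to divide by is an assumption of this sketch.

```python
def pairwise_statistics_sketch(aligned_headers_seqs):
    """Hypothetical sketch: return (identities, gaps) for a pairwise alignment."""
    (_, seq1), (_, seq2) = aligned_headers_seqs
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    identical = gapped = 0
    # Upper-case both sequences so that case differences are ignored.
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == '-' or b == '-':
            gapped += 1
        elif a == b:
            identical += 1
    ungapped = len(seq1) - gapped
    identities = identical / float(ungapped) if ungapped else 0.0
    gaps = gapped / float(len(seq1)) if seq1 else 0.0
    return (identities, gaps)
```

For the doctest input above, three of the four ungapped positions are identical (0.75) and one of the five alignment positions contains a gap (0.2).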

align.RemoveDots(aligned_headers_seqs)

Removes dots at identities in a multiple sequence alignment.

This function effectively undoes what can be done by ‘AddDots’.

Takes as an argument a list of two or more aligned sequences in the form of (header, sequence) tuples, as would be returned by Align.

Returns a copy of this list. At any position where a sequence after the first has a ‘.’ character, that character is replaced by the character found at the same position in the first sequence.

>>> RemoveDots([('seq1', '-TGC'), ('seq2', 'AG..'), ('seq3', '-.a.')])
[('seq1', '-TGC'), ('seq2', 'AGGC'), ('seq3', '-TaC')]

align.StripGapsToFirstSequence(aligned_headers_seqs)

Strips gaps from a reference sequence, and all corresponding alignments.

On input, aligned_headers_seqs should be a set of two or more aligned sequences, as would be returned by Align.

The first sequence in this alignment is taken to correspond to the reference sequence. The returned variable is a list similar to aligned_headers_seqs, but with all positions corresponding to gaps in this reference sequence stripped away. All gap (‘-’) characters are removed from this reference sequence. In addition, in all other aligned sequences in aligned_headers_seqs, every character at the same position as a gap in the reference sequence is removed. Therefore, at the end of this procedure, all of the alignments have the same length as the reference sequence with its gaps stripped away. The headers are unchanged. The order of sequences in this stripped alignment is also unchanged.

>>> StripGapsToFirstSequence([('s1', '-AT-A-GC'), ('s2', 'AAT-TAGC'), ('s3', '--T-A-GC')])
[('s1', 'ATAGC'), ('s2', 'ATTGC'), ('s3', '-TAGC')]
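
The column filtering described above amounts to keeping only the alignment positions at which the reference has a non-gap character. A minimal sketch under that reading, with a hypothetical function name and not taken from the module itself:

```python
def strip_gaps_to_first_sketch(aligned_headers_seqs):
    """Hypothetical sketch: drop every column where the first sequence has a gap."""
    ref = aligned_headers_seqs[0][1]
    # Indices of the columns to keep: those where the reference is not a gap.
    keep = [i for i, char in enumerate(ref) if char != '-']
    return [(head, ''.join(seq[i] for i in keep))
            for head, seq in aligned_headers_seqs]
```

Gaps in the other sequences are kept whenever the reference has a residue in that column, which is why 's3' still starts with '-' in the doctest above.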

align.StripLeadingTrailingGapsToFirstSequence(aligned_headers_seqs)

Strips leading / trailing gaps from first sequence, and trims corresponding alignments.

On input, aligned_headers_seqs is a set of two or more aligned sequences, as would be returned by Align.

The first sequence in the alignment corresponds to the reference sequence. The returned variable is a list similar to aligned_headers_seqs, but all leading / trailing gaps have been stripped from the reference sequence. A leading gap (‘-’) is one that precedes the first non-gap character; a trailing gap is one that follows the last non-gap character. All other sequences have all positions corresponding to the leading / trailing gaps of the reference sequence trimmed as well. The headers and order of sequences are preserved.

>>> StripLeadingTrailingGapsToFirstSequence([('s1', '--ATA-GC-'), ('s2', 'TGATTA-CA')])
[('s1', 'ATA-GC'), ('s2', 'ATTA-C')]
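
Because only the ends are trimmed, the operation reduces to one slice applied to every sequence. The sketch below illustrates this; the function name is hypothetical and the code is not the module's implementation.

```python
def strip_leading_trailing_sketch(aligned_headers_seqs):
    """Hypothetical sketch: trim all sequences to the reference's non-gap span."""
    ref = aligned_headers_seqs[0][1]
    start = len(ref) - len(ref.lstrip('-'))  # number of leading gaps
    end = len(ref.rstrip('-'))               # index just past the last non-gap
    return [(head, seq[start:end]) for head, seq in aligned_headers_seqs]
```

Internal gaps in the reference are left in place, as in the 'ATA-GC' result of the doctest above.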
