Module for reading and handling epitopes.
This module is for handling information about epitopes. It is designed principally with the goal of parsing epitopes from the Immune Epitope Database (www.iedb.org).
Written by Jesse Bloom.
Details of classes and functions are provided in their individual docstrings below.
Assigns an MHC allele to its mhcclass, mhcgene, mhcgroup, and supertype.
Currently this only assigns alleles for ep.host equal to ‘Homo sapiens’ or some subset (such as ‘Homo sapiens Caucasian’).
CALLING VARIABLES:
RETURN VALUE:
On return, ep will have its mhcclass, mhcgene, mhcgroup, and supertype attributes updated. Note that they may still be None if they are not assignable from ep.mhcallele.
This function will raise an exception if ep.mhcallele cannot be processed.
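As a rough illustration of how an allele string such as 'HLA-B*35:01' breaks down into gene and group, here is a minimal sketch. The helper name parse_hla_allele and its regular expression are assumptions made for this example, not the module's actual implementation, and the supertype lookup is omitted.

```python
import re

# Hypothetical pattern (an assumption): 'HLA-<gene>*<group>' with optional
# further colon-separated fields, as in 'HLA-B*35:01'.
_HLA_PATTERN = re.compile(r'^HLA-([A-Z]+[0-9]*)\*([0-9]+)(?::[0-9A-Z:]+)?$')


def parse_hla_allele(mhcallele):
    """Return (mhcclass, mhcgene, mhcgroup) for a human allele string.

    Raises ValueError if the string does not match the assumed format.
    mhcclass is None when the gene is not one this sketch recognizes.
    """
    match = _HLA_PATTERN.match(mhcallele)
    if match is None:
        raise ValueError("cannot process allele: %r" % mhcallele)
    gene, group = match.group(1), match.group(2)
    if gene in ('A', 'B', 'C'):
        mhcclass = 'I'
    elif gene.startswith('D'):
        mhcclass = 'II'
    else:
        mhcclass = None
    return (mhcclass, gene, group)
```

With this sketch, parse_hla_allele('HLA-B*35:01') gives class 'I', gene 'B', and group '35'.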
Bases: object
Class for storing epitope information.
This class defines Epitope objects, which can be used to store information about epitopes.
Each Epitope object ep possesses the following attributes. If the attribute is not defined for an epitope on initialization, then that attribute is set to None:
ep.host : a string giving the host organism, for example ‘Homo sapiens’
ep.sourceorganism : a string giving the source organism, for example ‘Influenza A virus’
ep.sourcemolecule : a string giving the source molecule, for example ‘Nucleoprotein’
ep.sequence : a string giving the sequence of the epitope, for example ‘GILGFVFTL’. If sequence is assigned upon initialization of ep, then it will be converted to all upper case.
ep.mhcallele : a string giving the MHC allele, for example ‘HLA-B*35:01’
ep.assay : a string giving the assay used to identify the epitope, for example ‘ELISPOT; cytokine release IFNg’
ep.reference : a string giving the reference for the epitope, for example: ‘T Linnemann; G Jung; P Walden. J Virol. (2000). PMID 10954576’
ep.position : a 2-tuple giving the position in the protein as the starting and ending integer sequence positions, for example (46, 54)
ep.mhcclass : a string ‘I’ or ‘II’ specifying whether the epitope is MHC class I or MHC class II.
ep.mhcgene : a string specifying the MHC gene. For example, for humans could be ‘A’, ‘B’, ‘C’, ‘DQA1’, ‘DQB1’, ‘DPA1’, ‘DPB1’, ‘DRB1’, ‘DRB3’, ‘DRB4’, ‘DRB5’.
ep.supertype : a string giving the supertype assigned to an allele. For example, ‘A01’ or ‘B27’.
ep.mhcgroup : a string specifying the MHC group. For example, for ‘HLA-B*35:01’ this would be ‘35’.
No arguments are required to initialize an Epitope object; however, each of the above attributes can be assigned at initialization (unassigned attributes are set to None). For instance:
ep = Epitope(host='Homo sapiens', sequence='GILGFVFTL')
returns an Epitope object ep with ep.host set to ‘Homo sapiens’, ep.sequence set to ‘GILGFVFTL’, and all other attributes set to None.
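The initialization behavior described above (all attributes optional, defaulting to None, with the sequence upper-cased) could be sketched as follows. The class name EpitopeSketch and the rejection of unknown keywords are assumptions for illustration; this is not the module's own Epitope class.

```python
class EpitopeSketch(object):
    """Illustrative stand-in for an Epitope-like container (not the real class)."""

    _ATTRIBUTES = (
        'host', 'sourceorganism', 'sourcemolecule', 'sequence', 'mhcallele',
        'assay', 'reference', 'position', 'mhcclass', 'mhcgene', 'supertype',
        'mhcgroup',
    )

    def __init__(self, **kwargs):
        # Assumption: unknown keyword arguments are treated as an error.
        unknown = set(kwargs) - set(self._ATTRIBUTES)
        if unknown:
            raise TypeError("unknown attributes: %s" % ', '.join(sorted(unknown)))
        # Every attribute not supplied is set to None.
        for name in self._ATTRIBUTES:
            setattr(self, name, kwargs.get(name))
        # A sequence given at initialization is converted to upper case.
        if self.sequence is not None:
            self.sequence = self.sequence.upper()


ep = EpitopeSketch(host='Homo sapiens', sequence='gilgfvftl')
```

Here ep.sequence ends up as 'GILGFVFTL' and ep.mhcallele is None.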
Returns a string summary of epitope in CSV format.
ep is an Epitope object.
position is a 2-tuple of the format (start, end) indicating the starting and ending position to which ep aligns in the target sequence.
Returns a string in CSV format (entries each surrounded by quotes and separated by commas) of epitope ep and its alignment position. There are five entries:
- The epitope sequence.
- The starting and ending alignment position separated by a dash.
- The MHC allele information.
- Information about the epitope source / host and alignment position as taken from the source.
- The reference for the epitope.
Here is an example:
>>> ep = Epitope(assay='51 chromium release, cytotoxicity', mhcclass='I', reference='L G Tussey; S Rowland-Jones; T S Zheng; M J Androlewicz; P Cresswell; J A Frelinger; A J McMichael. Immunity. 1995. PMID 7542549', sequence='LRSRYWAI', sourcemolecule='NP', mhcallele='HLA-B*27:02', host='Homo sapiens', sourceorganism='Influenza A virus (A/X-31(H3N2))', mhcgroup='27', supertype='B27', position=(381, 388), mhcgene='B')
>>> position = (381, 388)
>>> s = EpitopeSummaryString(ep, position)
>>> print(s)
"LRSRYWAI","381-388","allele HLA-B*27:02; MHC class I; MHC gene B; MHC supertype B27; MHC group 27","host = Homo sapiens; source = Influenza A virus (A/X-31(H3N2)); molecule = NP; reported position = (381, 388); assay = 51 chromium release, cytotoxicity","L G Tussey; S Rowland-Jones; T S Zheng; M J Androlewicz; P Cresswell; J A Frelinger; A J McMichael. Immunity. 1995. PMID 7542549"
Summarizes allele classifications for epitopes.
eps should be a list of Epitope objects. Each of them should have been passed through AssignAlleleInfo.
out is a writable file-like object.
Writes a summary to out of the allele assignments for the epitopes in eps, as based on the mhcclass, mhcgene, mhcgroup, supertype attributes of the objects in this list. This summary contains information such as how many have each of these attributes assigned.
Purges potentially redundant epitopes.
Given a list of epitopes (epitopes) and the positions to which they align in a target protein (alignmentpositions), find all epitopes that overlap by >= overlap sites in their alignment to the target. These epitopes are then examined to see if they clearly come from a different MHC classification (according to the option mhc_classification) – if they do not, then the redundant epitopes are removed.
For this process, the epitopes are first sorted so that those with the shortest epitope sequence (best defined sequence) come first. Among epitopes with the same sequence lengths, those with the most detailed MHC classification are sorted to come first. We then move down this list asking which epitopes are redundant with these epitopes. Moving from shortest to longest ensures that distinct short epitopes that overlap the same longer epitope are not considered redundant.
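The ordering described above can be sketched with a sort key. This is only an illustration: the Ep named tuple is a hypothetical stand-in for Epitope objects, and measuring "most detailed MHC classification" by counting non-None attributes is an assumption, since the text does not say how detail is measured.

```python
from collections import namedtuple

# Hypothetical stand-in for Epitope objects, carrying only the fields used here.
Ep = namedtuple('Ep', ['sequence', 'mhcgene', 'supertype', 'mhcgroup'])


def sort_key(ep):
    """Shortest sequence first; ties broken by more MHC detail first."""
    # Assumption: detail = number of MHC attributes that are not None.
    detail = sum(1 for value in (ep.mhcgene, ep.supertype, ep.mhcgroup)
                 if value is not None)
    return (len(ep.sequence), -detail)


eps = [
    Ep('GILGFVFTLT', 'A', 'A02', '02'),   # longer sequence
    Ep('GILGFVFTL', None, None, None),    # short, no MHC detail
    Ep('GILGFVFTL', 'A', 'A02', '02'),    # short, full MHC detail
]
ordered = sorted(eps, key=sort_key)
```

In a fuller version the alignment positions would need to be sorted together with the epitopes, for example by sorting pairs built with zip.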
CALLING VARIABLES:
epitopes is a list of Epitope objects specifying the epitopes. At a minimum, they should all have their ep.sequence attribute set to a string.
alignmentpositions is a list of 2-tuples of the same length as epitopes. For each epitope epitopes[i], the 2-tuple alignmentpositions[i] = (alignstart, alignend) gives the position that this epitope aligns to in the target protein. The epitope aligns to positions j where alignstart <= j <= alignend. These alignment positions are used to determine the extent to which two epitopes overlap in their alignment.
overlap is an integer specifying the amount of overlap that two epitopes must possess before they are considered potentially redundant. Any two epitopes that overlap (as determined by alignmentpositions) by >= overlap residues are potentially redundant.
mhc_classification is how we determine whether two epitopes with >= overlap alignment are indeed redundant. Possible values are the following strings:
- MHCgene : two overlapping epitopes are considered redundant if they share their ep.mhcgene attributes (or if one has this attribute set to None).
- MHCsupertype : two overlapping epitopes are considered redundant if they share their ep.mhcgene attribute and if they share their ep.supertype attribute. An attribute set to None is considered to match any other value of the attribute.
- MHCgroup : two overlapping epitopes are considered redundant if they share their ep.mhcgene, ep.supertype, and ep.mhcgroup attributes. An attribute set to None is considered to match any other value of the attribute.
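The overlap test and the three classification options above can be sketched as a pair of small helpers. The names alignment_overlap, same_mhc, potentially_redundant, and the Ep named tuple are all hypothetical, introduced only for this example; the sketch follows the rules as stated in the text and is not the function's actual code.

```python
from collections import namedtuple

# Hypothetical stand-in for Epitope objects, carrying only the MHC fields.
Ep = namedtuple('Ep', ['mhcgene', 'supertype', 'mhcgroup'])

# Attributes compared under each mhc_classification option.
_ATTRS_BY_OPTION = {
    'MHCgene': ('mhcgene',),
    'MHCsupertype': ('mhcgene', 'supertype'),
    'MHCgroup': ('mhcgene', 'supertype', 'mhcgroup'),
}


def alignment_overlap(pos1, pos2):
    """Number of positions shared by two inclusive (start, end) ranges."""
    (start1, end1), (start2, end2) = pos1, pos2
    return max(0, min(end1, end2) - max(start1, start2) + 1)


def same_mhc(ep1, ep2, mhc_classification):
    """True if every compared attribute matches; None matches anything."""
    for attr in _ATTRS_BY_OPTION[mhc_classification]:
        value1, value2 = getattr(ep1, attr), getattr(ep2, attr)
        if value1 is not None and value2 is not None and value1 != value2:
            return False
    return True


def potentially_redundant(ep1, pos1, ep2, pos2, overlap, mhc_classification):
    """Combine the overlap threshold with the MHC comparison."""
    return (alignment_overlap(pos1, pos2) >= overlap
            and same_mhc(ep1, ep2, mhc_classification))
```

For example, two epitopes aligned to (46, 54) and (50, 60) share five positions, so they are candidates for redundancy when overlap is 5 but not when it is 6.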
RETURN VARIABLE:
This function returns the 4-tuple (unique_epitopes, unique_positions, redundant_epitopes, redundant_positions). All four elements of this 4-tuple are lists of the same length. The list unique_epitopes contains one representative of each set of redundant epitopes, and it always contains the one with the shortest ep.sequence. For each epitope unique_epitopes[i], the element unique_positions[i] is a 2-tuple of numbers specifying the epitope alignment position as in alignmentpositions. For each element i of unique_epitopes, the element redundant_epitopes[i] is a list of all other epitopes redundant with unique_epitopes[i]. If there are no such redundant epitopes, then redundant_epitopes[i] is an empty list. For each element j of redundant_epitopes[i], the element redundant_positions[i][j] is the 2-tuple of numbers specifying the epitope alignment position for redundant epitope redundant_epitopes[i][j] as in alignmentpositions.
Reads epitopes from compact CSV downloads from Immune Epitope Database.
This function reads the compact form of the CSV file downloads that can be made of sets of epitopes from the Immune Epitope Database (www.iedb.org). If the format of these CSV files is radically reconfigured, then this function may no longer work; however, it is designed to raise an Exception in that case rather than give spurious output. It functioned well on files downloaded on April 15, 2013, and will presumably function on downloads from other dates, although that has not been confirmed.
infile should be a string giving the name of a CSV file download. It is assumed that this file contains only epitopes for which the assay result is ‘Positive’. If it contains entries for which the result is not ‘Positive’, then the function raises an exception.
The function returns a list epitopelist where each entry is an Epitope object containing the relevant information for an epitope in infile.
Reads in supertype classification scheme.
supertypesfile is the name of a file specifying the supertype classifications. Each line of this file should list an allele at the mhcgene*mhcgroup:protein level of detail, followed by its supertype. Supertypes of ‘Unclassified’ are assigned None. For example:
A*01:01 A01
A*01:02 Unclassified
A*68:15 A02
B*07:43 B07
B*08:01 B08
unclassified_to_none is an optional argument which is True by default. If it is True, then any allele with a supertype classification of ‘Unclassified’ has its value set to None in the returned dictionary.
This function returns a dictionary keyed by the alleles (at the mhcgene*mhcgroup:protein level), with the values being the supertype classification for each allele.
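A minimal sketch of this parsing is shown below. It makes two assumptions that the text does not confirm: that allele and supertype are separated by whitespace, and that the input is an iterable of lines rather than a file name (chosen so the example needs no file on disk). The name read_supertypes is hypothetical.

```python
import io


def read_supertypes(lines, unclassified_to_none=True):
    """Build a dictionary mapping allele -> supertype from 'allele supertype' lines."""
    supertypes = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entries = line.split()  # assumption: whitespace-separated columns
        if len(entries) != 2:
            raise ValueError("expected 'allele supertype', got: %r" % line)
        allele, supertype = entries
        if unclassified_to_none and supertype == 'Unclassified':
            supertype = None
        supertypes[allele] = supertype
    return supertypes


example = io.StringIO(u"A*01:01 A01\nA*01:02 Unclassified\nB*08:01 B08\n")
supertypes = read_supertypes(example)
```

In this sketch supertypes['A*01:02'] is None, while passing unclassified_to_none=False would keep the string 'Unclassified'.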
Splits CSV line where entries can be in quotes.
This function takes a single argument line, which is a string specifying a line from a CSV (comma-separated values) file. The line is split into entries at the commas, with the exception that entries themselves can be in quotes, and a quoted entry is not split even if it contains a comma. This makes this function different from line.split(‘,’), which splits entries even if the comma is inside quotes. However, this function currently does not handle lines with quotes as part of the line entries.
The returned value is a list of the entries in the line, with leading / trailing whitespace removed.
>>> SplitCSVLine('entry1,entry2,entry3')
['entry1', 'entry2', 'entry3']
>>> SplitCSVLine('"entry1","entry2,two",entry3 ')
['entry1', 'entry2,two', 'entry3']
>>> SplitCSVLine('entry1,entry"2",entry3')
Traceback (most recent call last):
...
ValueError: entry contains internal quote
>>> SplitCSVLine(' ')
[]
>>> SplitCSVLine(',,entry3 ')
['', '', 'entry3']
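One way to get the behavior shown in the examples above is sketched below. The name split_csv_line and the character-by-character approach are assumptions for illustration; the actual SplitCSVLine implementation may differ, and the check for an unterminated quote is an extra not described in the text.

```python
def split_csv_line(line):
    """Split a CSV line on commas that fall outside double quotes."""
    line = line.strip()
    if not line:
        return []
    raw_entries = []
    current = []
    in_quotes = False
    for char in line:
        if char == '"':
            in_quotes = not in_quotes
            current.append(char)
        elif char == ',' and not in_quotes:
            raw_entries.append(''.join(current))
            current = []
        else:
            current.append(char)
    if in_quotes:
        raise ValueError("unterminated quote")
    raw_entries.append(''.join(current))

    entries = []
    for entry in raw_entries:
        entry = entry.strip()
        # Remove one pair of enclosing quotes, if present.
        if len(entry) >= 2 and entry[0] == '"' and entry[-1] == '"':
            entry = entry[1:-1]
        # Quotes anywhere else in the entry are not supported.
        if '"' in entry:
            raise ValueError("entry contains internal quote")
        entries.append(entry.strip())
    return entries
```

For general-purpose CSV parsing, the csv module in the Python standard library is the usual choice; this sketch only mirrors the narrower behavior documented here.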