mapmuts_siteprofileplots.py¶
- This program is no longer the state-of-the-art: You are suggested to instead
- use the dms_tools program
dms_logoplot
.
This script is for the visual display of site-specific amino-acid preferences or differential preferences. The primary output of this script is a sequence logo plot. The script can also overlay some information about aspects of protein structure (secondary structure or solvent accessibility, as well as other user-specified information).
In the notation used here, \(\pi_{r,a}\) indicates the preference of site r for amino acid a, where \(\sum_a \pi_{r,a} = 1\) since each site always contains one of the amino acids. The differential preferences are denoted by \(\Delta\pi_{r,a}\) (where \(\sum_a\Delta\pi_{r,a} = 0\)), and indicate the change in preference for amino acid a at site r.
This script is designed to handle the amino-acid preferences in the *_equilibriumpreferences.txt
files created by mapmuts_inferpreferences.py (it can also be used to analyze the *_equilibriumfreqs.txt
files produced by mapmuts_inferenrichment.py). It can also handle the differential preferences in the *_differentialpreferences_selection_*.txt
files output by mapmuts_inferdifferentialpreferences.py.
Note that the sequence logo plot produced by this script isn’t the standard logo plot in which the stack height is scaled according to the site entropy. Rather, the height of each stack is the same, and the height of amino acid a within a stack is proportional to \(\pi_{r,a}\) or \(\Delta\pi_{r,a}\). So the heights of the letters given the preference or differential preference of the site for that amino acid.
Dependencies¶
This script requires:
- matplotlib to make the plots.
- weblogo to make the sequence logo plots.
- pyPdf to overlay the secondary structure and solvent accessibility information onto the sequence logos plots.
- Optionally requires scipy to compute correlations (they will not be computed if scipy is not available).
Running the script¶
To run this script, simply create an Input file with the format described below. If you name your Input file siteprofileplots_infile.txt
, then run the command:
mapmuts_siteprofileplots.py siteprofileplots_infile.txt
Input file¶
The input file is a text file with a series of key / value pairs. The required keys are indicated below. The values should not include spaces.
Lines beginning with # and empty lines are ignored.
Whether we are plotting site preferences or differential preferences is determined by whether you specify sitepreferences or differentialsitepreferences as a key in the input file.
Keys for the input file:
sitepreferences : You must specify either this key or differentialpreferences, but only one of them. If you use this key, it should specify the name of the file giving the site preferences \(\pi_{r,a}\). Typically this would be the
*_equilibriumpreferences.txt
file created by mapmuts_inferpreferences.py, although it could also be the*_equilibriumfreqs.txt
file created by created by mapmuts_inferenrichment.py or mapmuts_enrichmentgeomeans.py. The file has a header, and then for each site r there is a line giving the site entropy \(h_r\) and all of the preferences \(\pi_{r,a}\). It is acceptable for a preference for a stop codon (denoted by a * character) to be either present or absent – but all other 20 amino acids must always be present. The site numbers listed in the file must be consecutive (i.e. there cannot be gaps) within siterange. Here is an example of a few lines of such a file:#SITE WT_AA SITE_ENTROPY PI_A PI_C PI_D PI_E PI_F PI_G PI_H PI_I PI_K PI_L PI_M PI_N PI_P PI_Q PI_R PI_S PI_T PI_V PI_W PI_Y PI_* 1 M 0.885694 0.000680047 0.00185968 8.48048e-06 0.00128217 0.0289946 0.000654589 0.0212144 0.0205306 0.0021597 0.00056726 0.876551 0.00130602 0.000789781 0.00144568 0.000414457 0.00071164 0.00126055 0.00167084 0.0352283 0.00176486 0.000905383 2 A 1.1671 0.709575 0.00177747 3.79503e-05 0.00298854 0.00390524 0.00131246 0.00379599 0.000913006 0.00121497 0.000535209 0.00276335 0.00140061 0.000694694 0.0032134 0.00464972 0.000463924 0.000908368 0.254241 0.00335749 0.00137437 0.000876985 3 N 3.2593 0.129992 0.000540239 4.2514e-06 0.00171887 0.0212062 0.0006086 0.102743 0.0431489 0.00524637 0.0771062 0.021714 0.0641921 0.00338082 0.0149898 0.0106522 0.13886 0.0247841 0.00607319 0.0500692 0.282389 0.000581555
differentialpreferences : You must specify either this key or sitepreferences, but only one of them. If you use this key, it should specify the name of a
differentialpreferences_selection_*.txt
file created by mapmuts_inferdifferentialpreferences.py,. The file has a header, and then for each site r there is a line giving the root-mean-square (RMS) differential preference \(\Delta\pi_{r,a}\) and all of the differentail preferences. It is acceptable for a differential preference for a stop codon (denoted by a * character) to be either present or absent – but all other 20 amino acids must always be prsent. The site numbers listed in the file must be consecutive (there cannot be gaps) with siterange. Here is an example of a few lines:#SITE WT_AA RMS_dPI dPI_A dPI_C dPI_D dPI_E dPI_F dPI_G dPI_H dPI_I dPI_K dPI_L dPI_M dPI_N dPI_P dPI_Q dPI_R dPI_S dPI_T dPI_V dPI_W dPI_Y dPI_* 1 M 0.033508 -0.0041802 -0.00160924 -0.00102518 -0.00293463 -0.00112235 -0.000467141 -0.00262952 -0.0032969 -0.00679166 -0.00304567 0.0300719 0.00925138 -0.000442459 -8.11477e-05 -0.00129343 -0.00128639 0.000560046 -0.00243062 -0.00313798 -0.002528 -0.00158079 2 K 0.0630635 -0.0159228 0.0128652 -0.00367557 -0.0167789 0.005023 -0.000661766 0.0260142 -0.0040351 0.0025629 0.00523521 0.00449474 0.00192782 0.0189381 0.0171668 -0.00881324 -0.00971018 0.00748405 0.0136674 -0.00289523 -0.0288229 -0.0240638
outfileprefix specifies the prefix affixed to each of the Output files detailed below. outfileprefix can include a directory as part of the prefix (such as
./plots/replicate-1_
), but if such a directory is included in part of the prefix name then the directory must already exist or an error will be raised – this script does not create new directories. If you do not want to affix any prefix to the Output files, set outfileprefix to None. Any existing files that have the same names specified by outfileprefix will be overwritten.siterange specifies the range of residues that are included in the plot. The value should be two numbers giving the first and the last residues to include in the plots. You would use this option if you want to exclude certain residues (such as the first one in the example above). If you want to include all of the residues, just set to this to be the string all. There must be residue numbers in sitepreferences for all sites in siterange.
dsspfile is an optional argument that allows you to compare the observed site entropies and preferences to the relative solvent accessibility (RSA) and the secondary structure for the sites. This will only be useful if a crystal structure for your protein is available. If you do not want to use this option, just set dsspfile to None or do not specify this key at all.
You can then use the DSSP webserver to calculate the secondary structures and the RSAs. If you do want to use this option, then run DSSP on your protein structure – this script is tested against output from the DSSP webserver, but should probably work on output from the standalone too. Then save the DSSP output in a text file, and specify the path to that file as the value for dsspfile. This script does not currently have a robust ability to parse the DSSP output, so you have to do some careful checks. In particular, you must make sure that residue numbers in the PDB file exactly match the residue numbering scheme used for the rest of this analysis (i.e. the same residue numbers found in sitepreferences, and that none of the residue numbers contain letter suffixes (such as 24A) as is sometimes the case in PDB files. It is not necessary that all of the residues be present in the PDB. If there are multiple PDB chains, you can specify them using the dsspchain option. DSSP only calculates absolute accessible surface areas (ASAs); the RSAs are computed by normalizing by the maximum ASAs given by Tien et al, 2013.
dsspchain is an option that is only required if dsspfile is set to a value other than None. In this case, it should specify the chain in the PDB file that we want to use. If there is only one chain, you can set this option to None.
add_rsa is an option that is meaningful only if dsspfile is being used. This option specifies that the relative solvent accessibilities (RSAs) are shown in a color-coded bar above the sequence logo plot. If you do not want to do this, specify a value of False or simply do not specify this key. If you do want to show the RSA bar, then specify a value of True. The RSAs are the fractional solvent accessibilities. If you specify add_rsa to True, then you must also specify one and only one of either add_ss or add_custom as True.
add_ss is an option that is meaningful only if dsspfile is being used. This option specifies that the sequence logo plot contains a bar showing the secondary structure. If you do not want to do this, specify a value of False or simply do not specify this key. If you do want to show the secondary structure bar, then specify a value of True. The secondary structures are classified as helix (DSSP categories G, H, and I), strand (DSSP categories B and E) or loop (any other DSSP category). If add_ss is True, then add_rsa must be True and add_custom must be False.
add_custom is an option that can be used to plot a custom discrete property using bars above the sequence logo plot. You do not need to specify any option for add_custom – if nothing is specified, that is the same as setting add_custom to False. Currently, if add_custom is True then add_rsa must also be True and add_ss must be False. The usage of add_custom is as follows:
add_custom antigenic antigenic_sites.txt binding binding_sites.txt
In this example, add_custom is being used to label antigenic sites and receptor binding sites with the overlay bar. You can specify multiple properties with add_custom (in the above example, there are two). Each property should first have a name (in this example these are antigenic and binding) that does not contain a space. Each property name should then be followed by the name of a file (in this example, there are antigenic_sites.txt and binding_sites.txt) that also does not contain a space. In these files, any line that is empty or begins with a # character is ignored. All other lines specify sites to be labeled with property name. The site is specified by the first entry on the line, which should be the residue number to label. The line can have additional entries, but they are ignored. Each site can only be specified as belonging to one property, so in the above example a site cannot simultaneously be classified as antigenic and binding.
nperline specifies how many sites are included per line in the sequence logo plot. For example, if you have a protein of length 500 and you set nperline to 100, then the sequence logo plot will span 5 lines each showing 100 sites. In general, reasonable values are probably in the range from 50 to 100, but you might adjust based on the size of your protein.
includestop specifies whether we include stop codons as possible amino acid identities if they are specified in sitepreferences or differentialpreferences (stop codons are denoted by a * character). If includestop is False, then stop codons are not considered a possible amino acid in the plots created here, and all of the other preferences are normalized so that the \(\pi_{r,a}\) values for all 20 non-stop amino acids a sum to one (if using sitepreferences), or all of the other differential preferences are normalized so that the \(\Delta\pi_{r,a}\) values for all 20 non-stop amino acids a sum to zero (if using differentialpreferences). The site entropy or RMS differential preference is recomputed after this normalization. If includestop is True, then stop codons are considered a possible amino acid for the plots created here – but note that they are assigned a preference of zero if no preference is assigned in sitepreferences. In the sequence logo plot, stop codons are indicated with a black X character rather than an asterisk.
Often, you will not really be interested in stop codons as a potential amino acid at a site because you will assume that they are not viable amino acids as they truncate the protein. This could provide a justification for setting includestop to False. On the other hand, checking to see that stop codons have very low preferences can be a good sanity check to make sure that the overall analysis is giving reasonable inferences – in that case, you might set includestop to True and then check that you don’t have lots of black X characters in the logo plot.
sitenumbermapping is an optional argument that can be used to change how residues are numbering in the logo plot (it currently has no effect on
site_entropy_plot.pdf
). If you either don’t specify sitenumbermapping or set it to None, then nothing is done. Otherwise, you should set it to a value that specifies the name of an existing text file in CSV (comma-separated file) format. Here is an example of a few lines:#SEQUENTIAL,ALTERNATIVE 1,3 2,4 3,5 4,6 5,7 6,8
Briefly, any lines in the specified file that are empty or begin with a # character are ignored. For all other lines, there should be two comma-separated integer entries. The first column should provide the sequential number for sites for sites specified by siterange. The second number gives the number that is actually assigned to the residue in the logo plot. Any sites specified in the sequential numbering that are outside siterange are still excluded. So for instance, the example file above would mean that the first residue is numbered as 3, and so on.
ymax is a variable that is only meaningful if differentialpreferences is being used. The maximum sum of the differential preferences is plus one and minus one. However, in many cases the differential preferences may not sum to even close to this number for any sites. When this is the case, all of the logo stacks may look very small. So if you set ymax, the heights of the logo stacks will only extend to the value specified by ymax (so if you set 0.5, the stacks will go from -0.5 to 0.5). By default, ymax is one, so this is its value if you do not set this argument.
Example input file¶
Here is an example input file:
# Input file for mapmuts_siteprofileplots
sitepreferences combined_equilibriumpreferences.txt
outfileprefix WT-1_
siterange 2 498
dsspfile ./DSSP_analysis/2IQH_monomerC.dssp
dsspchain C
add_rsa True
add_ss True
nperline 83
includestop True
Output files¶
This script will write some brief output to standard out (sys.stdout) describing its progress.
It will then generate the following plots, each of which has the indicated suffix preceded by outfileprefix.
If sitepreferences is used (rather than differentialpreferences), then the output files are:
site_preferences_logoplot.pdf
: A sequence logo plot showing the preference for each amino acid. The height of each letter is proportional to the preference for that amino acid, \(\pi_{r,a}\). The amino-acid letters are colored according to the Kyte-Doolittle hydrophobicity scale. The plot will show RSAs, SSs, or custom properties in colored bars depending on the values of add_rsa, add_ss, and add_custom – sites that lack this information in dsspfile are white. Here is an example:
site_entropy_plot.pdf
: A plot of the site entropy as a function of the residue position in the primary sequence.
entropy_rsa_correlationplot.pdf
: This plot is only produced if the dsspfile option is being used. In this case, it is a plot showing the correlation between the site entropy and the RSA (as calculated by DSSP) for all sites for which both pieces of information are available. If scipy is available, the Pearson correlation coefficient is shown on the plot.
If differentialpreferences is used (rather than sitepreferences), then the output files are:
differentialpreferences_logoplot.pdf
: A sequence logo plot showing the differential preference for each amino acid at each site. The height of each letter is proportional to the differential preference \(\Delta\pi_{r,a}\) for that amino acid. Note that these differential preferences can be either positive or negative – positive numbers are above the center line for each logo stack, negative ones are below. By default the logo stacks extend to plus and minus one, but this can be changed with the ymax option described above. The amino acids are colored according to the Kyte-Doolittle hydrophobicity scale. The plot will show RSAs, SSs, or custom properties depending on the values of add_rsa, add_ss, and add_custom – these properties are shown in colored bars above the plot.RMS_differentialpreference_plot.pdf
: A plot showing the root-mean-square (RMS) differential site preference as a function of the residue position in the primary sequence.RMSdifferentialpreference_rsa_correlationplot.pdf
: This plot is only produced if the dsspfile option is being used. In this case, it is a plot showing the correlation between the RMS differential preference and the RSA (as calculated by DSSP) for all sites for which both pieces of information are available. If scipy is available, the Pearson correlaton coefficient is shown on the plot.