mapmuts_siteprofileplots.py

This program is no longer the state-of-the-art: You are suggested to instead
use the dms_tools program dms_logoplot.

This script is for the visual display of site-specific amino-acid preferences or differential preferences. The primary output of this script is a sequence logo plot. The script can also overlay some information about aspects of protein structure (secondary structure or solvent accessibility, as well as other user-specified information).

In the notation used here, \(\pi_{r,a}\) indicates the preference of site r for amino acid a, where \(\sum_a \pi_{r,a} = 1\) since each site always contains one of the amino acids. The differential preferences are denoted by \(\Delta\pi_{r,a}\) (where \(\sum_a\Delta\pi_{r,a} = 0\)), and indicate the change in preference for amino acid a at site r.

This script is designed to handle the amino-acid preferences in the *_equilibriumpreferences.txt files created by mapmuts_inferpreferences.py (it can also be used to analyze the *_equilibriumfreqs.txt files produced by mapmuts_inferenrichment.py). It can also handle the differential preferences in the *_differentialpreferences_selection_*.txt files output by mapmuts_inferdifferentialpreferences.py.

Note that the sequence logo plot produced by this script isn’t the standard logo plot in which the stack height is scaled according to the site entropy. Rather, the height of each stack is the same, and the height of amino acid a within a stack is proportional to \(\pi_{r,a}\) or \(\Delta\pi_{r,a}\). So the heights of the letters given the preference or differential preference of the site for that amino acid.

Dependencies

This script requires:

  • matplotlib to make the plots.
  • weblogo to make the sequence logo plots.
  • pyPdf to overlay the secondary structure and solvent accessibility information onto the sequence logos plots.
  • Optionally requires scipy to compute correlations (they will not be computed if scipy is not available).

Running the script

To run this script, simply create an Input file with the format described below. If you name your Input file siteprofileplots_infile.txt, then run the command:

mapmuts_siteprofileplots.py siteprofileplots_infile.txt

Input file

The input file is a text file with a series of key / value pairs. The required keys are indicated below. The values should not include spaces.

Lines beginning with # and empty lines are ignored.

Whether we are plotting site preferences or differential preferences is determined by whether you specify sitepreferences or differentialsitepreferences as a key in the input file.

Keys for the input file:

  • sitepreferences : You must specify either this key or differentialpreferences, but only one of them. If you use this key, it should specify the name of the file giving the site preferences \(\pi_{r,a}\). Typically this would be the *_equilibriumpreferences.txt file created by mapmuts_inferpreferences.py, although it could also be the *_equilibriumfreqs.txt file created by created by mapmuts_inferenrichment.py or mapmuts_enrichmentgeomeans.py. The file has a header, and then for each site r there is a line giving the site entropy \(h_r\) and all of the preferences \(\pi_{r,a}\). It is acceptable for a preference for a stop codon (denoted by a * character) to be either present or absent – but all other 20 amino acids must always be present. The site numbers listed in the file must be consecutive (i.e. there cannot be gaps) within siterange. Here is an example of a few lines of such a file:

    #SITE   WT_AA   SITE_ENTROPY    PI_A    PI_C    PI_D    PI_E    PI_F    PI_G    PI_H    PI_I    PI_K    PI_L    PI_M    PI_N    PI_P    PI_Q    PI_R    PI_S    PI_T    PI_V    PI_W    PI_Y    PI_*
    1   M   0.885694    0.000680047 0.00185968  8.48048e-06 0.00128217  0.0289946   0.000654589 0.0212144   0.0205306   0.0021597   0.00056726  0.876551    0.00130602  0.000789781 0.00144568  0.000414457 0.00071164  0.00126055  0.00167084  0.0352283   0.00176486  0.000905383
    2   A   1.1671  0.709575    0.00177747  3.79503e-05 0.00298854  0.00390524  0.00131246  0.00379599  0.000913006 0.00121497  0.000535209 0.00276335  0.00140061  0.000694694 0.0032134   0.00464972  0.000463924 0.000908368 0.254241    0.00335749  0.00137437  0.000876985
    3   N   3.2593  0.129992    0.000540239 4.2514e-06  0.00171887  0.0212062   0.0006086   0.102743    0.0431489   0.00524637  0.0771062   0.021714    0.0641921   0.00338082  0.0149898   0.0106522   0.13886 0.0247841   0.00607319  0.0500692   0.282389    0.000581555
    
  • differentialpreferences : You must specify either this key or sitepreferences, but only one of them. If you use this key, it should specify the name of a differentialpreferences_selection_*.txt file created by mapmuts_inferdifferentialpreferences.py,. The file has a header, and then for each site r there is a line giving the root-mean-square (RMS) differential preference \(\Delta\pi_{r,a}\) and all of the differentail preferences. It is acceptable for a differential preference for a stop codon (denoted by a * character) to be either present or absent – but all other 20 amino acids must always be prsent. The site numbers listed in the file must be consecutive (there cannot be gaps) with siterange. Here is an example of a few lines:

    #SITE   WT_AA   RMS_dPI dPI_A   dPI_C   dPI_D   dPI_E   dPI_F   dPI_G   dPI_H   dPI_I   dPI_K   dPI_L   dPI_M   dPI_N   dPI_P   dPI_Q   dPI_R   dPI_S   dPI_T   dPI_V   dPI_W   dPI_Y   dPI_*
    1   M   0.033508    -0.0041802  -0.00160924 -0.00102518 -0.00293463 -0.00112235 -0.000467141    -0.00262952 -0.0032969  -0.00679166 -0.00304567 0.0300719   0.00925138  -0.000442459    -8.11477e-05    -0.00129343 -0.00128639 0.000560046 -0.00243062 -0.00313798 -0.002528   -0.00158079
    2   K   0.0630635   -0.0159228  0.0128652   -0.00367557 -0.0167789  0.005023    -0.000661766    0.0260142   -0.0040351  0.0025629   0.00523521  0.00449474  0.00192782  0.0189381   0.0171668   -0.00881324 -0.00971018 0.00748405  0.0136674   -0.00289523 -0.0288229  -0.0240638
    
  • outfileprefix specifies the prefix affixed to each of the Output files detailed below. outfileprefix can include a directory as part of the prefix (such as ./plots/replicate-1_), but if such a directory is included in part of the prefix name then the directory must already exist or an error will be raised – this script does not create new directories. If you do not want to affix any prefix to the Output files, set outfileprefix to None. Any existing files that have the same names specified by outfileprefix will be overwritten.

  • siterange specifies the range of residues that are included in the plot. The value should be two numbers giving the first and the last residues to include in the plots. You would use this option if you want to exclude certain residues (such as the first one in the example above). If you want to include all of the residues, just set to this to be the string all. There must be residue numbers in sitepreferences for all sites in siterange.

  • dsspfile is an optional argument that allows you to compare the observed site entropies and preferences to the relative solvent accessibility (RSA) and the secondary structure for the sites. This will only be useful if a crystal structure for your protein is available. If you do not want to use this option, just set dsspfile to None or do not specify this key at all.

    You can then use the DSSP webserver to calculate the secondary structures and the RSAs. If you do want to use this option, then run DSSP on your protein structure – this script is tested against output from the DSSP webserver, but should probably work on output from the standalone too. Then save the DSSP output in a text file, and specify the path to that file as the value for dsspfile. This script does not currently have a robust ability to parse the DSSP output, so you have to do some careful checks. In particular, you must make sure that residue numbers in the PDB file exactly match the residue numbering scheme used for the rest of this analysis (i.e. the same residue numbers found in sitepreferences, and that none of the residue numbers contain letter suffixes (such as 24A) as is sometimes the case in PDB files. It is not necessary that all of the residues be present in the PDB. If there are multiple PDB chains, you can specify them using the dsspchain option. DSSP only calculates absolute accessible surface areas (ASAs); the RSAs are computed by normalizing by the maximum ASAs given by Tien et al, 2013.

  • dsspchain is an option that is only required if dsspfile is set to a value other than None. In this case, it should specify the chain in the PDB file that we want to use. If there is only one chain, you can set this option to None.

  • add_rsa is an option that is meaningful only if dsspfile is being used. This option specifies that the relative solvent accessibilities (RSAs) are shown in a color-coded bar above the sequence logo plot. If you do not want to do this, specify a value of False or simply do not specify this key. If you do want to show the RSA bar, then specify a value of True. The RSAs are the fractional solvent accessibilities. If you specify add_rsa to True, then you must also specify one and only one of either add_ss or add_custom as True.

  • add_ss is an option that is meaningful only if dsspfile is being used. This option specifies that the sequence logo plot contains a bar showing the secondary structure. If you do not want to do this, specify a value of False or simply do not specify this key. If you do want to show the secondary structure bar, then specify a value of True. The secondary structures are classified as helix (DSSP categories G, H, and I), strand (DSSP categories B and E) or loop (any other DSSP category). If add_ss is True, then add_rsa must be True and add_custom must be False.

  • add_custom is an option that can be used to plot a custom discrete property using bars above the sequence logo plot. You do not need to specify any option for add_custom – if nothing is specified, that is the same as setting add_custom to False. Currently, if add_custom is True then add_rsa must also be True and add_ss must be False. The usage of add_custom is as follows:

    add_custom antigenic antigenic_sites.txt binding binding_sites.txt
    

    In this example, add_custom is being used to label antigenic sites and receptor binding sites with the overlay bar. You can specify multiple properties with add_custom (in the above example, there are two). Each property should first have a name (in this example these are antigenic and binding) that does not contain a space. Each property name should then be followed by the name of a file (in this example, there are antigenic_sites.txt and binding_sites.txt) that also does not contain a space. In these files, any line that is empty or begins with a # character is ignored. All other lines specify sites to be labeled with property name. The site is specified by the first entry on the line, which should be the residue number to label. The line can have additional entries, but they are ignored. Each site can only be specified as belonging to one property, so in the above example a site cannot simultaneously be classified as antigenic and binding.

  • nperline specifies how many sites are included per line in the sequence logo plot. For example, if you have a protein of length 500 and you set nperline to 100, then the sequence logo plot will span 5 lines each showing 100 sites. In general, reasonable values are probably in the range from 50 to 100, but you might adjust based on the size of your protein.

  • includestop specifies whether we include stop codons as possible amino acid identities if they are specified in sitepreferences or differentialpreferences (stop codons are denoted by a * character). If includestop is False, then stop codons are not considered a possible amino acid in the plots created here, and all of the other preferences are normalized so that the \(\pi_{r,a}\) values for all 20 non-stop amino acids a sum to one (if using sitepreferences), or all of the other differential preferences are normalized so that the \(\Delta\pi_{r,a}\) values for all 20 non-stop amino acids a sum to zero (if using differentialpreferences). The site entropy or RMS differential preference is recomputed after this normalization. If includestop is True, then stop codons are considered a possible amino acid for the plots created here – but note that they are assigned a preference of zero if no preference is assigned in sitepreferences. In the sequence logo plot, stop codons are indicated with a black X character rather than an asterisk.

    Often, you will not really be interested in stop codons as a potential amino acid at a site because you will assume that they are not viable amino acids as they truncate the protein. This could provide a justification for setting includestop to False. On the other hand, checking to see that stop codons have very low preferences can be a good sanity check to make sure that the overall analysis is giving reasonable inferences – in that case, you might set includestop to True and then check that you don’t have lots of black X characters in the logo plot.

  • sitenumbermapping is an optional argument that can be used to change how residues are numbering in the logo plot (it currently has no effect on site_entropy_plot.pdf). If you either don’t specify sitenumbermapping or set it to None, then nothing is done. Otherwise, you should set it to a value that specifies the name of an existing text file in CSV (comma-separated file) format. Here is an example of a few lines:

    #SEQUENTIAL,ALTERNATIVE
    1,3
    2,4
    3,5
    4,6
    5,7
    6,8
    

    Briefly, any lines in the specified file that are empty or begin with a # character are ignored. For all other lines, there should be two comma-separated integer entries. The first column should provide the sequential number for sites for sites specified by siterange. The second number gives the number that is actually assigned to the residue in the logo plot. Any sites specified in the sequential numbering that are outside siterange are still excluded. So for instance, the example file above would mean that the first residue is numbered as 3, and so on.

  • ymax is a variable that is only meaningful if differentialpreferences is being used. The maximum sum of the differential preferences is plus one and minus one. However, in many cases the differential preferences may not sum to even close to this number for any sites. When this is the case, all of the logo stacks may look very small. So if you set ymax, the heights of the logo stacks will only extend to the value specified by ymax (so if you set 0.5, the stacks will go from -0.5 to 0.5). By default, ymax is one, so this is its value if you do not set this argument.

Example input file

Here is an example input file:

# Input file for mapmuts_siteprofileplots
sitepreferences combined_equilibriumpreferences.txt
outfileprefix WT-1_
siterange 2 498
dsspfile ./DSSP_analysis/2IQH_monomerC.dssp
dsspchain C
add_rsa True
add_ss True
nperline 83
includestop True

Output files

This script will write some brief output to standard out (sys.stdout) describing its progress.

It will then generate the following plots, each of which has the indicated suffix preceded by outfileprefix.

If sitepreferences is used (rather than differentialpreferences), then the output files are:

  • site_preferences_logoplot.pdf : A sequence logo plot showing the preference for each amino acid. The height of each letter is proportional to the preference for that amino acid, \(\pi_{r,a}\). The amino-acid letters are colored according to the Kyte-Doolittle hydrophobicity scale. The plot will show RSAs, SSs, or custom properties in colored bars depending on the values of add_rsa, add_ss, and add_custom – sites that lack this information in dsspfile are white. Here is an example:

    example_2013Analysis_Influenza_NP_Aichi68_p1_site_preferences_logoplot.jpg
  • site_entropy_plot.pdf : A plot of the site entropy as a function of the residue position in the primary sequence.

  • entropy_rsa_correlationplot.pdf : This plot is only produced if the dsspfile option is being used. In this case, it is a plot showing the correlation between the site entropy and the RSA (as calculated by DSSP) for all sites for which both pieces of information are available. If scipy is available, the Pearson correlation coefficient is shown on the plot.

If differentialpreferences is used (rather than sitepreferences), then the output files are:

  • differentialpreferences_logoplot.pdf : A sequence logo plot showing the differential preference for each amino acid at each site. The height of each letter is proportional to the differential preference \(\Delta\pi_{r,a}\) for that amino acid. Note that these differential preferences can be either positive or negative – positive numbers are above the center line for each logo stack, negative ones are below. By default the logo stacks extend to plus and minus one, but this can be changed with the ymax option described above. The amino acids are colored according to the Kyte-Doolittle hydrophobicity scale. The plot will show RSAs, SSs, or custom properties depending on the values of add_rsa, add_ss, and add_custom – these properties are shown in colored bars above the plot.
  • RMS_differentialpreference_plot.pdf : A plot showing the root-mean-square (RMS) differential site preference as a function of the residue position in the primary sequence.
  • RMSdifferentialpreference_rsa_correlationplot.pdf : This plot is only produced if the dsspfile option is being used. In this case, it is a plot showing the correlation between the RMS differential preference and the RSA (as calculated by DSSP) for all sites for which both pieces of information are available. If scipy is available, the Pearson correlaton coefficient is shown on the plot.