epitopefinder.plot Module

Module for performing plotting.

This module uses pylab and matplotlib to make plots. Before running a function in this module, you should use the PylabAvailable function to determine whether pylab and matplotlib are available. If these modules are not available, calling any other function will raise an Exception. The pdf backend is used for matplotlib / pylab. This means that plots must be created as PDF files.

A few functions also utilize scipy for calculations. Before using these functions, you should use ScipyAvailable to see if scipy is available. Otherwise an exception will be raised.
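The availability checks described above can be pictured with a small stand-alone sketch. The helper name module_available is hypothetical and is not part of this module; it only illustrates the kind of test that PylabAvailable and ScipyAvailable are documented to perform, and the real functions may be implemented differently.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be located.

    Hypothetical helper for illustration only; it looks the module up
    without importing it.
    """
    return importlib.util.find_spec(name) is not None


# Decide up front which optional features can be used.
have_plotting = module_available("pylab") and module_available("matplotlib")
have_scipy = module_available("scipy")

if not have_plotting:
    print("pylab / matplotlib not found: skipping plots")
if not have_scipy:
    print("scipy not found: skipping correlation coefficients")
```

Checking once at start-up lets a script skip plotting gracefully instead of failing with an Exception in the middle of an analysis.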

List of functions

PylabAvailable

ScipyAvailable

CumulativeFractionPlot

Base10Formatter

SubsetPValue

SplitLabel

PlotLinearDensity

CorrelationPlot

PlotDistributionComparison

Details of functions

Provided in their individual documentation strings below.

plot.Base10Formatter(number, exp_cutoff, exp_decimal_digits, decimal_digits)

Converts a number into LaTeX formatting with scientific notation.

Takes a number and converts it to a string that can be shown in LaTeX using math mode. It is converted to scientific notation if it meets the criterion specified by exp_cutoff.

number: the number to be formatted; should be a float or integer. Currently only works for numbers >= 0.

exp_cutoff: convert to scientific notation if abs(math.log10(number)) >= this.

exp_decimal_digits: show this many digits after the decimal if number is converted to scientific notation.

decimal_digits: show this many digits after the decimal if number is NOT converted to scientific notation.

The returned value is the LaTeX string. If the number is zero, the returned string is simply ‘0’.

>>> Base10Formatter(103, 3, 1, 1)
'103.0'
>>> Base10Formatter(103.0, 2, 1, 1)
'1.0 \\times 10^{2}'
>>> Base10Formatter(103.0, 2, 2, 1)
'1.03 \\times 10^{2}'
>>> Base10Formatter(2892.3, 3, 1, 1) 
'2.9 \\times 10^{3}'
>>> Base10Formatter(0.0, 3, 1, 1) 
'0'
>>> Base10Formatter(0.012, 2, 1, 1)
'1.2 \\times 10^{-2}'
>>> Base10Formatter(-0.1, 3, 1, 1)
Traceback (most recent call last):
    ...
ValueError: number must be >= 0
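
The formatting behavior shown in the doctests above can be reproduced with the following minimal sketch. The name base10_formatter_sketch is hypothetical and this is not the module's own code. It assumes the exponent is floor(log10(number)), an assumption chosen only because it reproduces the doctest outputs above, including the 0.012 case.

```python
import math


def base10_formatter_sketch(number, exp_cutoff, exp_decimal_digits, decimal_digits):
    """Format a non-negative number as a LaTeX math-mode string (sketch)."""
    if number < 0:
        raise ValueError("number must be >= 0")
    if number == 0:
        return "0"
    # Assumption: the exponent is the floor of log10(number).
    exponent = math.floor(math.log10(number))
    if abs(exponent) >= exp_cutoff:
        # Scientific notation: mantissa \times 10^{exponent}
        mantissa = number / 10 ** exponent
        return "%.*f \\times 10^{%d}" % (exp_decimal_digits, mantissa, exponent)
    # Plain decimal notation
    return "%.*f" % (decimal_digits, number)


# Wrap the result in dollar signs to use it as a LaTeX math-mode label.
label = "$%s$" % base10_formatter_sketch(2892.3, 3, 1, 1)
```

This sketch does not handle a mantissa that rounds up to 10 (for example 9.99 with one decimal digit); how the real function treats that case is not stated here.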
plot.CorrelationPlot(xs, ys, plotfile, xlabel, ylabel, corr=None, title=False)

Plots the correlation between two variables as a scatter plot.

The data is plotted as a scatter plot.

This function uses pylab / matplotlib. It will raise an Exception if these modules cannot be imported (if PylabAvailable() == False).

The calling variables use LaTeX format for strings. So for example, ‘$10^5$’ will print the LaTeX equivalent of this string. Similarly, certain raw text strings (such as those including underscores) will cause problems if you do not escape the LaTeX format meaning. For instance, ‘x_label’ will cause a problem since underscore is not valid outside of math mode in LaTeX, so you would need to use ‘x\_label’ to escape the underscore.
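
As a rough illustration of the escaping described above, a label can be prepared before it is passed as xlabel, ylabel, or title. The helper escape_underscores is hypothetical and not part of this module. It assumes the label contains no math-mode segments and no underscores that are already escaped.

```python
def escape_underscores(label):
    """Escape underscores so a plain-text label is valid LaTeX outside math mode.

    Assumption: the label has no '$...$' math segments (where an underscore
    is a legitimate subscript) and no underscores that are already escaped.
    """
    return label.replace("_", r"\_")


xlabel = escape_underscores("x_label")
print(xlabel)  # prints x\_label
```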

CALLING VARIABLES:

  • xs and ys are lists of numbers, with the lists being of the same length. Entry xs[i] is plotted on the x-axis against entry ys[i] on the y-axis.
  • plotfile is a string giving the name of the plot PDF file that we create. It should end in the extension .pdf. If this plot already exists, it is overwritten.
  • xlabel is a string giving the label placed on the x-axis.
  • ylabel is a string giving the label placed on the y-axis.
  • corr specifies if we calculate and include a correlation coefficient on the plot. If it is None, then no correlation is computed. Otherwise, the coefficient is calculated using scipy (so this requires ScipyAvailable() == True). In this case, corr should be set to the string Pearson (to calculate the Pearson linear correlation coefficient) or to the string Spearman (to calculate Spearman’s rho rank-order correlation). In both cases, the correlations are reported along with the two-tailed P-values. They are written on the plot.
  • title is a string giving the title placed above the plot. It can be False if no title is to be used. Otherwise, it should be the title string (using LaTeX formatting; spaces are allowed). It is False by default.
plot.CumulativeFractionPlot(datalist, plotfile, title, xlabel)

Creates a cumulative fraction plot.

Takes a list of numeric data. Plots a cumulative fraction plot giving the fraction of the data points that are <= the indicated value.

datalist is a list of numbers giving the data for which we are computing the cumulative fraction plot. Raises an exception if this is an empty list.

plotfile is the name of the output plot file created by this method (such as ‘plot.pdf’). The extension must be ‘.pdf’.

title is a string placed above the plot as a title. Uses LaTeX formatting.

xlabel is the label given to the X-axis. Uses LaTeX formatting.

This function uses pylab / matplotlib. It will raise an Exception if these modules cannot be imported (if PylabAvailable() is False).
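The quantity plotted by a cumulative fraction plot can be computed as in the following sketch. The function name cumulative_fraction_points is hypothetical; the code only illustrates the definition given above (the fraction of data points <= each value) and is not the module's implementation.

```python
from bisect import bisect_right


def cumulative_fraction_points(datalist):
    """Return (value, fraction of data points <= value) pairs, sorted by value."""
    if not datalist:
        raise ValueError("datalist must not be empty")
    ordered = sorted(datalist)
    n = len(ordered)
    # bisect_right gives the number of entries <= x in the sorted list.
    return [(x, bisect_right(ordered, x) / n) for x in sorted(set(ordered))]


points = cumulative_fraction_points([3, 1, 2, 2])
print(points)  # [(1, 0.25), (2, 0.75), (3, 1.0)]
```
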

plot.PlotDistributionComparison(fullset, subset, fullsetname, subsetname, plotfile, xlabel, ylabel, title, nrandom, withreplacement, ymax=None)

Compares two distributions and tests if one has a greater mean.

This function can be used generally to compare and plot two distributions. Specifically, it creates a plot of the distributions of integers in the two data sets fullset and subset. For generating this plot, there is no actual requirement that subset be a true subset of fullset.

However, if subset is a true subset of fullset, then this function can also calculate and display the P-value for the hypothesis that the mean of subset is greater than the mean of fullset.

This function uses pylab / matplotlib. It will raise an Exception if these modules cannot be imported (if PylabAvailable() == False).

The calling variables use LaTeX format for strings. So for example, ‘$10^5$’ will print the LaTeX equivalent of this string. Similarly, certain raw text strings (such as those including underscores) will cause problems if you do not escape the LaTeX format meaning. For instance, ‘x_label’ will cause a problem since underscore is not valid outside of math mode in LaTeX, so you would need to use ‘x\_label’ to escape the underscore.

CALLING VARIABLES:

  • fullset is a list of integers giving the first data set.
  • subset is a list of integers giving the second data set. If you are using nrandom then subset should be a true subset of fullset (for the calculated P-value to make sense) and in this case there is a strict requirement that len(subset) < len(fullset).
  • fullsetname is a string giving the name used to label the distribution in fullset.
  • subsetname is a string giving the name used to label the distribution in subset.
  • plotfile is a string giving the name of the PDF plot file generated by this function. It must end in the extension .pdf. If this file already exists, it is overwritten.
  • xlabel is a string giving the label placed on the x-axis.
  • ylabel is a string giving the label placed on the y-axis.
  • title is a string that is used to label the plot. If it is set to an expression that evaluates to False, then no title is displayed.
  • nrandom specifies how we calculate the P-value that the mean of subset is < or >= the mean of len(subset) random samples drawn from fullset. If nrandom evaluates to False, then no P-value is computed or displayed. Otherwise, nrandom should give the number of random subsets of fullset that we test to compute the P-value. For example, a reasonable number might be nrandom=1e5. Whether the draws are done with or without replacement is specified by withreplacement.
  • withreplacement specifies how we calculate the P-value. The value of withreplacement is arbitrary if nrandom evaluates to False. Otherwise, withreplacement must be a bool variable of either True or False. If it is True, then the draws of the random subsets are done with replacement (so the same number can be drawn multiple times). If it is False, then the draws are done without replacement (so the same number is drawn at most once).
  • ymax is an optional argument setting the y-max of the plot. By default it is None, meaning that we let pylab find the best y-maximum. You can also specify a number if you want to set the y-maximum.
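
The constraints on subset, nrandom, and withreplacement listed above can be checked before calling the function. The helper check_comparison_args below is hypothetical and only restates those documented requirements; the checks that the real function performs, and the errors it raises, may differ.

```python
def check_comparison_args(fullset, subset, nrandom, withreplacement):
    """Return True if a P-value would be computed, False if nrandom is falsy.

    Raises ValueError when the documented requirements for computing a
    P-value are not met. Illustrative sketch only.
    """
    if not nrandom:
        # No P-value requested; withreplacement is then arbitrary.
        return False
    if not isinstance(withreplacement, bool):
        raise ValueError("withreplacement must be True or False")
    if not len(subset) < len(fullset):
        raise ValueError("len(subset) must be < len(fullset)")
    return True


fullset = [0, 1, 1, 2, 3, 5, 8]
subset = [5, 8]
will_compute = check_comparison_args(fullset, subset, 1e5, False)
```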
plot.PlotLinearDensity(datalist, plotfile, xlabel, ylabel, title=False, fixymax=False)

Plots linear density of variable as a function of primary sequence.

This function is designed to plot some variable (such as the number of epitopes) as a function of the primary sequence position. It creates an output PDF plot plotfile.

The data is plotted as lines. If there is more than one data series to be plotted, a legend is included.

This function uses pylab / matplotlib. It will raise an Exception if these modules cannot be imported (if PylabAvailable() == False).

The calling variables use LaTeX format for strings. So for example, ‘$10^5$’ will print the LaTeX equivalent of this string. Similarly, certain raw text strings (such as those including underscores) will cause problems if you do not escape the LaTeX format meaning. For instance, ‘x_label’ will cause a problem since underscore is not valid outside of math mode in LaTeX, so you would need to use ‘x\_label’ to escape the underscore.

CALLING VARIABLES:

  • datalist is a list specifying the data to plot. It should be a list of one or more 2-tuples of the form (label, data) where label is a string label used in the legend, and data is a list of 2-tuples (x, y) specifying the points to be plotted.
  • plotfile is a string giving the name of the plot PDF file that we create. It should end in the extension .pdf. If this plot already exists, it is overwritten.
  • xlabel is a string giving the label placed on the x-axis.
  • ylabel is a string giving the label placed on the y-axis.
  • title is a string giving the title placed above the plot. It can be False if no title is to be used. Otherwise, it should be the title string (using LaTeX formatting; spaces are allowed). It is False by default.
  • fixymax means that we fix the y-maximum to the specified value. This may be useful if you are making multiple plots for comparisons between them, and want them all to have the same y-maximum. Note that the value specified here is taken to be the data maximum; the actual maximum of the y-axis is somewhat higher to provide some padding space. It is False by default.
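
The datalist structure described above, and a data maximum that could be passed as fixymax to several plots, can be built as in this sketch. The helper names and the per-position counts are hypothetical and made up for illustration.

```python
def build_density_datalist(series):
    """Convert {label: {position: value}} into a list of (label, [(x, y), ...])."""
    return [(label, sorted(points.items())) for label, points in series.items()]


def shared_data_max(datalists):
    """Largest y value across several datalists, usable as a common fixymax."""
    return max(y for datalist in datalists for _, data in datalist for _, y in data)


# Hypothetical counts per sequence position for two separate plots.
plot_a = build_density_datalist({"series 1": {1: 0, 2: 3, 3: 5}})
plot_b = build_density_datalist({"series 1": {1: 2, 2: 7}, "series 2": {1: 1, 2: 4}})
ymax = shared_data_max([plot_a, plot_b])
print(plot_a)  # [('series 1', [(1, 0), (2, 3), (3, 5)])]
print(ymax)    # 7
```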
plot.PylabAvailable()

Returns True if pylab/matplotlib available, False otherwise.

You should call this function to test for the availability of the pylab/matplotlib plotting modules before using other functions in this module.

plot.ScipyAvailable()

Returns True if scipy is available, False otherwise.

plot.SplitLabel(label, splitlen, splitchar)

Splits a string with a return if it exceeds a certain length.

label: a string giving the label we might split.

splitlen: the maximum length of a label before we attempt to split it.

splitchar: the character added when splitting a label.

If len(label) > splitlen, we attempt to split the label in the middle by adding splitchar. The label is split as close to the middle as possible while splitting at a space.

No splitting, as the label length does not exceed splitlen

>>> SplitLabel('WT virus 1', 10, '\n')
'WT virus 1'

Splitting of this label

>>> SplitLabel('WT plasmid 1', 10, '\n')
'WT\nplasmid 1'

Splitting of this label

>>> SplitLabel('mutated WT plasmid 1', 10, '\n')
'mutated WT\nplasmid 1'
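
The splitting rule can be sketched as follows. The name split_label_sketch is hypothetical and this is not the module's own code. It assumes that ties between equally central spaces go to the earlier space, and that a label without spaces is returned unchanged; both assumptions are consistent with the doctests above but are not confirmed by them.

```python
def split_label_sketch(label, splitlen, splitchar):
    """Replace the space nearest the middle of `label` with `splitchar` (sketch)."""
    if len(label) <= splitlen:
        return label
    spaces = [i for i, ch in enumerate(label) if ch == " "]
    if not spaces:
        # Assumption: nothing to split on, so leave the label alone.
        return label
    middle = len(label) / 2
    # min() keeps the first of several equally close positions.
    split_at = min(spaces, key=lambda pos: abs(pos - middle))
    return label[:split_at] + splitchar + label[split_at + 1:]


print(repr(split_label_sketch("WT plasmid 1", 10, "\n")))  # 'WT\nplasmid 1'
```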
plot.SubsetPValue(subset, fullset, nrandom, withreplacement)

Computes the P-value that the mean of subset is less than or greater than the mean of fullset.

subset is a list of numbers.

fullset is a list of numbers with len(fullset) > len(subset).

nrandom is the number of random draws to use to compute the P-value.

withreplacement should be True or False. If True, the random draws are done with replacement (same value can be drawn multiple times). If False, the draws are done without replacement.

Computes the mean of the numbers in subset. Then performs nrandom draws of len(subset) samples (with or without replacement, depending on the value of withreplacement) from fullset. Determines whether the mean of the random subsets is < or >= the mean of subset. If it is <, computes the fraction of random subsets with a mean >= that of subset. If it is >=, computes the fraction of random subsets with a mean <= that of subset. Then returns the 2-tuple (gt_or_lt, fraction), where gt_or_lt is either “<” (if the mean of subset is >= the mean of the random subsets) or “>”. So fraction represents the one-sided P-value for the hypothesis that subset has a mean greater than or less than that of a random subset from fullset.
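
The resampling procedure described above can be sketched with the standard library alone. The name subset_p_value_sketch and the seed argument are hypothetical and not part of this module. The sketch also assumes that "the mean of the random subsets" means the average of the per-draw means, and it follows the first branch description (">=" when the two means are equal); the real function may differ on both points.

```python
import random


def subset_p_value_sketch(subset, fullset, nrandom, withreplacement, seed=None):
    """Return (direction, fraction) following the description above (sketch)."""
    rng = random.Random(seed)  # seed is an illustrative extra, for repeatability
    n = len(subset)
    subset_mean = sum(subset) / n
    random_means = []
    for _ in range(int(nrandom)):
        if withreplacement:
            draw = [rng.choice(fullset) for _ in range(n)]
        else:
            draw = rng.sample(fullset, n)
        random_means.append(sum(draw) / n)
    # Assumption: compare the average of the random-draw means with subset_mean.
    overall = sum(random_means) / len(random_means)
    if overall < subset_mean:
        # Random subsets are typically lower: count draws at least as high.
        hits = sum(1 for m in random_means if m >= subset_mean)
        return "<", hits / len(random_means)
    # Random subsets are typically higher (or equal): count draws at least as low.
    hits = sum(1 for m in random_means if m <= subset_mean)
    return ">", hits / len(random_means)


direction, fraction = subset_p_value_sketch(
    [10, 10], [1, 1, 1, 1, 10, 10], nrandom=200, withreplacement=False, seed=0
)
```

With only nrandom draws, the smallest nonzero fraction that can be reported is 1/nrandom, which is one reason a large value such as 1e5 is a sensible choice.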
