mapmuts_makealignments.py

Script to align overlapping paired-end Illumina reads to a target gene.

This is one of the core scripts of the mapmuts package.

To run this script from the prompt, first create a text infile of the format described below. Then simply type mapmuts_makealignments.py followed by the infile name. For example, if the name is infile.txt, type:

mapmuts_makealignments.py infile.txt

This script aligns overlapping paired-end sequencing reads to a target gene sequence. It has been tested on the alignment to a roughly 1.5 kb gene of 50 nt paired-end Illumina reads in FASTQ files generated by the Casava 1.8 pipeline, although it should work more generally.

The result of running this script is a variety of output files containing the actual alignments and summary information about the alignment process. If pylab / matplotlib are available, summary plots are also generated. If pdflatex is available, these plots are merged into a summary PDF document as well.

This script runs according to an input text file, the format of which is described below. This text file provides options for specifying the stringency and parameters of the alignments.

The alignment algorithm

Some important notes about how the alignments are performed:

  • The alignments performed by this script are not actually true global alignments, but rather direct searches for substrings allowing some mismatches. This is efficient for short reads that are being aligned to relatively short genes (a few kb). It will become increasingly less computationally efficient for longer reads or target genes. In addition, these alignment methods do not consider the possibility of gapped alignments. This tends not to be a problem for Illumina reads, where mismatches are the main type of sequencing error. It could be a problem for other types of sequencing data where indel errors are common.
  • N / n nucleotides do not count as mismatches, regardless of what they are aligned with. Instead, there is simply a limit on how many of these can occur in a read as specified by maxn described below.
  • The alignments are case-sensitive. In general, you will want to guarantee that the sequences are upper case. See the upcase option described in the description of the input file below.

The general algorithm is as follows:

  1. Paired Illumina reads (R1 and R2 reads) are read from a FASTQ file. If the applyfilter option is True, any reads that fail the Illumina chastity filter are discarded. In addition, any reads that fail the Q-score cutoff specified by minq are discarded.
  2. The R1 and R2 reads are aligned to each other assuming that reads have the terminal adaptor sequences specified by a1 and a2. The alignment must meet the standards specified by maxn, minoverlap, maxrm, maxa1m, and maxa2m. If the paired reads cannot be aligned to each other, they are discarded.
  3. The overlapped region of the R1 and R2 reads is aligned to the full target sequence given by fullgene. If the region cannot be aligned with <= maxgenem mismatches, the reads are discarded.
  4. The portion of the overlap to the gene range of interest (specified by generange) is written to an output file, assuming that some of the overlap is to the gene range of interest.
  5. Other output text files are generated with statistics from the alignment process.
  6. If pylab / matplotlib are available, summary plots are made.
  7. If pdflatex is available, an overall summary of the alignment statistics is made.

Format of the input file

The input file specifies key / value pairs on individual lines in the format:

key1 value1
key2 value2

Blank lines or lines that begin with # are ignored (i.e. as comment lines). The input file should contain the following keys:

  • The keys r1files and r2files specify the input FASTQ files, which can potentially be gzipped. After the key name should be a list of the name (including directory path if not in the current directory) of one or more FASTQ files. The r1files should contain the first (R1) reads of a paired-end set, and the r2files should contain the second (R2) reads of a paired-end set. Note that the entries in these files must be in the same order, and each must contain the same number of reads, so that when reads are read sequentially from r1files they pair with the reads read sequentially from r2files. In addition to specifically specifying the file names, you can also list general path information such as ./my_directory/*_R1*_.fastq.gz and the * character will be expanded to mean any value as in the standard Linux file naming. In this case, the lists of files matching the specification for r1files and r2files must be the same, and when sorted in alphabetical order the files must pair with each other when the reads are read sequentially. If you are using this sort of general name specification then you can only list one name each for r1files and r2files.

  • gzipped is a Boolean switch specifying whether the FASTQ files specified by r1files and r2files are gzipped. If it is set to True, these files should be gzipped. If it is set to False, they should not be gzipped.

  • applyfilter is a Boolean switch specifying whether we discard read pairs in which either the R1 or R2 read failed the Illumina chastity filter. If True, all reads that failed the filter are discarded. If False, they are not. In general, you probably want this option to be True unless you have a good reason for disregarding the Illumina chastity filter.

  • minq specifies the minimum acceptable average Q-score (with the average taken over all positions in the read). Any read pair in which either read has an average Q-score < minq is discarded. While setting very high values of minq might be supposed to give more accurate reads, in practice we have found that high values of minq do not seem to improve accuracy much, apparently because problematic reads tend to be removed in a more sensitive and selective manner by setting reasonably stringent values for the other alignment parameters below. Therefore, a modest value such as 20 or 25 is probably fine.

  • fullgenefile specifies the name of the file containing the full target sequence to which the reads will be aligned. In general, this is the PCR amplicon that was fragmented to prepare the reads. Typically, this full amplicon sequence will be longer than the actual gene of interest, since the PCR creation of the amplicon often adds extra terminal sequences. For successful alignment, it is important that this file specifies the full amplicon sequence, not just the gene region of interest. That region is specified by the argument generange. fullgenefile should be a FASTA file containing a single entry.

  • generange specifies the portion of fullgene that contains the gene of interest. The gene sequence is numbered sequentially 1, 2, 3, ... The generange key should be followed by two numbers. The first number gives the number of the first nucleotide of fullgene that is part of the gene of interest, and the last number gives the number of last nucleotide that is part of the gene of interest. The terminal stop codon should not be included.

  • a1file specifies the name of the file containing the adaptor sequence that is found at the 3’ end of any R1 reads that read past the insert and into the adaptor. a2file specifies the same for R2 reads. Both of these keys should specify FASTA files containing a single entry giving the adaptor.

  • maxn specifies the maximum number of N or n nucleotides that are allowed in either read of a read pair. If either read has > maxn N or n nucleotides, the read pair is discarded.

  • minoverlap specifies the minimum length of the overlap between R1 and R2, representing their joint coverage of the target sequence. The overlap between the two reads must be >= this number.

  • maxrm specifies the maximum number of mismatches allowed in the overlap of R1 and R2. The number of mismatches in the overlapped region of the two reads must be <= this number.

  • maxa1m specifies the maximum number of mismatches allowed in the overlap of R1 with adaptor a1. The number of mismatches must be <= this number. maxa2m specifies the same for R2 and its adaptor.

  • maxgenem specifies the maximum number of mismatches that is allowed for either read individually when it is aligned with fullgene after removing the read adaptor sequences (a1 or a2). The read pair is discarded if either read has > this many mismatches when aligned with the full target gene.

  • upcase specifies how we handle possible upper / lower case differences in nucleotide codes. This is important since the alignment procedures are case-sensitive (i.e. a is different than A). If upcase is set to True, then all sequences are converted to upper case before alignment. This is the safest option, but is also the slowest since it requires so many case conversions. Another option is test, which means that the target and adaptor sequences (fullgene, a1, a2) are converted to upper case, and the FASTQ files are tested to make sure the first entry is upper case (as is typically the case for Illumina read). If the first entry is upper case, no case conversion is done on the reads, which will be faster. If the first entry is not upper case, then an exception is raised, and you will probably have to switch this option to True.

  • outfileprefix specifies the prefix appended to all output files created by this script. These output files are detailed in the next section below. Any existing files are overwritten.

  • samplename specifies the name of the sample for this run of the script. Name are allowed to contain spaces.

  • write_unaligned specifies that we write all unaligned reads to the

    *_unaligned.fasta.gz file. This is done if set to True, not done if set to False. Each entry also contains a brief explanation of why it failed to align.

Here is an example input file:

# Example input file for mapmuts_makealignments.py
maxa2m 1
generange 62 1555
write_unaligned True
r2files /home/jbloom/mapmuts/examples/2013Analysis_Influenza_NP_Aichi68/FASTQ_files/replicate_A/Sample_WT-1_DNA/*R2*.fastq.gz
maxa1m 1
a1file /home/jbloom/mapmuts/examples/2013Analysis_Influenza_NP_Aichi68/R1_trim3.fasta
gzipped True
minq 25
minoverlap 30
maxgenem 6
r1files /home/jbloom/mapmuts/examples/2013Analysis_Influenza_NP_Aichi68/FASTQ_files/replicate_A/Sample_WT-1_DNA/*R1*.fastq.gz
a2file /home/jbloom/mapmuts/examples/2013Analysis_Influenza_NP_Aichi68/R2_trim3.fasta
maxrm 1
outfileprefix replicate_A_WT-1_DNA
fullgenefile /home/jbloom/mapmuts/examples/2013Analysis_Influenza_NP_Aichi68/Aichi68-NP_amplicon.fasta
maxn 2
upcase test
applyfilter True
samplename replicate_A, WT-1, DNA

Output files

Running this script produces a series of output files. The names of these files all include the prefix specified by outfileprefix in the input file, and then the indicated suffix. Any existing files are overwritten.

  • outfileprefix_makealignments_log.txt : a text file that tracks the output of this script.

  • outfileprefix_alignments.txt.gz : A gzipped text file containing the read pair to gene alignments for all read pairs that pass the various filters and cutoffs. Each alignment consists of five lines. These lines give:

    1. the read name,
    2. the alignment region of the gene with indices,
    3. the alignment region of R1 with indices,
    4. the alignment region of R2 with indices,
    5. a blank line.

    The indices indicate the first and last nucleotide shown in 1, 2, ... numbering. For the read, a . character is used to indicate identities with the gene at that position. The alignment region is shown only for regions of gene (the portion of fullgene specified by generange) covered by both reads. For example:

    @DH1DQQN1:241:C1433ACXX:2:1101:1623:2365
    14 ATAGATACTAGAT 26
    1 .....C.....T. 13
    13 .....C.......  1
    
  • outfileprefix_alignmentstatistics.txt : A text file giving a summary of the alignment statistics (i.e. how many reads were successfully aligned and how many were filtered by the various cutoffs) in what should be a self-explanatory format.

  • outfileprefix_insertlengths.txt : A text file giving the distribution of insert lengths for all R1-R2 read pairs that align with each other according to the given specifications. The shortest possible value of the insert length is minoverlap, while the longest possible value is the sume of the two read lengths minus minoverlap.

  • outfileprefix_R1mismatches.txt : A text file giving the fraction and number of mismatches between the R1 read and the fullgene template for all positions in the read that align with fullgene after passing the various filtration criteria. Read positions are numbered 1, 2, ...

  • outfileprefix_R2mismatches.txt : Like _R1mismatches.txt, but for R2 reads.

  • outfileprefix_alignmentstatistics.pdf : Created if pylab / matplotlib is available, a graphical summary of data in _alignmentstatistics.txt.

  • outfileprefix_insertlengths.pdf : Created if pylab / matplotlib is available, a graphical summary of data in _insertlengths.txt.

  • outfileprefix_R1mismatches.pdf : Created if pylab / matplotlib is available, a graphical summary of data in _R1mismatches.txt.

  • outfileprefix_R2mismatches.pdf : Created if pylab / matplotlib is available, a graphical summary of data in _R2mismatches.txt.

  • outfileprefix_makealignments_summary.pdf : created if pylab / matplotlib and pdflatex are available, an overall graphical summary of the alignment statistics, insert lengths, mismatch positions, and filtering criteria.

  • outfileprefix_unaligned.fasta.gz : created if write_unaligned is True, a gzipped FASTA file with all unaligned reads. Each entry also contains a brief explanation of why it failed to align.

Example of summary plot

Here is an example of the outfileprefix_makealignments_summary.pdf plot.

example_makealignments_summary.jpg