.. _mapmuts_makealignments.py: ====================================== mapmuts_makealignments.py ====================================== Script to align overlapping paired-end Illumina reads to a target gene. This is one of the core scripts of the `mapmuts`_ package. To run this script from the prompt, first create a text infile of the format described below. Then simply type :ref:`mapmuts_makealignments.py` followed by the infile name. For example, if the name is ``infile.txt``, type:: mapmuts_makealignments.py infile.txt This script aligns overlapping paired-end sequencing reads to a target gene sequence. It has been tested on the alignment to a roughly 1.5 kb gene of 50 nt paired-end Illumina reads in FASTQ files generated by the Casava 1.8 pipeline, although it should work more generally. The result of running this script is a variety of output files containing the actual alignments and summary information about the alignment process. If `pylab`_ / `matplotlib`_ are available, summary plots are also generated. If `pdflatex`_ is available, these plots are merged into a summary PDF document as well. This script runs according to an input text file, the format of which is described below. This text file provides options for specifying the stringency and parameters of the alignments. The alignment algorithm ----------------------------- Some important notes about how the alignments are performed: * The alignments performed by this script are not actually true global alignments, but rather direct searches for substrings allowing some mismatches. This is efficient for short reads that are being aligned to relatively short genes (a few kb). It will become increasingly less computationally efficient for longer reads or target genes. In addition, these alignment methods do not consider the possibility of gapped alignments. This tends not to be a problem for Illumina reads, where mismatches are the main type of sequencing error. It could be a problem for other types of sequencing data where indel errors are common. * N / n nucleotides do not count as mismatches, regardless of what they are aligned with. Instead, there is simply a limit on how many of these can occur in a read as specified by maxn described below. * The alignments are case-sensitive. In general, you will want to guarantee that the sequences are upper case. See the `upcase` option described in the description of the input file below. The general algorithm is as follows: 1) Paired Illumina reads (R1 and R2 reads) are read from a FASTQ file. If the `applyfilter` option is `True`, any reads that fail the Illumina chastity filter are discarded. In addition, any reads that fail the Q-score cutoff specified by `minq` are discarded. 2) The R1 and R2 reads are aligned to each other assuming that reads have the terminal adaptor sequences specified by `a1` and `a2`. The alignment must meet the standards specified by `maxn`, `minoverlap`, `maxrm`, `maxa1m`, and `maxa2m`. If the paired reads cannot be aligned to each other, they are discarded. 3) The overlapped region of the R1 and R2 reads is aligned to the full target sequence given by fullgene. If the region cannot be aligned with <= `maxgenem` mismatches, the reads are discarded. 4) The portion of the overlap to the gene range of interest (specified by `generange`) is written to an output file, assuming that some of the overlap is to the gene range of interest. 5) Other output text files are generated with statistics from the alignment process. 6) If `pylab`_ / `matplotlib`_ are available, summary plots are made. 7) If `pdflatex`_ is available, an overall summary of the alignment statistics is made. Format of the input file ---------------------------- The input file specifies `key` / `value` pairs on individual lines in the format:: key1 value1 key2 value2 Blank lines or lines that begin with # are ignored (i.e. as comment lines). The input file should contain the following keys: * The keys `r1files` and `r2files` specify the input FASTQ files, which can potentially be gzipped. After the key name should be a list of the name (including directory path if not in the current directory) of one or more FASTQ files. The `r1files` should contain the first (R1) reads of a paired-end set, and the `r2files` should contain the second (R2) reads of a paired-end set. Note that the entries in these files must be in the same order, and each must contain the same number of reads, so that when reads are read sequentially from `r1files` they pair with the reads read sequentially from `r2files`. In addition to specifically specifying the file names, you can also list general path information such as ``./my_directory/*_R1*_.fastq.gz`` and the ``*`` character will be expanded to mean any value as in the standard Linux file naming. In this case, the lists of files matching the specification for *r1files* and *r2files* must be the same, and when sorted in alphabetical order the files must pair with each other when the reads are read sequentially. If you are using this sort of general name specification then you can only list one name each for *r1files* and *r2files*. * `gzipped` is a Boolean switch specifying whether the FASTQ files specified by `r1files` and `r2files` are gzipped. If it is set to `True`, these files should be gzipped. If it is set to `False`, they should not be gzipped. * `applyfilter` is a Boolean switch specifying whether we discard read pairs in which either the R1 or R2 read failed the Illumina chastity filter. If `True`, all reads that failed the filter are discarded. If `False`, they are not. In general, you probably want this option to be `True` unless you have a good reason for disregarding the Illumina chastity filter. * `minq` specifies the minimum acceptable average Q-score (with the average taken over all positions in the read). Any read pair in which either read has an average Q-score < `minq` is discarded. While setting very high values of `minq` might be supposed to give more accurate reads, in practice we have found that high values of `minq` do not seem to improve accuracy much, apparently because problematic reads tend to be removed in a more sensitive and selective manner by setting reasonably stringent values for the other alignment parameters below. Therefore, a modest value such as 20 or 25 is probably fine. * `fullgenefile` specifies the name of the file containing the full target sequence to which the reads will be aligned. In general, this is the PCR amplicon that was fragmented to prepare the reads. Typically, this full amplicon sequence will be longer than the actual gene of interest, since the PCR creation of the amplicon often adds extra terminal sequences. For successful alignment, it is important that this file specifies the full amplicon sequence, not just the gene region of interest. That region is specified by the argument `generange`. `fullgenefile` should be a FASTA file containing a single entry. * `generange` specifies the portion of fullgene that contains the gene of interest. The gene sequence is numbered sequentially 1, 2, 3, ... The `generange` key should be followed by two numbers. The first number gives the number of the first nucleotide of `fullgene` that is part of the gene of interest, and the last number gives the number of last nucleotide that is part of the gene of interest. The terminal stop codon should not be included. * `a1file` specifies the name of the file containing the adaptor sequence that is found at the 3' end of any R1 reads that read past the insert and into the adaptor. `a2file` specifies the same for R2 reads. Both of these keys should specify FASTA files containing a single entry giving the adaptor. * `maxn` specifies the maximum number of N or n nucleotides that are allowed in either read of a read pair. If either read has > `maxn` N or n nucleotides, the read pair is discarded. * `minoverlap` specifies the minimum length of the overlap between R1 and R2, representing their joint coverage of the target sequence. The overlap between the two reads must be >= this number. * `maxrm` specifies the maximum number of mismatches allowed in the overlap of R1 and R2. The number of mismatches in the overlapped region of the two reads must be <= this number. * `maxa1m` specifies the maximum number of mismatches allowed in the overlap of R1 with adaptor `a1`. The number of mismatches must be <= this number. `maxa2m` specifies the same for R2 and its adaptor. * `maxgenem` specifies the maximum number of mismatches that is allowed for either read individually when it is aligned with `fullgene` after removing the read adaptor sequences (`a1` or `a2`). The read pair is discarded if either read has > this many mismatches when aligned with the full target gene. * `upcase` specifies how we handle possible upper / lower case differences in nucleotide codes. This is important since the alignment procedures are case-sensitive (i.e. a is different than A). If `upcase` is set to `True`, then all sequences are converted to upper case before alignment. This is the safest option, but is also the slowest since it requires so many case conversions. Another option is `test`, which means that the target and adaptor sequences (`fullgene`, `a1`, `a2`) are converted to upper case, and the FASTQ files are tested to make sure the first entry is upper case (as is typically the case for Illumina read). If the first entry is upper case, no case conversion is done on the reads, which will be faster. If the first entry is not upper case, then an exception is raised, and you will probably have to switch this option to `True`. * `outfileprefix` specifies the prefix appended to all output files created by this script. These output files are detailed in the next section below. Any existing files are overwritten. * `samplename` specifies the name of the sample for this run of the script. Name are allowed to contain spaces. * `write_unaligned` specifies that we write all unaligned reads to the ``*_unaligned.fasta.gz`` file. This is done if set to `True`, not done if set to `False`. Each entry also contains a brief explanation of why it failed to align. Here is an example input file:: # Example input file for mapmuts_makealignments.py maxa2m 1 generange 62 1555 write_unaligned True r2files /home/jbloom/mapmuts/examples/2013Analysis_Influenza_NP_Aichi68/FASTQ_files/replicate_A/Sample_WT-1_DNA/*R2*.fastq.gz maxa1m 1 a1file /home/jbloom/mapmuts/examples/2013Analysis_Influenza_NP_Aichi68/R1_trim3.fasta gzipped True minq 25 minoverlap 30 maxgenem 6 r1files /home/jbloom/mapmuts/examples/2013Analysis_Influenza_NP_Aichi68/FASTQ_files/replicate_A/Sample_WT-1_DNA/*R1*.fastq.gz a2file /home/jbloom/mapmuts/examples/2013Analysis_Influenza_NP_Aichi68/R2_trim3.fasta maxrm 1 outfileprefix replicate_A_WT-1_DNA fullgenefile /home/jbloom/mapmuts/examples/2013Analysis_Influenza_NP_Aichi68/Aichi68-NP_amplicon.fasta maxn 2 upcase test applyfilter True samplename replicate_A, WT-1, DNA Output files ---------------- Running this script produces a series of output files. The names of these files all include the prefix specified by `outfileprefix` in the input file, and then the indicated suffix. Any existing files are overwritten. * `outfileprefix`\_makealignments_log.txt : a text file that tracks the output of this script. * `outfileprefix`\_alignments.txt.gz : A gzipped text file containing the read pair to gene alignments for all read pairs that pass the various filters and cutoffs. Each alignment consists of five lines. These lines give: (1) the read name, (2) the alignment region of the gene with indices, (3) the alignment region of R1 with indices, (4) the alignment region of R2 with indices, (5) a blank line. The indices indicate the first and last nucleotide shown in 1, 2, ... numbering. For the read, a . character is used to indicate identities with the gene at that position. The alignment region is shown only for regions of gene (the portion of `fullgene` specified by `generange`) covered by both reads. For example:: @DH1DQQN1:241:C1433ACXX:2:1101:1623:2365 14 ATAGATACTAGAT 26 1 .....C.....T. 13 13 .....C....... 1 * `outfileprefix`\_alignmentstatistics.txt : A text file giving a summary of the alignment statistics (i.e. how many reads were successfully aligned and how many were filtered by the various cutoffs) in what should be a self-explanatory format. * `outfileprefix`\_insertlengths.txt : A text file giving the distribution of insert lengths for all R1-R2 read pairs that align with each other according to the given specifications. The shortest possible value of the insert length is `minoverlap`, while the longest possible value is the sume of the two read lengths minus `minoverlap`. * `outfileprefix`\_R1mismatches.txt : A text file giving the fraction and number of mismatches between the R1 read and the `fullgene` template for all positions in the read that align with `fullgene` after passing the various filtration criteria. Read positions are numbered 1, 2, ... * `outfileprefix`\_R2mismatches.txt : Like `_R1mismatches.txt`, but for R2 reads. * `outfileprefix`\_alignmentstatistics.pdf : Created if `pylab`_ / `matplotlib`_ is available, a graphical summary of data in `_alignmentstatistics.txt`. * `outfileprefix`\_insertlengths.pdf : Created if `pylab`_ / `matplotlib`_ is available, a graphical summary of data in `_insertlengths.txt`. * `outfileprefix`\_R1mismatches.pdf : Created if `pylab`_ / `matplotlib`_ is available, a graphical summary of data in `_R1mismatches.txt`. * `outfileprefix`\_R2mismatches.pdf : Created if `pylab`_ / `matplotlib`_ is available, a graphical summary of data in `_R2mismatches.txt`. * `outfileprefix`\_makealignments_summary.pdf : created if `pylab`_ / `matplotlib`_ and `pdflatex`_ are available, an overall graphical summary of the alignment statistics, insert lengths, mismatch positions, and filtering criteria. * `outfileprefix`\_unaligned.fasta.gz : created if `write_unaligned` is `True`, a gzipped FASTA file with all unaligned reads. Each entry also contains a brief explanation of why it failed to align. Example of summary plot --------------------------- Here is an example of the `outfileprefix`\_makealignments_summary.pdf plot. .. figure:: example_makealignments_summary.jpg :width: 60% :align: center :alt: example_makealignments_summary.jpg .. include:: weblinks.txt