2014 analysis of H1 influenza HAs descended from 1918 virus¶
Contents
This is a description of the analysis in The inherent mutational tolerance and antigenic evolvability of influenza hemagglutinin.
Overview¶
This is an analysis of influenza A virus H1 hemagglutinins (HAs) from human and swine that are descended from a common ancestor closely related the HA of the 1918 virus. This analysis constructs phylogenies of these sequences and analyzes them with standard and experimentally determined codon substitution models.
This analysis was performed by Jesse Bloom.
Much of the analysis itself is performed using HYPHY, which is run by phyloExpCM using the scripts detailed in the phyloExpCM documentation.
The code, data, and results from the analysis are in the ./examples/2014Analysis_Influenza_H1_HA/
subdirectory in the main phyloExpCM package directory. This subdirectory contains the necessary input files (see Input files) for running the analysis. It also contains a master script (run_analysis.py
) that can be used to regenerate the analysis with the command:
python run_analysis.py
provided that you have installed the software described in Software used.
Software used¶
The analysis was performed using the following software on the Fred Hutchinson computer cluster using the following Linux distribution:
Linux version 3.5.0-43-generic (buildd@roseapple) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #66~precise1-Ubuntu SMP Thu Oct 24 14:52:23 UTC 2013
In order to replicate the analysis, you will need to use Linux or Mac OS X, and install the software listed below.
The specific versions of software used are specified. Other versions will likely work as well, but this has not been confirmed.
Python (version 2.7.3)
phyloExpCM (specifically, phyloExpCM version 0.32).
numpy (version 1.6.1)
matplotlib (version 1.3.1).
mapmuts (specifically, mapmuts version 1.01)
Biopython (version 1.5.8)
EMBOSS needle (version 6.6.0). The executable with the name
needle
is expected to be in the current search path.RAxML (version 7.7.6). The executable with the name
raxmlHPC-SSE3
is expected to be in the current search path. This is the SSE3 version, and was built using the command:make -f Makefile.SSE3.gcc
and then copying the executable into the search path.
Path-O-Gen (version 1.4) was used for the visual identification of outliers from the molecular clock and to root trees based on date-stamping.
FigTree (version 1.4.0) was used to open tree files for annotation, visualization, and saving of images.
codonPhyML (version 1.00 201306.18) was used to build phylogenetic trees. The executable with the name
codonphyml
is expected to be in the current search path. This executable was compiled with the omp options to make it multi-threaded.HYPHY (version HYPHY 2.220131214beta(MP) for Linux on x86_64) was used to optimize the branches and compute the likelihoods under the different models. The multiprocessor (
HYPHYMP
) version of the executable was used. To install this, the following commands were run in the unzipped HYPHY source code directory after download:cmake -DINSTALL_PREFIX=/home/jbloom/.local/ . make MP2 make install
This analysis was run on the FHCRC scientific computing cluster using sbatch (version 2.6.2), and requires sbatch to be installed to run as currently implemented. If you don’t want to use sbatch, all you need to do is modify the main
run_analysis.py
script to not submit any jobs via sbatch. The use of sbatch in the current version simply allows more processors to be used simultaneously and so speeds up the analysis.ImageMagick convert 6.6.9-7 2012-08-17 was used to convert PDF files to JPGs.
weblogo (version 3.3) was used to make the sequence logo plot.
PyPdf (version 1.13) was used to make overlays on the sequence logo plot.
Input files¶
Here are the input files that are necessary to repeat the analysis. These are the names assigned with the ./examples/2014Analysis_Influenza_H1_HA/
subdirectory of the main phyloExpCM package.
run_analysis.py
is a Python script that will perform the analysis assuming that you have all of the appropriate software (see Software used) and input files. To run this script, use:python run_analysis.py
at the command line.
WSN-HA.fasta
is the coding DNA sequence for the A/WSN/1933 (H1N1) HA as taken from reverse-genetics plasmid pHW184-HA.H1_HAseqs.fasta
is the set of all full-length influenza A HA coding DNA sequences of subtype H1 (excluding lab-adapted strains) as downloaded from the Influenza Virus Resource on March-29-2014.parse_H1_HumanSwine.py
is a Python scripts that extracts and aligns the HA sequences inH1_HAseqs.fasta
JVI_82_8947_Anomalies.txt
is a list of the strain names for the sequences identified as anomalous (either frozen in time or recombinant) in Appendices 1 and 2 of Krasnitz et al, 2008.JDB_Anomalies.txt
is a list of strain names that appear to be anomalous based on their strong violation of the molecular clock. These sequences are probably mis-annotated in the database, come from non-standard lineages (for examples a swine H3N2 isolated from humans), are reassortants, are extensively lab adapted and so deviate from the molecular clock due to large numbers of adaptation mutations, or are simply valid H1N1 sequences that don’t fall into the 1918 descended human or swine lineages.mutationrates.txt
is a file giving the nucleotide mutation rates for the influenza polymerase as described in Bloom 2014. Specifically, in that reference the mutation rates were determined as:- AG, TC : \(2.4 \times 10^{-5}\)
- GA, CT : \(2.3 \times 10^{-5}\)
- AC, TG : \(9.0 \times 10^{-6}\)
- CA, GT : \(9.4 \times 10^{-6}\)
- AT, TA : \(3.0 \times 10^{-6}\)
- GC, CG : \(1.9 \times 10^{-6}\)
There are six experimentally determined mutation rates, but in order for the overall substitution model to be reversible, the mutation rates must satisfy \(R_{C \rightarrow T} = \frac{R_{A \rightarrow G} \times R_{C \rightarrow A}}{R_{A \rightarrow C}}\) as described in the phyloExpCM documentation to
phyloExpCM_OptimizeDetectSelectionHyphy.py
. To satisfy this constraint will conforming to the experimentally measured values, we set the rates for A to C and C to A mutations equal to their average, and the same for A to G and G to A. This seems justified as these pairs of rates are nearly equal in the experimentally determined values. This gives the contents ofmutationrates.txt
:AC 9.2e-6 AG 2.35e-5 AT 3.0e-6 CA 9.2e-6 CG 1.9e-6
The site-specific amino-acid preferences for the WSN HA as inferred using mapmuts and described as part of the mapmuts analysis of the WSN HA from 2014. These are the preferences inferred with the mapmuts script
mapmuts_inferpreferences.py
. The specific files are:combined_amino_acid_preferences.txt
: combined data from all three experimental replicates.replicate_1_amino_acid_preferences.txt
: data from first experimental replicate.replicate_2_amino_acid_preferences.txt
: data from second experimental replicate.replicate_3_amino_acid_preferences.txt
: data from third experimental replicate.
./PDB_structure/
: subdirectory containing the results of analyzing the crystal structure of HA. Specific input files in this directory:1RVX.pdb
is the crystal structure 1RVX (downloaded from the Protein Data Bank) of the HA from A/PR/8/1934 (H1N1). This is the crystallized HA closest in identity to the WSN HA.1RVX_trimer.pdb
is a version of1RVX.pdb
that has been manually edited to contain only one trimer (chains A through F). The original1RVX.pdb
contains two trimers in the asymmetric unit.1RVX_trimer_renumbered.pdb
is a version of1RVX_trimer.pdb
that has been renumbered with PDB Goodies and so manual touch-up so that the numbering matches sequential 1, 2, ... numbering of the WSN HA sequence. The HA1 and HA2 chains have been merged into single chains (A, B, and C) and these chains are now numbered so that residue r corresponds to residue r in sequential numbering of the WSN HA sequence. Overall, this renumbered PDB contains 483 residues in each of the three HA chains, of which 438 are exact matches with the corresponding position in the WSN HA protein.1RVX_trimer_renumbered.dssp
is a text file containing the results of running the DSSP webserver on1RVX_trimer_renumbered.pdb
to determine the secondary structure and solvent accessibility of residues. The residues in this file match sequential 1, 2, ... numbering of the WSN HA protein sequence.sequential_to_H3.txt
is a file that converts between the sequential 1, 2, ... numbering of the WSN HA and the H3 numbering scheme as used in PDB 4HMG. This conversion was done using HA_numbering. The file contains comma-separated lines with the first entry giving the sequential number and the second giving the H3 number. For the H3 number, numbers alone indicate the HA1 chain and numbers followed by HA2 indicate the HA2 chain.
Running the analysis¶
The analysis in this directory can be repeated simply by running the master script run_analysis.py
from the command line with:
python run_analysis.py
or running it using sbatch with:
sbatch run_analysis.sbatch
Other than the files delineated under Input files, all data files in this directory can be re-generated with this command. However, the analysis approach does involve a few manual steps with Path-O-Gen and FigTree to create nicely formatted and annotated tree images from the raw tree files produced by the script. These manual steps are indicated below.
There is a variable called use_existing_output that is hardcoded into the beginning of this script. This variable is currently set to True, which means that if output already exists it is not overwritten and regenerated. If you want to regenerate the output (such as after changing input files or altering analysis parameters), then either manually delete the old output or change use_existing_output to False to force the script to overwrite old output. This script does not automatically test for dependencies among files, so if you change a file but keep use_existing_output to True, then you need to manually remove any old files with dependencies.
Note that run_analysis.py
uses multiple CPUs to speed up some of the analyses. If you don’t have as many CPUs on your system, you might want to check that you are not overloading the system – you can avoid this by using fewer CPUs, which will require you to manually modify the run_analysis.py
script.
Note that run_analysis.py
uses sbatch by default to run some of its subsidiary programs – if you don’t want to do this, set the use_sbatch option to False within the program.
Steps in the analysis¶
Here are the steps in the analysis. The command for each step is shown, but most of these commands (with the exception of some manual formatting of tree images with Path-O-Gen and FigTree) are run automatically by the main run_analysis.py
script if you are using that.
Construction of sequence sets¶
The first step is construction of alignment of H1 HA sequences. The ultimate set of aligned sequences are in aligned_HAs.fasta
.
Parsing and alignment of sequences¶
The alignment is done by the Python script parse_H1_HumanSwine.py
using the command:
python parse_H1_HumanSwine.py
Briefly, the script functions as follows:
All sequences in
H1_HAseqs.fasta
are read, and any sequences defined by anomalous (are listed inJVI_82_8947_Anomalies.txt
orJDB_Anomalies.txt
) are removed, as are sequences with ambiguous nucleotides, sequences that are not at least 1695 nucleotides in length, or sequences with headers that cannot be parsed.Of the remaining sequences, only the following are retained:
- Human H1N1 from 1918 to 1957, and from 1977 to 2014.
- Swine H1N1 from any year.
For these retained sequences, a maximum of three sequences per host and year. The sequences that are retained are the three unique sequences that have the most occurrences in
H1_HAseqs.fasta
in that host / year – the idea here is to avoid picking sequences with outlier mutations. If there are multiple sequences with equal numbers of occurrences, one is chosen randomly.The sequences are pairwise aligned with EMBOSS needle to the WSN HA (
WSN-HA.fasta
), and any gaps relative to this reference sequence are removed. The retained sequences are written toaligned_HAs.fasta
.
Molecular clock analysis for anomalous and off-lineage sequences¶
To identify and remove any anomalous sequences in aligned_HA.fasta
, an analysis is performed to make sure the sequences conform reasonably well to the molecular clock. Sequences that don’t conform could be misannotated, lab adapted, or (most frequently) are genuine H1N1 that are not from the major lineage (for example, European swine H1N1 or the 1976 Fort Dix swine flu in humans).
All sequences written by parse_seqs.py
to aligned_HAs.fasta
have their date pre-pended.
Human seasonal H1N1 reappeared in 1977 and persisted through 2008. The appearance was due to revival of the human H1N1 from about 1954. The dates for these sequences therefore need to have 24 years substracted in order to conform to a molecular clock (see dos Reis et al, 2009) – this is done by parse_seqs.py
An analysis for concordance with the molecular clock is performed using RAxML and Path-O-Gen. Specifically:
Build a RAxML tree (no date stamping) to allow visual inspection for possible outliers, and write this tree to the
RAxML_output
subdirectory:raxmlHPC-SSE3 -w ./RAxML_output -n aligned_HAs -p 1 -m GTRCAT -s aligned_HAs.fasta
The tree was then visually inspected using Path-O-Gen and FigTree, and clear outliers from the molecular clock were removed by adding them to
JDB_Anomalies.txt
and re-running the analysis.This was repeated until there were no further outliers.
In addition, sequences that appeared to be one-off zoonotic transfers (i.e. a single avian virus in an otherwise swine lineage, etc) were also added to
JDB_Anomalies.txt
.
Construction of phylogenetic trees¶
High-quality phylogenetic trees of the sequences in aligned_HAs.fasta
were constructed using codonPhyML.
The phyloExpCM script phyloExpCM_runcodonPhyML.py
(see phyloExpCM documentation for details) was used to run codonPhyML.
Two trees were built with two different codon substitution models:
The GY94 codon model (Goldman and Yang 1994) was used with the following setting:
- The equilibrium codon frequencies estimated emprically using the CF3x4 method.
- A single transition / transversion ratio (kappa) estimated by maximum likelihood.
- The dN/dS ratio (omega) drawn from four discrete gamma-distributed categories (Yang 1994) with the shape parameter and mean estimated by maximum likelihood.
The tree construction was performed in the
./CodonPhyML_Tree_H1_HumanSwine_GY94/
subdirectory of the example subdirectory. The tree itself is in the file./CodonPhyML_Tree_H1_HumanSwine_GY94/codonphyml_tree.newick
.The KOSI07 codon model (Kosiol et al 2007) was used with the following setting:
- The equilibrium codon frequencies estimated emprically using the F method.
- A single transition / transversion ratio (kappa) estimated by maximum likelihood.
- The dN/dS ratio (omega) drawn from four discrete gamma-distributed categories (Yang 1994) with the shape parameter and mean estimated by maximum likelihood.
The tree construction was performed in the
./CodonPhyML_Tree_H1_HumanSwine_KOSI07/
subdirectory of the example subdirectory. The tree itself is in the file./CodonPhyML_Tree_H1_HumanSwine_KOSI07/codonphyml_tree.newick
.
To visualize the tree, the following manual steps were performed:
The
codonphyml_tree.newick
tree in each of the two subdirectories was opened with Path-O-Gen, the tree was re-rooted according to the dates assigned to the tips, and the re-rooted tree was saved with the namerooted_tree.trees
.The
rooted_tree.trees
file was opened in FigTree, adjusted for attractive visual display, and saved to the fileannotated_tree.trees
.A PDF image of the tree in the
annotated_tree.trees
file was saved manually with FigTree, and then converted to a JPG with ImageMagick convert using the command:convert -density 300 annotated_tree.pdf annotated_tree.jpg
The tree images created by these steps are shown below. Note that these trees are extremely similar regardless of the whether the GY94 or KOSI07 model is used to infer the topology.
Computing evolutionary equilibrium frequencies¶
The expected evolutionary frequencies are different than the site-specific amino-acid preferences due to the structure of the genetic code (for example, lysine has six more codons than methionine, and so will be six times as abundant if the preferences are equal) as well as the mutation rates when these are non-symmetric.
These expected evolutionary equilibrium frequencies are computed using the phyloExpCM script phyloExpCM_builHyphyExpCM.py
, and are in the subdirectory ./evolutionary_frequencies/
.
Below is a logo plot showing these frequencies for sequential 1, 2, ... numbering of the WSN HA:
Correlations between preferences and evolutionary conservation¶
A very crude analysis is to examine the correlation between the amino-acid preferences and the frequency with which residues are present in the natural sequences that compose the phylogenetic trees shown in Construction of phylogenetic trees. The reason why this is a crude analysis is that evolution on this timescale is not ergodic, so we wouldn’t expect residues to be sampled in the trees according to their preferences even if those preferences were exactly correct. So the better analysis is really that done in the next section (Analysis of tree with various substitution models) which takes into account phylogeny. Nonetheless, it is simple and easy to compare the preferences and frequencies in the natural sequence alignments. This analysis is done in the subdirectory ./frequency_correlations/
.
First, the phyloExpCM script phyloExpCM_FreqsFromAlignment.py
is used to calculate the frequencies in the natural alignment (H1_HumanSwine_alignment.fasta
): these frequencies are reported in ./frequency_correlations/H1_HumanSwine_alignment_frequencies.txt
.
Next, the mapmuts script mapmuts_preferencescorrelate.py
is used to correlate these natural frequencies the amino-acid preferences. The results are plotted below. Although the correlation is far from perfect, it is substantial and highly significant.
Analysis of tree with various substitution models¶
The tree topologies were analyzed with experimentally determined and “conventional” codon substitution models.
For each of these substitution models, the tree topology was fixed but the branch lengths and model parameters were optimized to give the maximum likelihood. These optimizations were done using HYPHY run by the phyloExpCM script phyloExpCM_optimizeHyphyTree.py
(for conventional substitution models) or phyloExpCM_ExpModelOptimizeHyphyTree.py
(for experimentally determined models).
The conventional models used were variants of the GY94 codon model (Goldman and Yang 1994) and KOSI07 (Kosiol et al 2007) modes. For GY94, the CF3x4 method was used to determine the equilibrium codon frequencies (9 free parameters). For KOSI07, the F method was used to determine the equilibrium codon frequencies (60 free parameters). In all cases, kappa (the transition-transversion ratio) was estimated as a single values by maximum likelihood. For some variants, the rate was drawn from a discrete gamma distribution (4 categories) with shape parameter estimated by maximum likelihood. The following approaches were taken to estimate omega (the nonsynonymous / synonymous ratio):
- A single value estimated by maximum likelihood.
- Four discrete gamma categories with mean and shape parameters estimated by maximum likelihood.
- A different value for each branch estimated by maximum likelihood.
The experimentally determined substitution models used either the HalpernBruno or FracTolerated methods to compute the evolutionary models – these methods are described in the phyloExpCM documentation.
The results of the optimizations for the various substitution models are in the subdirectory ./codonmodel_optimized_trees/
. Within this subdirectory are further subdirectories containing the results of optimizing the branch lengths with HYPHY using the indicated codon substitution models. Files within these subdirectories give the results of the optimizations.
Overall summaries are in the following files:
H1_HumanSwine_GY94_summary.csv
: optimizations of the tree topology estimated using GY94.H1_HumanSwine_GY94_summary.tex
: like above, but in LaTex format.H1_HumanSwine_KOSI07_summary.csv
: optimizations of the tree topology estimated using KOSI07.H1_HumanSwine_KOSI07_summary.tex
: like above, but in LaTex format.
Here are the results for the GY94 tree topology:
#Summary for tree H1_HumanSwine_GY94.
#
#SUBSTITUTION_MODEL, dAIC, LOG_LIKELIHOOD, FREE_PARAMETERS, MAXIMUM_LIKELIHOOD_PARAMETERS, EMPIRICAL_PARAMETERS
combined-FracTolerated, 0, -24088.7, 0, 0, 0
replicate_3-FracTolerated, 303.2, -24240.3, 0, 0, 0
combined-HalpernBruno, 500.6, -24339, 0, 0, 0
replicate_1-FracTolerated, 535.4, -24356.4, 0, 0, 0
replicate_3-HalpernBruno, 657.8, -24417.6, 0, 0, 0
replicate_2-FracTolerated, 876.2, -24526.8, 0, 0, 0
GY94_CF3x4_omega-global-gamma4_rates-gamma4, 882.573, -24517, 13, 4, 9
replicate_1-HalpernBruno, 983.2, -24580.3, 0, 0, 0
GY94_CF3x4_omega-global-gamma4_rates-one, 1109.67, -24631.5, 12, 3, 9
replicate_2-HalpernBruno, 1190, -24683.7, 0, 0, 0
KOSI07_F_omega-global-gamma4_rates-gamma4, 1620.45, -24834.9, 64, 4, 60
GY94_CF3x4_omega-global-one_rates-gamma4, 1859.37, -25006.4, 12, 3, 9
KOSI07_F_omega-global-gamma4_rates-one, 1883, -24967.2, 63, 3, 60
KOSI07_F_omega-global-one_rates-gamma4, 2378.83, -25215.1, 63, 3, 60
GY94_CF3x4_omega-global-one_rates-one, 2544.53, -25350, 11, 2, 9
KOSI07_F_omega-global-one_rates-one, 3039.97, -25546.7, 62, 2, 60
combined-FracToleratedrandomized, 5632.8, -26905.1, 0, 0, 0
replicate_1-FracToleratedrandomized, 6002.4, -27089.9, 0, 0, 0
replicate_3-FracToleratedrandomized, 6138.8, -27158.1, 0, 0, 0
replicate_2-FracToleratedrandomized, 6477.8, -27327.6, 0, 0, 0
combined-HalpernBrunorandomized, 7072.8, -27625.1, 0, 0, 0
replicate_1-HalpernBrunorandomized, 7795, -27986.2, 0, 0, 0
replicate_3-HalpernBrunorandomized, 7891.8, -28034.6, 0, 0, 0
replicate_2-HalpernBrunorandomized, 8494.4, -28335.9, 0, 0, 0
Here are the results for the KOSI07 tree topology:
#Summary for tree H1_HumanSwine_KOSI07.
#
#SUBSTITUTION_MODEL, dAIC, LOG_LIKELIHOOD, FREE_PARAMETERS, MAXIMUM_LIKELIHOOD_PARAMETERS, EMPIRICAL_PARAMETERS
combined-FracTolerated, 0, -24082.5, 0, 0, 0
replicate_3-FracTolerated, 304.8, -24234.9, 0, 0, 0
combined-HalpernBruno, 494.4, -24329.7, 0, 0, 0
replicate_1-FracTolerated, 534.2, -24349.6, 0, 0, 0
replicate_3-HalpernBruno, 653.2, -24409.1, 0, 0, 0
replicate_2-FracTolerated, 869.4, -24517.2, 0, 0, 0
GY94_CF3x4_omega-global-gamma4_rates-gamma4, 876.673, -24507.8, 13, 4, 9
replicate_1-HalpernBruno, 976.8, -24570.9, 0, 0, 0
GY94_CF3x4_omega-global-gamma4_rates-one, 1100.99, -24621, 12, 3, 9
replicate_2-HalpernBruno, 1180.4, -24672.7, 0, 0, 0
KOSI07_F_omega-global-gamma4_rates-gamma4, 1609.02, -24823, 64, 4, 60
GY94_CF3x4_omega-global-one_rates-gamma4, 1856.23, -24998.6, 12, 3, 9
KOSI07_F_omega-global-gamma4_rates-one, 1867.29, -24953.1, 63, 3, 60
KOSI07_F_omega-global-one_rates-gamma4, 2367.9, -25203.4, 63, 3, 60
GY94_CF3x4_omega-global-one_rates-one, 2548.27, -25345.6, 11, 2, 9
KOSI07_F_omega-global-one_rates-one, 3027.99, -25534.5, 62, 2, 60
combined-FracToleratedrandomized, 5628, -26896.5, 0, 0, 0
replicate_1-FracToleratedrandomized, 5993.6, -27079.3, 0, 0, 0
replicate_3-FracToleratedrandomized, 6138, -27151.5, 0, 0, 0
replicate_2-FracToleratedrandomized, 6475.2, -27320.1, 0, 0, 0
combined-HalpernBrunorandomized, 7069.4, -27617.2, 0, 0, 0
replicate_1-HalpernBrunorandomized, 7786.8, -27975.9, 0, 0, 0
replicate_3-HalpernBrunorandomized, 7889.2, -28027.1, 0, 0, 0
replicate_2-HalpernBrunorandomized, 8496, -28330.5, 0, 0, 0
The models are ranked in order of AIC difference from the best model. Both tree topologies give nearly identical results. These results show that:
- The experimentally determined evolutionary models using the estimates from the three replicates combined (these are expected to be the best estimates) are clearly far superior to any of the conventional substitution models. This repeats the finding in Bloom 2014 that experimentally determined evolutionary models are superior.
- The models from the individual replicates are also generally better than the conventional models, but not by as large a margin. Presumably this is because estimating the evolutionary model from just one replicate is not as accurate.
- The FracTolerated model is superior to the HalpernBruno model (this was also found to be the case in Bloom 2014).
- Randomizing the preferences among sites makes the models far worse than conventional models.