Primary Functions
call_mutations()
Overview
Description: The
call_mutations()
function reads in the variant calls from the vcf file, translates the associated amino acids (for mutations within genes), and indentifies any variants relative to the reference.Common Options: Required options to
call_mutations()
include sample.dir, which is the path to a directory storing one or more vcf files (one for each sample to analyze). There should be no other files in this directory. Also required is information on the reference genome, which can be passed using the combination of fasta.genome and bed options. Alternatively, if working with SARS-CoV-2, the reference option can be set to “Wuhan” to use a pre-formatted reference. To report information (primarily sequencing depth) on all mutations of interest and not just those that are observed in the samples, the write.all.targets option can be set to TRUE, and the lineage-associated mutations file including the optional “Chr” and “Pos” columns (see above) must be specified with lineage.muts.Output: Mutation data are reported in a data frame that by default contains the columns SAMP_NAME, CHR, POS, ALT_ID, AF, & DP, though a number of additional columns can be included (see “Output” section below, and out.cols argument to
call_mutations()
). This object (data frame) can also be written to a file, which is especially relevant if you’re analyzing a large number of samples or if the genome is large (causing longer run time forcall_mutations()
; see write.mut.table). The output fromcall_mutations()
is used as input for the functionsexplore_mutations()
orestimate_lineages()
.
Methods
The call_mutations()
function uses one or more vcf files as input, along with a MixviR reference object (data frame) and creates a data frame/table that stores all mutations identified in the sample(s), along with a customizable set of associated information about each mutation. This data frame can be written to a file, and/or saved as an object in the global environment. In either case it is used as input to the explore_mutations()
or estimate_lineages()
functions.
call_mutations()
first obtains the MixviR reference object (Fig 6), which is created as part of the run if the fasta.genome and bed options are provided. Alternatively, if analyzing SARS-CoV-2, the reference option can be set to “Wuhan” and a pre-constructed reference will be used. In the case of overlapping genes, positions will be duplicated in this reference object, with a separate entry for each gene the nucleotide position is associated with.
MixviR then reads in the set of files to be analyzed. These should be the only files stored in the directory given by the sample.dir option. In most cases, samples will be provided in variant call format (vcf). These vcf files need to include the DP and AD flags in the FORMAT field. Relevant information from each vcf file is extracted with functionality from vcfR (Knaus and Grünwald, 2017). If the write.all.targets option will be used to report sequencing depths for genomic positions associated with a priori-defined mutations that don’t occur in the sample, all positions should be included in the input vcf file(s). Otherwise, only variant positions need to be included.
MixviR loops over the set of input files, sequentially calling mutations from each and appending them to a master data frame that stores all mutations. The process of calling mutations for each sample includes several steps. The overall sequencing depths at each position in the input file are first added to the reference object (Fig. 7, column ‘DP’; note that all objects shown in Figs 7-10 are temporary objects created during a MixviR analysis and are not directly available to the user). Depths are ‘NA’ for any positions not in the vcf input file.
Variable sites in the sample (sites with an entry in the ALT column of the input file) are then extracted, the frequency of the ALT allele (AF) is calculated for each, and the set of variants is filtered to retain those that exceed the value of min.alt.freq. This filtering step is intended to remove low-frequency sequencing noise and/or PCR artifacts from the data (see note on min.alt.freq in section Cautions/Important Considerations below). Deletions are represented as strings >1 bp in the REF column (Fig 8; position 11282 is a 9-bp deletion) and insertions are represented as strings >1 bp in the ALT column (Fig 8; 1 bp insertion at position 19983).
Next, MixviR checks the set of sample variants to determine if there are any genomic positions where more than one unique variant occurs in the (potentially mixed) sample. If not, mutations are characterized in two steps, first by identifying those based on single nucleotide polymorphism (SNP) variation, and subsequently by identifying mutations arising from insertions or deletions. Each of these steps is described below. In the event one or more positions have multiple mutations, the duplicated sites are first removed and stored. Mutations are initially called from the first set of variants (all variants without duplicated positions and the first variant at each duplicated site), and the process is iterated on the stored set of duplicated variants until no duplicated sites remain.
SNP-Based Mutation Identification
For mutations based on SNP variation, the full set of variants identified in the sample is filtered to retain just SNP-based variants (single nucleotide changes, no indels). These are then joined with the reference object, replacing the reference base with any alternate alleles observed in the sample to create a “sample genome” (in other words, the reference genome with SNP variants substituted in to their respective position). This updated sample genome is then translated using functionality from Biostrings (Pagès et al 2019) to get the sample amino acids. Mutations resulting in amino acid substitutions are retained and named in the form “S_D614G”, which would indicate a substitution of ‘G’ for the original ‘D’ at amino acid position 614 of the ‘S’ gene. Subsequently, all SNP variants that don’t result in an amino acid change, including synonymous changes or variants outside of genes, are identified and named in the form Chr1_500A->T, which would represent a mutation from ‘A’ to ‘T’ at nucleotide position 500 of Chromosome 1. The example below (Fig 9) shows one SNP in a non-genic region, one synonymous and one non-synonymous SNP-based mutation in the ‘ORF1a’ gene, and one synonymous and one non-synonymous SNP-based mutation in the ‘S’ gene.
Indel-Based Mutation Identification
Indel-based mutation identification is performed separately for insertions and deletions, and for each, separately for in-frame and frameshift indels (defined as indels for which the number of nucleotides gained or lost is an even multiple of 3 or not, respectively, regardless of whether the event occurred in a gene). MixviR uses the following rules for naming indels…
In-frame deletions
In-frame deletions (even multiples of 3bp) are indicated with ‘del’.
Examples: S_del144/144 (genic); ORF1a_del3675/3677 (genic); NC-045512.2_del23121/3bp (non-genic)
Genic: Deletions starting at a first codon position of a gene are named starting with the first AA deleted and ending with the last. Those that start at a 2nd or 3rd codon position are named with the downstream (3’)-most AA affected in the case of deletion of a single AA, or, if multiple AA’s are deleted, naming starts with the AA associated with the first full codon deleted, and extends to the last codon/AA affected. Below is an example of the latter scenario from the S gene of SARS-CoV-2, in which 6 bases (AGTTCA) are deleted from positions 22029-22034 (beginning at codon position 2 of amino acid 156 of the S gene).
The original sequence, broken into its codons and starting at position 22025, is…
Ref Sequence: AGT GAG TTC AGA GTT
Ref Amino Acids: S E F R V
Ref Amino Acid Position: 155 156 157 158 159
The corresponding deleted sequence is…
Deletion Sequence: AGT GGA GTT
Deletion Amino Acids: S G V
There is some ambiguity regarding how to name this event. It could be named “S_del156/157”, with a resulting substitution of “R158G”. Alternatively, it could be named “S_del157/158” with a resulting substitution of “E156G”. As described above, MixviR numbers the deletion beginning with the amino acid corresponding to the first full codon deleted - in this case “S_del157/158”, and it does not call the substitution, as both the deletion and substitution represent just one mutational (evolutionary) event.
- Non-genic: In-frame deletions occurring outside of genes are named with the chromosome, position (nucleotide) along the chromosome of the first deleted base, and the length of the deletion in bp.
Out-of-frame deletions
Out-of-frame deletions (not even multiples of 3bp deleted) are indicated with ‘Fdel’.
Examples: ORF1a_Fdel2454/7bp (genic); NC-045512.2_del23121/2bp (non-genic)
Genic: Named with the amino acid position within the gene where the first base is deleted. This is followed by the length of the deletion in bp.
Non-genic: Named with the chromosome, position (nucleotide) along the chromosome of the first deleted base, and the length of the deletion in bp.
In-frame insertions
In-frame insertions (even multiples of 3bp) are indicated with ‘ins’.
Examples: S_ins214/216 (genic); NC-045512.2_ins23121/3bp (non-genic)
Genic: Named with the amino acid position within the gene where the first base is inserted, followed by the (new) position of the last amino acid inserted.
Non-genic: Named with the chromosome, position (nucleotide) along the chromosome of the first inserted base, and the length of the insertion in bp.
Out-of-frame insertions
Examples: S_Fins649/1bp (genic); NC-045512.2_Fins23121/1bp (non-genic)
Out-of-frame insertions (not even multiples of 3bp deleted) are indicated with ‘Fins’.
Genic: Named with the amino acid position within the gene where the first base is inserted. This is followed by the length of the insertion in bp.
Non-genic: Named with the chromosome, position (nucleotide) along the chromosome of the first inserted base, and the length of the insertion in bp.
Indel examples are given in Figure 10, which shows a one-bp deletion within ORF1a at position 4947 of chromosome NC045512.2, a deletion of amino acid positions 3674-3676 in the ORF1a gene, a 1bp insertion within amino acid position 2173 of ORF1b, and a deletion of amino acid position 212 of the S gene.
Nonsense Mutation Identification
Nonsense mutations resulting in a premature stop codon are designated with an asterisk (i.e. ORF1a_R718*).
Output
The full set of mutations identified (SNP and indel-based) are joined to create a data frame that serves as the output returned by call_mutations()
. Any of the following columns can be included in this data frame (defined by the out.cols option):
- CHR: Chromosome.
- POS: Genomic position along the chromosome.
- GENE: Gene, if site is associated with a gene, otherwise listed as “non-genic”.
- STRAND: If part of a gene, strand the gene is on (+, -, .).
- REF_CODON: If in a gene, reference codon the position is part of.
- REF_AA: If in a gene, identity of the reference amino acid the position is associated with.
- GENE_AA_POSITION: If in a gene, amino acid position within the gene the nucleotide position is associated with.
- REF: Reference nucleotide, or if representing a deletion, the reference nucleotide plus deleted bases.
- ALT: Alternate nucleotide, or if representing an insertion, the reference nucleotide plus inserted bases.
- AF: Frequency of the alternate allele (mutation) in the sample. Calculated by dividing ALT_COUNT by DP.
- ALT_COUNT: number of sequencing reads associated with the alternate (mutant) allele in the sample.
- SAMP_CODON: If part of a gene, the codon arising from the alternate (mutant) allele in the sample, otherwise ‘NA’.
- SAMP_AA: If part of a gene, the amino acid arising from the alternate (mutant) allele in the sample, otherwise ‘NA’.
- ALT_ID: Identifier of the mutation in the sample.
- DP: Total sequencing depth at the genomic position associated with the mutation.
- SAMP_NAME: Sample name, which is the name of the input file, unless the name.sep option is used in call_mutations() to retain just a portion of the name.
- TYPE: The type of mutation identified. One of SNP-Syn (synonymous SNP within a gene), SNP-Nsyn (non-synonymous SNP within a gene), “SNP-Nongenic” (SNP outside of a genic region), Indel-Genic (insertion or deletion within a gene), Indel-Nongenic (insertion or deletion outside of a genic region), or “Unobserved Target”, which will only appear for the set of target mutations that don’t appear in the sample if the write.all.targets option is set to TRUE.
This final data frame can be assigned as an object in the global environment, and/or written to a file if write.mut.table is set to TRUE. By default, the columns included in the output are SAMP_NAME, CHR, POS, GENE, ALT_ID, AF, and DP. This data frame serves as required input to the explore_mutations()
and estimate_lineages()
functions, and must include at least the set of default columns to run either of these subsequent functions.
explore_mutations()
Summary
The explore_mutations()
function uses the output from call_mutations()
and opens an interactive RShiny dashboard that allows you to explore the data. The dashboard will have from 1 to 4 tabs at the top, depending on what combination of optional input files (if any) are provided…
Available Tabs
Lineages Present: Available if the lineage/mutation file is provided with the lineage.muts option. The top plot represents the proportion of “lineage-characteristic mutations” present in a sample. These “lineage-characteristic mutations” are the set of mutations from the lineage.muts file that occur only in the selected lineage. In other words, MixviR looks through the set of mutations provided and removes any that are shared by more than one lineage. The remaining mutations are used to generate the plots in this tab. The “Presence Threshold” slider on the left allows the user to set a threshold for the proportion of such mutations required to consider a lineage as present in the sample. For each lineage identified as present (exceeding the threshold), the frequences of the lineage-characteristic mutations that occur in the sample are used to estimate a relative frequency of the lineage in the sample - these estimated relative frequencies are shown in the bottom plot, and can be generated with either the median or mean values (specific metric can be selected in left panel). Note that while biologically the sums of these estimated proportions can’t exceed 1, there is no constraint with the way the raw estimates are calculated, and occasionaly the bars will exceed 1. When this happens, it may mean that at least one mutation that was identified and used as a “lineage-characteristic” mutation based on the provided list is, in reality, shared among two or more lineages. In cases where the sum exceeds 1, the default behavior is to proportionally scale the estimates down to sum to 1 (controlled by the “scale” option in the left panel). Additional details associated with each plot are provided as tooltips by hovering over plot features/regions.
New Mutations: Present if a “location/dates” file is provided with the dates option. Table that lists all mutations first observed (across the entire dataset) on or after the selected date. So, if the mutation “S_D614G” was not observed in any sample before 6/20/2021, was observed in a single sample on that date, and then continued to be observed in multiple samples after that date, that mutation would show up in the New Mutations table for the date 6/20/21, and for any date prior to that, but would not show up if 6/21/21 or any later date is selected. If the lineage.muts option is defined, a column is added to the table that includes all lineages the mutation is associated with.
Mutation Frequencies: Present if a location/dates file is provided with the dates option. Plots show estimated frequencies of a specific mutation(s) over time for one or more samples. Mutations should be entered in the form “S_D614G” for a substitution, or, for indels, “S_del144/144” for single amino-acid mutations, or “S_del143/145” for a multi-amino acid deletion. Multiple mutations can be entered by separating them with a comma. Multiple samples can be selected and will be distinguished by color on the plots. If more than one mutation is entered, each mutation is displayed as a separate facet on the plot. If in doubt about the naming convention of mutations, check the samp_mutations object or the mutation names in other tabs.
View Mutations: Lists all mutations observed for the selected sample. If the lineage.muts option is defined, a column is added to the table that includes all lineages the mutation is associated with. If write.all.targets was set to TRUE in
call_mutations()
, all mutations in the lineage.muts file are included, even if no associated reads were observed (sequencing depths at underlying genomic positions are reported, while the reported observed frequency will be zero). This table is searchable, and can be sorted by columns (SHIFT + click to sort on more than one column).
estimate_lineages()
Summary
The estimate_lineages()
function is meant to output the data from the “Variants Present” plots of explore_mutations()
in table form. You can choose to write data for all lineages analyzed or just the lineages identified as present in the sample (default) based on the presence.thresh threshold. The tables are printed to the screen and can also be written to a file. This function requires both the output from call_mutations()
and a lineage.muts file.