
Variant Quality Score Recalibration (VQSR)


Slides that explain the VQSR methodology as well as the individual component variant annotations can be found here in the GSA Public Drop Box.

Detailed information about command line options for VariantRecalibrator can be found here.

Detailed information about command line options for ApplyRecalibration can be found here.

Introduction

The purpose of the variant recalibrator is to assign a well-calibrated probability to each variant call in a call set. One can then create highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, and HRun, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input, typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array. This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model. The variant recalibrator contrastively evaluates variants in a two-step process:

  • VariantRecalibrator - Create a Gaussian mixture model by looking at the annotation values over a high quality subset of the input call set and then evaluate all input variants.
  • ApplyRecalibration - Apply the model parameters to each variant in the input VCF files, producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step will filter the calls based on this new LOD score by adding lines to the FILTER column for variants that don't meet the LOD threshold provided by the user (with the ts_filter_level parameter).
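
To make the VQSLOD mentioned above concrete, here is a rough sketch of the quantity being reported (not a command-line option; "a" stands for the vector of annotation values for a given variant):

    VQSLOD = log( P(a | model of true variants) / P(a | model of artifacts) )

Positive values mean the variant's annotation profile looks more like the training (true) variants than like the error model; large negative values mean the opposite.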

Recalibration tutorial with example HiSeq, single sample, deep coverage, whole genome call set

By way of explaining how to use the variant quality score recalibrator and evaluate its performance, we have put together this tutorial, which uses example sequencing data produced at the Broad Institute. All of the data used in this tutorial is available in VCF format from our GATK resource bundle.

Input call set

input: NA12878.HiSeq.WGS.bwa.cleaned.raw.b37.subset.vcf

  • These calls were generated with the UnifiedGenotyper from a modern, single-sample, 30x coverage HiSeq run. They were randomly downsampled to keep the file size small, but in general one would want to use the full set of variants available genome-wide for this procedure. No other pre-filtering steps were applied to the raw output.

Training sets

HapMap 3.3: hapmap_3.3.b37.sites.vcf

  • These high quality sites are used both to train the Gaussian mixture model and then again when choosing a LOD threshold based on sensitivity to truth sites.
  • The parameters for these sites will be: known = false, training = true, truth = true, prior = Q15 (96.84%)

Omni 2.5M chip: 1000G_omni2.5.b37.sites.vcf

  • These polymorphic sites from the Omni genotyping array are used when training the model.
  • The parameters for these sites will be: known = false, training = true, truth = false, prior = Q12 (93.69%)

dbSNP build 132: dbsnp_132.b37.vcf

  • The dbSNP sites are generally considered not to be of high enough quality to be used in training, but here we stratify output metrics such as the Ti/Tv ratio by presence in dbSNP (known sites) or absence (novel sites).
  • The parameters for these sites will be: known = true, training = false, truth = false, prior = Q8 (84.15%)

The default prior for all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the UnifiedGenotyper is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.
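
The percentages quoted alongside each prior are simply the Phred-scaled Q values converted to probabilities, P(true) = 1 - 10^(-Q/10). For example:

    Q15: 1 - 10^(-1.5) = 96.84%
    Q12: 1 - 10^(-1.2) = 93.69%
    Q8:  1 - 10^(-0.8) = 84.15%
    Q2:  1 - 10^(-0.2) = 36.90%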

VariantRecalibrator

Detailed information about command line options for VariantRecalibrator can be found here.

Build a Gaussian mixture model using a high quality subset of the input variants and evaluate those model parameters over the full call set. The following notes describe the appropriate inputs to use for this tool.

  • Note that this walker expects call sets in which each record has been appropriately annotated (see e.g. VariantAnnotator). Input call set ROD bindings must start with "input". See the command line below.
  • When constructing an initial call set (see e.g. Unified Genotyper or Haplotype Caller) for use with the Recalibrator, it's generally best to turn down the confidence threshold to allow more borderline calls (trusting the Recalibrator to keep the real ones while filtering out the false positives). For example, we often use a Q20 threshold on our deep coverage calls with the Recalibrator (whereas the default threshold in the UnifiedGenotyper is Q30).
  • No pre-filtering is necessary when using the Recalibrator. See below for the advanced options which allow the user to selectively ignore certain filters if they have already been applied to your call set.
  • The tool accepts any ROD bindings when specifying the set of truth sites to be used during modeling. Information about how to download VCF files which we routinely use for training is in the FAQ section at the bottom of the page.
  • Each training set ROD binding is specified with key-value tags to qualify whether the set should be considered as known sites, training sites, and/or truth sites. The prior probability of being true for those sites is also specified via these tags, in Phred scale. See the command line below for an example. How each of the training sets is used by the algorithm:
  • Training sites: Input variants which are found to overlap with these training sites are used to build the Gaussian mixture model.
  • Truth sites: When deciding where to set the VQSLOD cutoff, sensitivity to these truth sites is used. Typically one might want to say, for example, "I dropped my threshold until I got back 99% of HapMap sites."
  • Known sites: The known / novel status of a variant isn't used by the algorithm itself and is only used for reporting / display purposes. The output metrics are stratified by known status in order to aid in comparisons with other call sets.
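
For reference, a VariantRecalibrator command line consistent with the tutorial inputs and resource tags described above might look like the sketch below. The reference path and the path/to/output.* names are placeholders, and exact option names should be checked against the documentation for your GATK version:

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/human_g1k_v37.fasta \
   -input NA12878.HiSeq.WGS.bwa.cleaned.raw.b37.subset.vcf \
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
   -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 dbsnp_132.b37.vcf \
   -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an HRun \
   -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -rscriptFile path/to/output.plots.R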

Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (in the above command line the report will appear as path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

[Figure: Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of the mapping quality rank sum test versus HaplotypeScore, marginalizing over the other annotation dimensions in the model.]

On each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of high quality while red areas show the lowest probability areas. In general, putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!

Tranches and the tranche plot

The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The first tranche is exceedingly specific but less sensitive, and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can select more specific or more sensitive call sets in a principled way, or incorporate the recalibrated quality scores directly, weighting individual variant calls by their probability of being real rather than analyzing only a fixed subset of calls. An example tranche plot, automatically generated by the VariantRecalibrator walker, is shown on the right.

[Figure: Tranches plot for the example HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics: the novel transition-to-transversion ratio and the overall truth sensitivity.]

Ti/Tv-free recalibration

We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:

  • The truth sensitivity (TS) approach gives you back the novel Ti/Tv as a QC metric [YES!]
  • The truth sensitivity (TS) approach is conceptually cleaner than deciding on a novel Ti/Tv target for your dataset
  • The TS approach is easier to explain and defend, as saying "I took called variants until I found 99% of my known variable sites" is easier than "I took variants until I dropped my novel Ti/Tv ratio to 2.07"

We have used HapMap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other high-quality sets of sites (~99% truly variable in the population) should work just as well. In our experience with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features like being close to indels or actually being MNPs, and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.

ApplyRecalibration

Detailed information about command line options for ApplyRecalibration can be found here.

Using the tranche file generated by the previous step, the ApplyRecalibration walker looks at each variant's VQSLOD value and decides which tranche it falls in. Variants in tranches that fall below the specified truth sensitivity filter level have their FILTER field annotated with the corresponding tranche level. This results in a call set that is filtered to the desired level but also retains the information necessary to pull out more variants at a slightly lower quality level.
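
A matching ApplyRecalibration command line, continuing the sketch from the VariantRecalibrator section above (again with placeholder paths, and with 99.0 as an example truth sensitivity filter level), might look like:

java -Xmx3g -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R path/to/human_g1k_v37.fasta \
   -input NA12878.HiSeq.WGS.bwa.cleaned.raw.b37.subset.vcf \
   --ts_filter_level 99.0 \
   -tranchesFile path/to/output.tranches \
   -recalFile path/to/output.recal \
   -o path/to/output.recalibrated.filtered.vcf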

Frequently Asked Questions

How do I know which annotations to use for my data?

The five annotation values provided in the command lines above (QD, HaplotypeScore, MQRankSum, ReadPosRankSum, and HRun) have been shown to give good results for a variety of data types. However, this shouldn't be taken to mean that these annotations give the absolute best modeling for every source of sequencing data. Better results could possibly be achieved through experimentation with which SNP annotations are used in the algorithm. The goal is to find annotation values which are approximately Gaussian-distributed and also serve to separate the probably true (known) SNPs from the probably false (novel) SNPs.

How do I know which -tranche arguments to pass into the VariantRecalibrator step?

The -tranche arguments' main purpose is to create the tranche plot (as shown above). They are meant to convey the idea that with real, calibrated variant quality scores one can create call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority then one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way an end user can choose to use some of the filtered records or only use the PASSing records. For users new to the variant quality score recalibrator, perhaps the easiest thing to do in the beginning is simply to select the single desired false discovery rate and pass that value in as a single -tranche argument, to make sure that the desired rate can be achieved given the other parameters to the algorithm.
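
For example, one could request several tranche levels to see the full sensitivity/specificity tradeoff (the values below are illustrative, not prescriptive):

   -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0

or just the single target level you intend to filter at:

   -tranche 99.0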

What should I use as training data?

The VariantRecalibrator step accepts lists of truth and training sites in several formats (dbSNP ROD, VCF, and BED, for example). Any list can be used, but it is best to use only those sets which are of the highest quality. The truth sets are passed into the algorithm using any ROD binding name, and their truth or training status is specified with ROD tags (see the VariantRecalibrator section above). We routinely use the HapMap 3.3 VCF file and the Omni 2.5M SNP chip array for training the model. In general the false positive rate of dbSNP sites is too high for them to be used reliably for training the model.

HapMap v3.3 as well as the Omni validation array VCF files are available in our GATK resource bundle.

Does the VQSR work with non-human variant calls?

Absolutely! The VQSR accepts any list of sites to use as training / truth data, not just HapMap.

Don't have any truth data for your organism? No problem. There are several things one might experiment with. One idea is to first do an initial round of SNP calling and only use those SNPs which have the highest quality scores. These sites, which have the most confidence, are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try using several SNP callers, of which the GATK is one, and use those sites which are concordant between the different methods as truth data. There are many fruitful avenues of research here. Hopefully the model reporting plots help facilitate this experimentation. Perhaps the best place to begin is to use a line like the following when specifying the truth set:

-resource:concordantSet,known=true,training=true,truth=true,prior=10.0 path/to/concordantSet.vcf

Can I use the variant quality score recalibrator with my small sequencing experiment?

This tool expects thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties. One piece of advice is to turn down the number of Gaussians used during training and to turn up the number of variants that are used to train the negative model. This can be accomplished by adding --maxGaussians 4 --percentBad 0.05 to your command line.
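
In practice that just means appending the two arguments quoted above to the VariantRecalibrator command sketched earlier, for example (the "..." stands for the rest of that command):

   java -jar GenomeAnalysisTK.jar -T VariantRecalibrator ... --maxGaussians 4 --percentBad 0.05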

Why don't all the plots get generated for me?

The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command line version of R that gets installed alongside R itself. We also make use of the ggplot2 library, so please be sure to install that package as well.
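
If you are unsure whether your environment is set up correctly, a quick check with standard shell and R commands (not GATK-specific) can help:

   which Rscript                      # should print a path; if not, add R's bin directory to your PATH
   Rscript -e 'library(ggplot2)'      # fails with an error if ggplot2 is not installed
   Rscript -e 'install.packages("ggplot2", repos="http://cran.r-project.org")'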

