We have targeted sequencing data covering only two genes, which we analyzed with the GATK pipeline. Because the data set is too small for VQSR, we applied hard filtering to both SNPs and indels. We also sequenced the same sample by whole-exome sequencing and filtered the SNPs with VQSR. The VQSR results are of much better quality than the hard-filtered results. For economic reasons, we need to develop an analysis pipeline for targeted sequencing. Is it OK to merge the targeted sequencing VCF with an exome sequencing VCF and then run VQSR on the merged file? I am worried that true sites in the targeted sequencing data have different features from true sites in the whole-exome data.
How can VQSR be applied to a small data set?
Question about recalibration
Hello, I have a newly sequenced genome and some samples for this species. I would like to follow the Best Practices, but I don't have a dbSNP or anything similar. Could I use the variants from my own samples in place of dbSNP? For example, could I take the variants that are concordant across all my samples and use them as known sites?
Thanks!
Multi-sample VQSR on single-sample VCF files
Hi
We ran 100 samples through the GATK UnifiedGenotyper and then merged all the VCF files (using VCFtools) in order to run multi-sample VQSR. Which annotations should we use in this case?
For multi-sample called VCFs we use these parameters:
-an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an DP -nt 2 --maxGaussians 4 --percentBadVariants 0.05
any help is deeply appreciated.
Thanks
Saurabh
Significant difference in VQSR or VariantAnnotation between v1.6 and v2.2-10
Hi,
I observed a significant difference of the variant call sets from the same exomes between v1.6 and v2.2(-10).
In particular, the overall novel Ti/Tv dropped from around 2.6 to 2.1 in the newer call sets at a TruthSensitivity threshold of 99.0.
When I compared the variant sites of one sample using VariantEval, it showed:
Filter    JexlExpression         Novelty  nTi    nTv   tiTvRatio
called    Intersection           known    14624  4563  3.2
called    Intersection           novel    856    312   2.74
called    filterIngatk22-gatk16  known    264    132   2
called    filterIngatk22-gatk16  novel    28     18    1.56
called    gatk16                 known    3      1     3
called    gatk16                 novel    1      1     1
called    gatk22-filterIngatk16  known    258    94    2.74
called    gatk22-filterIngatk16  novel    144    425   0.34
called    gatk22                 known    2      2     1
called    gatk22                 novel    17     30    0.57
filtered  FilteredInAll          known    1344   649   2.07
filtered  FilteredInAll          novel    1076   1642  0.66
Calls that are new in v2.2 (not found in v1.6), and calls made in v2.2 but filtered in v1.6, have a novel Ti/Tv of only around 0.5 (0.57 and 0.34 respectively). So I suspect that the VQSLOD scoring (or ranking) of SNPs changed substantially, and in an unfavorable way.
The major updates in v2.2 affecting my result were BQSRv2, ReduceReads, UG and VariantAnnotation. (Too many things to pin-point the culprit...)
The previous BAM processing and variant calls were made using v1.6.
For the new call set, I used v2.1-9 for BQSRv2 and ReduceReads (so after the serious bug fix in ReduceReads; thank you for the fix) and v2.2-10 for UG and VQSR.
As a first clue, I found that the distribution of FS values changed dramatically from v1.6 (please see the attached plots). I recognize that the FS calculation was recently updated, but the previous distribution of FS values (please see attached) makes more sense to me, because the current FS values do not seem to provide information for separating true positives from false positives.
Thanks in advance.
Katsuhito
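The Ti/Tv ratios in the VariantEval output quoted in this post can be re-derived from the nTi and nTv counts. Below is a minimal Python sketch of that arithmetic; the function name `titv_ratio` and the dictionary layout are illustrative and not part of GATK.

```python
def titv_ratio(n_ti, n_tv):
    """Return transitions / transversions, or None when there are no transversions."""
    if n_tv == 0:
        return None
    return n_ti / n_tv


# (nTi, nTv) counts for novel sites, copied from the VariantEval output in the post.
novel_counts = {
    "Intersection": (856, 312),
    "gatk22-filterIngatk16": (144, 425),
    "gatk22": (17, 30),
    "FilteredInAll": (1076, 1642),
}

# Round to two decimals, as VariantEval reports the ratio.
novel_titv = {
    name: round(titv_ratio(ti, tv), 2) for name, (ti, tv) in novel_counts.items()
}

for name, ratio in novel_titv.items():
    print(f"{name}\t{ratio}")
```

For context, purely random substitutions would give a ratio near 0.5, because each base has one possible transition and two possible transversions. That is why the poster reads values of 0.34 and 0.57 as a sign of mostly false-positive calls.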
VQSR error: "The provided VCF file is malformed at... "
I am seeing this error on a single human WGS sample:
The provided VCF file is malformed at approximately line number "x": there are 557 genotypes while the header requires that 1525 genotypes be present for all records
Interestingly, when I run VQSR as part of the same pipeline on the same sample several times in a row, "x" changes to a different line number each time. Could someone explain the meaning of the error message in more detail?
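The message compares the number of genotype (sample) columns in a record with the number of sample names on the `#CHROM` header line. A minimal Python sketch of the same check, run on the VCF text independently of GATK, is shown below. The function name is hypothetical, and the sketch assumes an uncompressed, tab-delimited VCF with a FORMAT column.

```python
def find_genotype_count_mismatches(vcf_lines):
    """Yield (line_number, found, expected) for records whose number of
    sample columns differs from the number of samples on the #CHROM line."""
    expected = None
    for line_number, line in enumerate(vcf_lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Columns after the 9 fixed ones (CHROM..FORMAT) are sample names.
            expected = max(len(fields) - 9, 0)
            continue
        if expected is None:
            raise ValueError("record seen before the #CHROM header line")
        found = max(len(fields) - 9, 0)
        if found != expected:
            yield line_number, found, expected


# Tiny in-memory example: the header names two samples, the last record has one.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n",
    "1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/1\n",
]
mismatches = list(find_genotype_count_mismatches(example))
print(mismatches)
```

On a real file the same function can be fed an open file handle, for example `find_genotype_count_mismatches(open(path))`, where `path` is whichever VCF is passed to `-input` or `-resource`.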
Sensibility/sensitivity of VQSR processed VCF
Hi all,
I've read somewhere on this site that before VQSR the FP rate is expected to be around 10% (I guess for UnifiedGenotyper). Are there updated statistics after VQSR? For HaplotypeCaller? For exome/WG data?
Another thing: we apply VQSR in all our analyses, and we are trying to collect some validation statistics. We suspect that most of the FPs have particular "culprits" in VQSR (especially QD and MQ). Do you have any data about this?
Best
d
Mouse VQSR
I was wondering if anyone has used VQSR for a mouse related genome project. I am working with mm10 dbsnp and DNA-seq short insert data for multiple homozygous mouse samples. I have obtained decent results so far using the mm10 dbsnp as the training set, but was curious to see if anyone had any recommendations as to what settings to use. Any input is appreciated. I also have a lot of RNA-seq data, but that will come at a much later point in time. Thanks!
VQSR for multi-sample VCF
Hi,
I've been going through the VQSR documentation/guide and haven't been able to pin down an answer to how it behaves on multi-sample VCF (generated by multi-sample calling with UG).
Should VQSR be run on this? Or on each sample separately, given that coverage and other statistics used to determine the variant confidence score aren't the same for each sample and so can lead to conflicting determinations on different samples.
What is the best way to go about this?
Many thanks.
Does a callset derived from HaplotypeCaller need to be run through VariantAnnotator before the VQSR step?
Hi,
I just ran HaplotypeCaller on a dataset. I had previously run the same dataset through UnifiedGenotyper and passed the raw VCF from UG directly to the VQSR step, without VariantAnnotator, and VQSR completed without any problem. However, when I pass the raw callset from HaplotypeCaller directly to the VQSR step, the VQSR module complains with the error message below:
...
ERROR MESSAGE: Bad input: Values for HaplotypeScore annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See
http://gatkforums.broadinstitute.org/discussion/49/using-variant-annotator
So after HaplotypeCaller, does the resulting VCF file need to be run through VariantAnnotator? A UnifiedGenotyper callset does not need VariantAnnotator (all annotations needed for VQSR are included after UG), but that seems not to be the case for HaplotypeCaller. I can run VariantAnnotator on the HaplotypeCaller VCF file; I just want to make sure my understanding is correct.
Thanks and best
Mike
VQSR, VQSLOD, and indels
Hi Mark, Eric -
First, I wanted to thank you guys for providing advice with respect to running VQSR. I am already sold and a huge fan of the method :-).
I was wondering if either of you could comment on VQSLOD and sensitivity filter tranche?
To be more specific, if I set a filter threshold of 99% for sensitivity and VQSLOD < 0 I imagine that probably is not a good idea! However, a VQSLOD of 3 or 5 may be appropriate in the statistical sense, i.e. pretty confident that this is a real variant. Finally, I am thinking we should include VQSLOD in our statistical genetic association mapping methods. I wanted to get a sense from either of you what VQSLOD you would want to completely remove from analysis?
Best Wishes,
Manny.
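To experiment with the VQSLOD cutoffs discussed in this post, the score can be read from the INFO column of a recalibrated VCF record. The following is a minimal Python sketch. The function names are hypothetical, and the cutoffs used in the example (3 and 5) are simply the values mentioned in the question, not recommendations.

```python
def get_vqslod(info_field):
    """Return the VQSLOD value from a VCF INFO string, or None if it is absent."""
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        if key == "VQSLOD" and sep:
            return float(value)
    return None


def passes_vqslod(info_field, min_vqslod):
    """True when the record carries a VQSLOD at or above min_vqslod."""
    vqslod = get_vqslod(info_field)
    return vqslod is not None and vqslod >= min_vqslod


# Made-up INFO strings for illustration only.
scored = "AC=1;AF=0.5;VQSLOD=3.11;culprit=QD"
unscored = "AC=1;DB"

print(get_vqslod(scored))
print(passes_vqslod(scored, 3.0), passes_vqslod(scored, 5.0))
print(passes_vqslod(unscored, 0.0))
```

Records without a VQSLOD are treated as failing here; whether that is the right choice depends on how unscored sites should be handled in the downstream analysis.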
VQSR for indels
Hello,
I am running Variant Quality Score Recalibration on indels with the following command.
java -Xmx8g -jar /raid/software/src/GenomeAnalysisTK-1.6-9-g47df7bb/GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R /raid/references-and-indexes/hg19/bwa/hg19_lite.fa \
-input indel_output_all_chroms_combined.vcf \
--maxGaussians 4 -std 10.0 -percentBad 0.12 \
-resource:mills,known=true,training=true,truth=true,prior=12.0 /raid/Merlot/exome_pipeline_v1/ref/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
-an QD -an FS -an HaplotypeScore -an ReadPosRankSum \
--ts_filter_level 95.0 \
-mode INDEL \
-recalFile /raid2/projects/STFD/indel_output_7.recal \
-tranchesFile /raid2/projects/STFD/indel_output_7.tranches \
-rscriptFile /raid2/projects/STFD/indel_output_7.plots.R
My tranches file reports only false positives for all tranches. When I run VQSR on SNPs, the tranches have many true positives and look similar to other tranches files reported on this site. Has anyone had a similar experience, or does anyone have suggestions?
Thanks
Weird multi-sample [119] exome VQSR tranche plot
Hello,
I am trying to run GATK on a sample of 119 exomes. I followed the GATK guidelines to process the fastq files. I used the following parameters to call the UnifiedGenotyper and VQSR [for SNPs]:
UnifiedGenotyper
-T UnifiedGenotyper
--output_mode EMIT_VARIANTS_ONLY
--min_base_quality_score 30
--max_alternate_alleles 5
-glm SNP
VQSR
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 /media/transcription/cipn/5.pt/ref/hapmap_3.3.hg19.sites.vcf
-resource:omni,known=false,training=true,truth=false,prior=12.0 /media/transcription/cipn/5.pt/ref/1000G_omni2.5.hg19.sites.vcf
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /media/transcription/cipn/5.pt/ref/dbsnp_135.hg19.vcf.gz
-an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff
-mode SNP
I get a tranche plot, which does not look OK. The "Number of Novel Variants [1000s]" goes from -400 to 800 and the Ti/Tv ratio varies from 0.633 to 0.782 [the attach file link is not working for me and am unable to upload the plot]. Any suggestion to rectify this would be very helpful !
cheers,
Rahul
ts_filter_level vs. tranche settings
Hi,
I'm having a little trouble understanding the relationship between the -ts_filter_level and -tranche settings for VQSR. If I'm not mistaken the defaults are 99 and [100,99.9,99.0,90] respectively. When I run VQSR with these defaults, my tranches are altered because of the 99 ts filter level. I get:
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=TruthSensitivityTranche99.00to99.90,Description="Truth sensitivity tranche level at VSQ Lod: -0.1838 <= x < 3.1102">
##FILTER=<ID=TruthSensitivityTranche99.90to100.00+,Description="Truth sensitivity tranche level at VQS Lod < -6135.0237">
##FILTER=<ID=TruthSensitivityTranche99.90to100.00,Description="Truth sensitivity tranche level at VSQ Lod: -6135.0237 <= x < -0.1838">
Is it odd that there are two tranches with the same ts values and different VQSLOD values? If I adjust the ts filter level to 90, I get what I originally expected to see:
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=TruthSensitivityTranche90.00to99.00,Description="Truth sensitivity tranche level at VSQ Lod: 2.5901 <= x < 4.8133">
##FILTER=<ID=TruthSensitivityTranche99.00to99.90,Description="Truth sensitivity tranche level at VSQ Lod: -0.692 <= x < 2.5901">
##FILTER=<ID=TruthSensitivityTranche99.90to100.00+,Description="Truth sensitivity tranche level at VQS Lod < -6.11002079587E7">
Is it just me, or does this look like an incompatibility between the default values? Which is more important, correct ts filtering or correct tranche intervals? We will at times filter based on these tranches, so I'd like to set them correctly. Thanks.
Ben
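To compare the tranche boundaries between runs, the VQSLOD limits can be pulled out of the `##FILTER` header lines. Below is a minimal Python sketch. The regular expressions assume the exact header wording quoted in this post (including the "VSQ Lod" and "VQS Lod" spellings), which may differ in other GATK versions, and the function name is hypothetical.

```python
import re

# Signed decimal number with an optional exponent, e.g. -0.1838 or -6.11002079587E7.
_NUMBER = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
_ID = re.compile(r"ID=(TruthSensitivityTranche[^,]+),")
_RANGE = re.compile(rf"({_NUMBER}) <= x < ({_NUMBER})")
_BELOW = re.compile(rf"Lod < ({_NUMBER})")


def parse_tranche_filter(header_line):
    """Return (filter_id, lower_vqslod, upper_vqslod), or None if the line is
    not a tranche filter. lower_vqslod is None for the open-ended tranche."""
    id_match = _ID.search(header_line)
    if id_match is None:
        return None
    range_match = _RANGE.search(header_line)
    if range_match:
        return (
            id_match.group(1),
            float(range_match.group(1)),
            float(range_match.group(2)),
        )
    below_match = _BELOW.search(header_line)
    if below_match:
        return id_match.group(1), None, float(below_match.group(1))
    return None


bounded = parse_tranche_filter(
    '##FILTER=<ID=TruthSensitivityTranche99.00to99.90,Description="Truth '
    'sensitivity tranche level at VSQ Lod: -0.1838 <= x < 3.1102">'
)
open_ended = parse_tranche_filter(
    '##FILTER=<ID=TruthSensitivityTranche99.90to100.00+,Description="Truth '
    'sensitivity tranche level at VQS Lod < -6.11002079587E7">'
)
print(bounded)
print(open_ended)
```

Sorting the parsed tuples by their upper bound gives a quick side-by-side view of how the intervals shift when `-ts_filter_level` is changed.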
How to use the VQSR -tranche argument
How should I use the VQSR -tranche argument?
From the tutorial I get that I should specify the list of doubles like this:
-tranche [100.0, 99.9, 99.0, 90.0]
http://www.broadinstitute.org/gatk/guide/topic?name=tutorials#id2805
But when I try that like this
java -jar GenomeAnalysisTK-2.6-3-gdee51c4/GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R ref.fa \
-input input.vcf \
-resource:snparray,known=true,training=true,truth=true,prior=15.0 input_concordantW_SNPArray.vcf \
-an QD -an ReadPosRankSum -an MQRankSum -an MQ -an FS -an DP -an ClippingRankSum -an BaseQRankSum -an AF \
-titv 2.5 \
--mode SNP \
-recalFile input.recal \
-tranchesFile input.tranches \
-rscriptFile input.plots.R \
-tranche [100.0, 99.9, 99.0, 90.0]
I get
`##### ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.6-3-gdee51c4):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Invalid argument value '99.9,' at position 38.
ERROR Invalid argument value '99.0,' at position 39.
ERROR Invalid argument value '90.0]' at position 40.
ERROR ------------------------------------------------------------------------------------------`
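The error lists '99.9,', '99.0,' and '90.0]' as separate invalid values, which is consistent with the shell splitting the bracketed list on spaces before GATK ever sees it. The Python sketch below reproduces that tokenization with `shlex` and builds the alternative of passing the flag once per value. Repeating the flag is an assumption here, based on the repeated `--TStranche 90.0 --TStranche 93.0 ...` usage in another command on this page; the helper name `tranche_args` is hypothetical, and `-h` for your GATK version remains the reference.

```python
import shlex


def tranche_args(levels, flag="-tranche"):
    """Build one `flag value` pair per tranche level (assumed repeatable flag)."""
    args = []
    for level in levels:
        args.extend([flag, str(level)])
    return args


# How a POSIX-style shell tokenizes the bracketed form used in the post.
bracketed_tokens = shlex.split("-tranche [100.0, 99.9, 99.0, 90.0]")

# One flag per value instead.
repeated_tokens = tranche_args([100.0, 99.9, 99.0, 90.0])

print(bracketed_tokens)
print(" ".join(repeated_tokens))
```

The second printed line can be pasted at the end of the command in place of the bracketed list.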
Any way to get past "Clustering with this few variants and these annotations is unsafe."?
Hi team, thanks for a great job developing this software!
I am planning to use the GATK in a class as a demo of how to do SNP detection and the VQSR in a non-model organism, but due to time constraints I have a very small dataset (12 samples of 100K reads each).
I am using a SNP Q>20 for an initial round of SNP detection, which I then use as a "true" training set for the VQSR and use a call set with Q>3 as my variants of interest.
I keep getting the error message "NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)"
which is not surprising, even though I have already set --maxGaussians 2 -percentBad 0.01 -minNumBad 50
to reiterate, this is for educational purposes - I am wondering if I can move past this error message and get an output file despite this error?
Thanks!
/Pierre De Wit
VQSR
Hi,
I am working on dog genome and trying to use VQSR on my data.
Here is the command i have used:
java -Xmx4G -jar GenomeAnalysisTK.jar -R genome.fa -T VariantRecalibrator -input GATK-snp.vcf -resource:dbsnp,known=false,training=true,truth=true,prior=6.0 canFam3_SNP.vcf -mode SNP -recalFile output.recal -tranchesFile output.tranches -rscriptFile output.plots.R -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an Inbreed
1. I have only a dbSNP file as a training set, and I first set the options known=true,training=false,truth=false,prior=6.0 on the command line as per the documentation. That did not work, and the tool instead suggested using known=false,training=true,truth=true,prior=6.0. What does prior=6.0 mean here? Is there any threshold for the prior?
2. The above command produces empty tranches and recal files.
3. Even though the files are empty, I proceeded to ApplyRecalibration with the command below:
java -Xmx4G -jar GenomeAnalysisTK.jar -R genome.fa -T ApplyRecalibration -input GATK-snp.vcf --ts_filter_level 99.0 -tranchesFile output.tranches -recalFile output.recal -mode SNP -o recalibrated.filtered.vcf.
It gives the error:
ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:
ERROR Name FeatureType Documentation
ERROR BCF2 VariantContext http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_variant_bcf2_BCF2Codec.html
ERROR VCF VariantContext http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_variant_vcf_VCFCodec.html
ERROR VCF3 VariantContext http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_variant_vcf_VCF3Codec.html
ERROR
Any help to fix these?
VQSR on GENOTYPE_GIVEN_ALLELES mode
When using GENOTYPE_GIVEN_ALLELES with HaplotypeCaller, which uses EMIT_ALL_SITES and so produces many calls where the entire cohort is nonvariant, do these reference-only sites have to be filtered out before running VQSR?
ERROR stack trace ; Unable to retrieve result ; A GATK RUNTIME ERROR has occurred
Hi,
Thanks very much for your answers to my previous questions. I have run into another difficulty when running the VQSR steps: some ERROR output appeared on the screen, as follows:
INFO 18:10:01,046 GaussianMixtureModel - Initializing model with 30 k-means iterations...
INFO 18:10:01,165 VariantRecalibratorEngine - Finished iteration 0.
INFO 18:10:01,186 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.15059
INFO 18:10:01,196 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.06115
INFO 18:10:01,206 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.34881
INFO 18:10:01,208 VariantRecalibratorEngine - Convergence after 16 iterations!
INFO 18:10:01,211 VariantDataManager - Found 0 variants overlapping bad sites training tracks.
INFO 18:10:27,971 ProgressMeter - chr1:249230318 4.34e+06 90.0 s 20.0 s 100.0% 90.0 s 0.0 s
ERROR ------------------------------------------------------------------------------------------
ERROR stack trace
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Unable to retrieve result
at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:190)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
Caused by: java.lang.NullPointerException
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantDataManager.selectWorstVariants(VariantDataManager.java:278)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:333)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:132)
at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.notifyTraversalDone(HierarchicalMicroScheduler.java:226)
at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:183)
... 5 more
ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.7-2-g6bda569):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Unable to retrieve result
ERROR ------------------------------------------------------------------------------------------
I think the parameters I set are all right:
java -jar /ifs1/ST_POP/USER/lantianming/HUM/bin/GenomeAnalysisTK-2.7-2-g6bda569/GenomeAnalysisTK.jar
-R /ifs1/ST_POP/USER/lantianming/HUM/reference_human/chr1.fa
--maxGaussians 4
-numBad 4000
-T VariantRecalibrator
-mode SNP
-input /ifs1/ST_POP/USER/lantianming/HUM/align/bwa/split_1_22_X_Y_M/chr1/chr1.recal_10.vcf
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /nas/RD_09C/resequencing/soft/pipeline/GATK/bundle/2.5/hg19/dbsnp_137.hg19.vcf
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 /nas/RD_09C/resequencing/soft/pipeline/GATK/bundle/2.5/hg19/hapmap_3.3.hg19.vcf
-resource:omni,known=false,training=true,truth=false,prior=12.0 /nas/RD_09C/resequencing/soft/pipeline/GATK/bundle/2.5/hg19/1000G_omni2.5.hg19.vcf
-an DP -an FS -an HaplotypeScore -an MQ0 -an MQ -an QD
-recalFile /ifs1/ST_POP/USER/lantianming/HUM/align/bwa/split_1_22_X_Y_M/chr1/chr1.vcf.snp_11.recal
-tranchesFile /ifs1/ST_POP/USER/lantianming/HUM/align/bwa/split_1_22_X_Y_M/chr1/chr1.vcf.snp_11.tranches
-rscriptFile /ifs1/ST_POP/USER/lantianming/HUM/align/bwa/split_1_22_X_Y_M/chr1/chr1.vcf.snp_11.plot.R -nt 4
--TStranche 90.0 --TStranche 93.0 --TStranche 95.0 --TStranche 97.0
My input file is chr1, the sequencing depth is about 1×, and 4000 SNP sites were called with UnifiedGenotyper.
So what I am not sure about is whether this number of SNP sites is enough for VQSR.
Could you please give me some suggestions? Thanks very much!
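With only about 4000 call sites, one quick sanity check is how many of them coincide with sites in the training resources (HapMap and Omni in the command above), since VariantRecalibrator builds its positive model from call-set variants that overlap the training sets. The Python sketch below counts that overlap. It matches on chromosome and position only and ignores alleles, it assumes uncompressed VCF text, and the function names are hypothetical.

```python
def site_keys(vcf_lines):
    """Return the set of (CHROM, POS) pairs for all records in a VCF."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos = line.split("\t", 2)[:2]
        keys.add((chrom, int(pos)))
    return keys


def count_training_overlap(call_lines, training_lines):
    """Count call-set sites that are also present in the training resource."""
    return len(site_keys(call_lines) & site_keys(training_lines))


# Tiny in-memory example with made-up positions.
calls = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\t.\t.\n",
    "chr1\t200\t.\tC\tT\t50\t.\t.\n",
    "chr1\t300\t.\tG\tA\t50\t.\t.\n",
]
training = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t200\trs1\tC\tT\t.\t.\t.\n",
    "chr1\t300\trs2\tG\tA\t.\t.\t.\n",
    "chr1\t400\trs3\tT\tC\t.\t.\t.\n",
]
overlap = count_training_overlap(calls, training)
print(overlap)
```

A very small overlap would suggest that the model has little to train on, regardless of the `--maxGaussians` or `-numBad` settings.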
VariantRecalibrator plot?
Hi again!
Could you please help me to generate the first plot in the attached file which refers to VariantRecalibrator?
In other words, is this plot generated at the same time as my_sample.bqrecal.vqsr.R.scripts.pdf? If so, maybe some R library is missing, but I can't find anything wrong in the log files (my_sample.bqrecal.vqsr.R.scripts.pdf looks fine and healthy to me).
Thanks in advance,
Rodrigo.
NaN LOD in VQSR
Hi all, I'm running VariantRecalibrator on a SNP set (47 exomes) and I get this error:
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.2-3-gde33222):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)
##### ERROR ------------------------------------------------------------------------------------------
this is the command line:
java -Djava.io.tmpdir=/lustre2/scratch/ -Xmx32g -jar /lustre1/tools/bin/GenomeAnalysisTK-2.2-3.jar \
-T VariantRecalibrator \
-R /lustre1/genomes/hg19/fa/hg19.fa \
-input /lustre1/workspace/Ferrari/Carrera/Analysis/UG/bpd_ug.SNP.vcf \
-resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 /lustre1/genomes/hg19/annotation/hapmap_3.3.hg19.sites.vcf.gz \
-resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 /lustre1/genomes/hg19/annotation/1000G_omni2.5.hg19.sites.vcf.gz \
-resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 /lustre1/genomes/hg19/annotation/dbSNP-137.chr.vcf -an QD \
-an HaplotypeScore \
-an MQRankSum \
-an ReadPosRankSum \
-an FS \
-an MQ \
-an DP \
-an QD \
-an InbreedingCoeff \
-mode SNP \
-recalFile /lustre2/scratch/Carrera/Analysis2/snp.ug.recal.csv \
-tranchesFile /lustre2/scratch/Carrera/Analysis2/snp.ug.tranches \
-rscriptFile /lustre2/scratch/Carrera/Analysis2/snp.ug.plot.R \
-U ALLOW_SEQ_DICT_INCOMPATIBILITY \
--maxGaussians 6
I've already tried decreasing the --maxGaussians option to 4, and I've also added the --percentBad option (setting it to 0.12, as for INDEL), but I still get the error.
I've added the -debug option to see what's happening, but apparently this has been removed from GATK-2.2.
Any help is appreciated...
thanks