Channel: vqsr — GATK-Forum

Is BQSR necessary for de novo mutation calling?


Hi there,
I have been using GATK to identify variants recently, and I saw that BQSR is highly recommended. But I don’t know whether it is still needed for de novo mutation calling. For example, I want to identify de novo mutations arising in progeny produced by single-seed descent in plants, as in the paper “The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana”. These spontaneously arising mutations may not be included in the known variant sites. According to the documentation on the GATK website, BQSR assumes that all reference mismatches we see are errors and indicative of poor base quality. Under this assumption, these de novo mutations may be missed at the variant calling step. So in this situation, what should I do? Should I skip the BQSR step?
Also, what should I do when I reach the VQSR step?
Hope some GATK developers can help me on this.
Thanks.


VQSR


Hi,

I am working on non-human species data and I have used VQSR in the analysis pipeline as shown below:
If VQSR is performed, should we still consider filtering the variants on base quality and mapping quality?

VQSR for small exome data sets


Hi, I'm working with trios and small-pedigrees (up to six individuals). The VQSR section of the 'best practice' document states that 'in order to achieve the best exome results one needs an exome callset with at least 30 samples', and suggests to add additional samples such as 1000 genomes BAMs.
I'm a little confused about two aspects:
1) the addition of 1000G BAMs being suggested in the VQSR section. If we need the 1000G call sets, we'd have to run these through the HaplotypeCaller or UnifiedGenotyper stages? Please forgive the question - I'm not trying to find fault in your perfect document, but please confirm as it would dramatically increase compute time (though only once), and overlaps with my next point of confusion:
2) I can understand how increasing the number of individuals from a consistent cohort, or maybe even from very similar experimental platforms, improves the outcome of the VQSR stage. However, the workshop video comments that the variant call properties are highly dependent on individual experiments (design, coverage, technical, etc). So I can't understand how the overall result is improved when I add variant calls from 30 1000G exomes (with their own typical variant quality distributions) to my trio's sample variant calls (also with their own, but very different to the 1000G's, quality distribution).

Hopefully I'm missing an important point somewhere?
Many thanks in advance,
K

Detecting SNV in human populations


Hello Geraldine,

First, thank you very much for your amazing work on this forum. My project deals with discovering rare population-specific variants in human exomes, and I would like to know how the VQSR step would affect the discovery of these variants. Is it better to perform VQSR on all the populations together (420 individuals, but with a risk of filtering out "true" rare population-specific variants) or to run it per population (between 30 and 100 individuals each, but I read that VQSR loses power with a reduced number of samples)?

Thank you for your help,
Best
Marie

Variant Recalibration on related Samples


Hi GATK team,

Our lab has a never-ending discussion about whether to run VQSR on related samples or to exclude them, and I guess we need your help to settle it.

We have a multisample call (UG) run on ~1,500 samples, which contains all sorts of unrelated samples, trios and small families. Our statistician is trying to convince us to exclude all related samples, because they might skew the VQSR model. The biologists don't follow this argument, and neither side has been able to convince the other.
Do related samples disturb VQSR?

Even more specific - if we run VQSR on tumor/normal pairs - should we expect surprising behaviour of the model or can we just run the recalibration without worries?

thanks for your help in advance,
Oliver

question about odd result for VQSR


Hi,

Recently I ran into an odd observation with VQSR. I have 17 samples from the same family. I used all 17 samples to call SNPs, and after VQSR I got a tranches file like this:

Variant quality score tranches file

Version number 5

targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,48637,716,2.9527,2.3302,4.8390,VQSRTrancheSNP0.00to90.00,SNP,26182,23563,0.9000
99.00,60114,1531,2.8057,2.3333,1.7766,VQSRTrancheSNP90.00to99.00,SNP,26182,25920,0.9900
99.90,67220,2884,2.7190,1.8222,-10.0009,VQSRTrancheSNP99.00to99.90,SNP,26182,26155,0.9990
100.00,69714,4998,2.6822,1.8300,-1122.0698,VQSRTrancheSNP99.90to100.00,SNP,26182,26182,1.0000

which seems fine. Then, for research purposes, I used only 5 more closely related samples (two parents and their 3 children), and after VQSR the tranches file looks like this:

Variant quality score tranches file

Version number 5

targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP0.00to90.00,SNP,20850,20850,1.0000
99.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP90.00to99.00,SNP,20850,20850,1.0000
99.90,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP99.00to99.90,SNP,20850,20850,1.0000
100.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP99.90to100.00,SNP,20850,20850,1.0000

Notice that the 5-sample VQSR tranches file has exactly the same values at all thresholds: 90, 99, 99.90 and 100. The VQSR model plot is also very odd: nothing is plotted at all (the PDF file was created but is almost blank, in contrast to the normal projection plots I have seen in other cases).

However, we used the old version to call the same 5 samples before, and the tranches file looked like this:

Variant quality score tranches file

Version number 4

targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,36407,361,2.8657,2.3119,5.0854,TruthSensitivityTranche0.00to90.00,20814,18732,0.9000
99.00,44097,638,2.7655,2.2222,2.2592,TruthSensitivityTranche90.00to99.00,20814,20605,0.9900
99.90,47947,1061,2.7078,1.8750,-7.4143,TruthSensitivityTranche99.00to99.90,20814,20793,0.9990
100.00,50426,2318,2.6645,1.7677,-647.3944,TruthSensitivityTranche99.90to100.00,20814,20814,1.0000

This time it looks reasonable to me. This is troubling us: for 5 samples, the old version (v1.6-7) seems to work fine, whereas the new version (v2.1-13) seems to have an issue and cannot filter any further with VQSR (90, 99 and 100 give the same result; I repeated the run multiple times and got the same results). For all 17 samples, though, the new version seems fine with VQSR.

So my questions are:
1. Is it possible that on some occasions VQSR simply does not work?
2. Why does the old version seem to work but not the new version, for exactly the same 5-sample data set?

Thanks a lot for your help!

Mike
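
The symptom described in this post (identical rows with a minVQSLod of -Infinity) can be checked mechanically before trusting a tranches file. The following is a minimal sketch, assuming the comma-separated layout quoted above, where minVQSLod is the sixth column in both the version 4 and version 5 files. `check_tranches` is a hypothetical helper, not a GATK tool, and the sample rows are copied from the 5-sample file in this post.

```shell
#!/bin/sh
# check_tranches FILE: print "degenerate" if any minVQSLod is non-finite or
# if all tranches share a single minVQSLod value; otherwise print "ok".
check_tranches() {
    awk -F',' '
        # Skip comment lines, title lines, blank lines, and the column header.
        /^#/ || NF < 6 || $1 == "targetTruthSensitivity" { next }
        {
            rows++
            if ($6 ~ /Infinity|NaN/) { nonfinite = 1 }
            if (!($6 in seen)) { seen[$6] = 1; distinct++ }
        }
        END {
            if (rows == 0 || nonfinite || (rows > 1 && distinct == 1)) {
                print "degenerate"
            } else {
                print "ok"
            }
        }
    ' "$1"
}

# Rows copied from the 5-sample tranches file quoted in this post.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP0.00to90.00,SNP,20850,20850,1.0000
99.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP90.00to99.00,SNP,20850,20850,1.0000
99.90,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP99.00to99.90,SNP,20850,20850,1.0000
100.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP99.90to100.00,SNP,20850,20850,1.0000
EOF
check_tranches "$tmp"    # prints: degenerate
rm -f "$tmp"
```

A check like this only flags the symptom; it says nothing about why the model failed to separate the variants.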

Random Forests VQSR


Hi there,

I hope I'm not being too forward here, but I was wondering whether your group is still looking into implementing an RF model for VQSR, or whether you have abandoned it? In particular, I was hoping it would help with smaller datasets, in terms of the number of variant sites for smaller-than-exome captures.

Best Regards,

Kurt

Stack trace error while running VQSR


I'm trying to run VQSR on a vcf I just called with HaplotypeCaller. Here is my command:

java -Xmx32g -jar /Commands/GATK/GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R /Reference/ucsc.hg19.fasta \
-input H3H5.HTC.raw.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 /Reference/hapmap3.3.hg19.vcf \
-resource:omni,known=false,training=true,truth=false,prior=12.0 /Reference/1000G.omni2.5.hg19.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 /Reference/1000G.ph1.SNP.HC.hg19.vcf \
-resource:dbsnp,known=true,training=true,truth=false,prior=6.0 /Reference/dbsnp138.hg19.vcf \
-an QD -an MQRankSum -an ReadPosRankSum -an FS \
-mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
-recalFile VQSR/H3H5.SNP.VQSR \
-tranchesFile VQSR/H3H5.SNP.Tranches \
-rscriptFile VQSR/H3H5.SNP.VQSR.R \
-nt 16

Each time I try to run VQSR it gives me this error:

INFO 12:23:34,122 VariantRecalibratorEngine - Finished iteration 95. Current change in mixture coefficients = 0.00198
INFO 12:23:34,122 VariantRecalibratorEngine - Convergence after 95 iterations!
INFO 12:23:34,461 VariantRecalibratorEngine - Evaluating full set of 251205 variants...
INFO 12:23:34,476 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
INFO 12:23:39,194 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Unable to retrieve result
at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:190)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
Caused by: java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:399)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:143)
at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.notifyTraversalDone(HierarchicalMicroScheduler.java:226)
at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:183)
... 5 more

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Unable to retrieve result
ERROR ------------------------------------------------------------------------------------------

In the discussion of similar errors, I've seen that too little data and MQ annotation can cause similar problems, but I didn't use either of them.

I'm going to guess that it is something simple, but any help would be appreciated.

Chris


Error: not valid known dog vcf file when using VariantRecalibrator


Hello, I am working on dog targeted sequencing data. In the VQSR step, I got the error below. For the record, I use canFam3.fa (from UCSC hg19) as the reference and Canis_familiaris.newchr.vcf (Ensembl) as the resource file; the two files didn't cause errors in previous steps. Has anyone had a similar problem? Thanks for any tips!

INFO 18:07:03,352 HelpFormatter - --------------------------------------------------------------------------------
INFO 18:07:03,355 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
INFO 18:07:03,355 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 18:07:03,355 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 18:07:03,359 HelpFormatter - Program Args: -T VariantRecalibrator -R canFam3.fa -input ./variant_calling/FGC0805.target.raw.snps.indels.vcf -resource:dbsnp,known=false,training=true,truth=false,prior=12.0 Canis_familiaris.newchr.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile ./variant_calling/FGC0805.target.recalibrate.SNP.recal -tranchesFile ./variant_calling/FGC0805.target.recalibrate.SNP.tranches -rscriptFile ./variant_calling/FGC0805.target.recalibrate.SNP.plots.R
INFO 18:07:03,363 HelpFormatter - Executing as wangfan1@bioapps on Linux 2.6.32-358.14.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_65-b17.
INFO 18:07:03,364 HelpFormatter - Date/Time: 2014/12/02 18:07:03
INFO 18:07:03,364 HelpFormatter - --------------------------------------------------------------------------------
INFO 18:07:03,364 HelpFormatter - --------------------------------------------------------------------------------
INFO 18:07:04,430 GenomeAnalysisEngine - Strictness is SILENT
INFO 18:07:05,183 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 18:07:06,187 GenomeAnalysisEngine - Preparing for traversal
INFO 18:07:06,217 GenomeAnalysisEngine - Done preparing for traversal
INFO 18:07:06,218 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 18:07:06,219 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 18:07:06,219 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 18:07:06,231 TrainingSet - Found dbsnp track: Known = false Training = true Truth = false Prior = Q12.0
INFO 18:07:36,226 ProgressMeter - Starting 0.0 30.0 s 49.6 w 100.0% 30.0 s 0.0 s

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Invalid command line: No truth set found! Please provide sets of known polymorphic loci marked with the truth=true ROD binding tag. For example, -resource:hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmapFile.vcf
ERROR ------------------------------------------------------------------------------------------
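
The final error message names the cause: none of the `-resource` arguments is tagged `truth=true`. A small pre-flight check along these lines could catch that before launching the tool. This is only a sketch: `has_truth_set` is a hypothetical helper, and the argument string is abbreviated from the Program Args line in the log above. Whether a given resource is trustworthy enough to serve as a truth set is a separate judgment the script cannot make.

```shell
#!/bin/sh
# has_truth_set "ARGS": succeed (exit status 0) if at least one -resource
# tag list in the argument string contains truth=true.
has_truth_set() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep -q '^-resource:.*truth=true'
}

# Abbreviated from the Program Args line in the log above.
args='-T VariantRecalibrator -R canFam3.fa -resource:dbsnp,known=false,training=true,truth=false,prior=12.0 Canis_familiaris.newchr.vcf -mode SNP'

if has_truth_set "$args"; then
    echo "truth set present"
else
    echo "no truth set: tag a trusted resource with truth=true"
fi
```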

Best Practices VQSR parameters updated


The Best Practices recommendations for Variant Quality Score Recalibration have been slightly updated to use the new(ish) StrandOddsRatio (SOR) annotation, which complements FisherStrand (FS) as an indicator of strand bias (only available in GATK version 3.3-0 and above).

While we were at it we also reconciled some inconsistencies between the tutorial and the FAQ document. As a reminder, if you ever find differences between parameters given in the VQSR docs, let us know, but FYI that the FAQ is the ultimate source of truth=true. Note also that the command line example given in VariantRecalibrator tool doc tends to be out of date because it can only be updated with the next release (due to a limitation of the tool doc generation system) and, well, we often forget to do it in time -- so it should never be used as a reference for Best Practice parameter values, as indicated in the caveat right underneath it which no one ever reads.

Speaking of caveats, there's no such thing as too much repetition of the fact that whole genomes and exomes have subtle differences that require some tweaks to your command lines. In the case of VQSR, that means dropping Coverage (DP) from your VQSR command lines if you're working with exomes.

Finally, keep in mind that the values we recommend for tranches are really just examples; if there's one setting you should freely experiment with, that's the one. You can specify as many tranche cuts as you want to get really fine resolution.
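
The exome caveat above lends itself to a small wrapper that builds the `-an` arguments and drops DP when the data are exomes. The sketch below is illustrative only: `build_an_args` is a hypothetical helper, and the annotation list passed to it is an example rather than the official Best Practices list, which should be taken from the FAQ.

```shell
#!/bin/sh
# build_an_args MODE ANNOTATION...: print "-an X" flags for each annotation,
# leaving out DP when MODE is "exome".
build_an_args() {
    mode=$1
    shift
    out=""
    for a in "$@"; do
        if [ "$mode" = "exome" ] && [ "$a" = "DP" ]; then
            continue                      # coverage is not used for exomes
        fi
        out="$out -an $a"
    done
    printf '%s\n' "${out# }"              # strip the leading space
}

# Example annotation list (an assumption, not the official recommendation).
build_an_args exome QD FS SOR MQRankSum ReadPosRankSum DP
# prints: -an QD -an FS -an SOR -an MQRankSum -an ReadPosRankSum
```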

HaplotypeScore as an annotation for VariantRecalibrator after UnifiedGenotyper


The documentation on the HaplotypeScore annotation reads:

HaplotypeCaller does not output this annotation because it already evaluates haplotype segregation internally. This annotation is only informative (and available) for variants called by Unified Genotyper.

The annotation used to be part of the best practices:

http://gatkforums.broadinstitute.org/discussion/15/best-practice-variant-detection-with-the-gatk-v1-x-retired

I will include it in the VQSR model for UG calls from low coverage data. Is this an unwise decision? I guess this is for myself to evaluate. I thought I would ask, in case I have missed something obvious.

Workflow using HaplotypeCaller, GenotypeGVCFs, VQSR, and CalculateGenotypePosteriors


Hi,

I have recal.bam files for all the individuals in my study (these constitute 4 families), and each bam file contains information for one chromosome for one individual. I was wondering if it is best for me to pass all the files for a single individual together when running HaplotypeCaller, if it will increase the accuracy of the calling, or if I can just run HaplotypeCaller on each individual bam file separately.

Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and if it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of?

Thanks very much in advance,

Alva

VariantRecalibrator - "N" reference allele only in .recal files


Hi,

I ran VariantRecalibrator and ApplyRecalibration, and everything seems to have worked fine. I just have one question: if there are no reference alleles besides "N" in my recalibrate_SNP.recal and recalibrate_INDEL.recal files, and in the "alt" field simply displays , does that mean that none of my variants were recalibrated? Just wanted to be completely sure. My original file (after running GenotypeGVCFs) has the same number of variants as the recalibrated vcf's.

Thanks,
Alva

Use CombineVariants with chromosome-specific vcf files


Hi,

I have generated vcf files using GenotypeGVCFs; each file contains variants corresponding to a different chromosome. I would like to use VQSR to perform the recalibration on all these data combined (for maximum power), but it seems that VQSR only takes a single vcf file, so I would have to combine my vcf files using CombineVariants. Looking at the documentation for CombineVariants, it seems that this tool always produces a union of vcfs. Since each vcf file is chromosome-specific, there are no identical sites across files. My questions are: Is CombineVariants indeed the appropriate tool for me to merge chromosome-specific vcf files, and is there any additional information that I should specify in the command-line when doing this? Do I need to run VariantAnnotator afterwards (I would assume not, since these vcfs were generated using GenotypeGVCFs and the best practices workflow more generally)? I just want to be completely sure that I am proceeding correctly.

Thank you very much in advance,
Alva
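
Because the per-chromosome files described in this post share no sites, merging them amounts to concatenation. As an illustration of what that means at the file level (not a replacement for a dedicated tool such as CombineVariants or bcftools concat), here is a minimal sketch. It assumes every file has the same header and sample columns and that the files are given in reference contig order. `concat_vcfs` is a hypothetical helper, and the two tiny VCFs are made-up toy data.

```shell
#!/bin/sh
# concat_vcfs FILE...: print the header of the first file, then the records
# of every file in the order given.
concat_vcfs() {
    grep '^#' "$1"                        # header from the first file only
    for f in "$@"; do
        grep -v '^#' "$f" || true         # records; tolerate a file with none
    done
}

# Toy per-chromosome files (hypothetical data, for illustration only).
dir=$(mktemp -d)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr1\t100\t.\tA\tG\t50\t.\t.\n' > "$dir/chr1.vcf"
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr2\t200\t.\tC\tT\t60\t.\t.\n' > "$dir/chr2.vcf"

concat_vcfs "$dir/chr1.vcf" "$dir/chr2.vcf" > "$dir/all.vcf"
grep -vc '^#' "$dir/all.vcf"              # prints: 2 (records in the merged file)
rm -rf "$dir"
```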

VQSR and sex chromosomes


Hi,

Maybe I have not been able to find some obvious piece of documentation, but I am searching for best practices in using VQSR with sex chromosomes (especially X)? I am trying to do variant calling on Anopheles gambiae genomes (sex chromosomes like human) and the results with chromosome X are not very encouraging. I was wondering if there is any documentation/best practices for VQSR with especially X. Or even if people are using VQSR with sex chromosomes?

Clueless and lost,
Tiago


how to filter SNPs/Indels after VQSR


When I finished VQSR, I got a VCF file, "recalibrated_variants.vcf":

[wubin]$ awk -F"\t" 'NR>161{print $7}' recalibrated_variants.vcf|sort|uniq -c
65902 LowQual
3163999 PASS
122377 VQSRTrancheINDEL90.00to99.00
53509 VQSRTrancheINDEL99.00to99.90
4589 VQSRTrancheINDEL99.90to100.00
742359 VQSRTrancheSNP90.00to99.00
368105 VQSRTrancheSNP99.00to99.90
184493 VQSRTrancheSNP99.90to100.00

If I want 99% truth-site sensitivity, I can discard sites with these filters:

VQSRTrancheINDEL99.00to99.90
VQSRTrancheINDEL99.90to100.00
VQSRTrancheSNP99.00to99.90
VQSRTrancheSNP99.90to100.00
LowQual

and retain sites of

PASS
VQSRTrancheINDEL90.00to99.00

Am I right ?
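
The filter names in the listing above encode a sensitivity interval (for example, 90.00to99.00), so one way to express "keep everything up to 99%" is to keep PASS records plus any tranche whose upper bound does not exceed the target. The sketch below assumes the filter names follow exactly the pattern shown in this post. `keep_upto` is a hypothetical helper, and the input records are made-up toy data. Under this reading, both the SNP and the INDEL 90.00to99.00 tranches are retained, and LowQual records are dropped.

```shell
#!/bin/sh
# keep_upto TARGET < in.vcf: pass through header lines, PASS records, and
# records whose VQSRTranche filter interval ends at or below TARGET.
keep_upto() {
    awk -F'\t' -v target="$1" '
        /^#/ { print; next }                      # keep the VCF header
        $7 == "PASS" { print; next }
        $7 ~ /^VQSRTranche(SNP|INDEL)[0-9.]+to[0-9.]+$/ {
            hi = $7
            sub(/^.*to/, "", hi)                  # upper bound of the interval
            if (hi + 0 <= target + 0) { print }
        }
    '
}

# Toy records (hypothetical); column 7 is the FILTER field.
printf 'chr1\t10\t.\tA\tG\t50\tPASS\t.\nchr1\t20\t.\tA\tG\t50\tVQSRTrancheSNP90.00to99.00\t.\nchr1\t30\t.\tA\tG\t50\tVQSRTrancheSNP99.00to99.90\t.\nchr1\t40\t.\tAT\tA\t50\tVQSRTrancheINDEL90.00to99.00\t.\nchr1\t50\t.\tA\tG\t5\tLowQual\t.\n' |
    keep_upto 99.00 | cut -f7
# prints:
# PASS
# VQSRTrancheSNP90.00to99.00
# VQSRTrancheINDEL90.00to99.00
```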

SOR annotation for Indels


Hi,

In the best practices for VQSR in indel mode, it is recommended to use the SOR annotation. However, when I try to add this annotation using VariantAnnotator, it only adds it to the SNP calls, not the indel calls. Does this mean SOR should not be used for VQSR in indel mode?

Thanks,

Kath

importance of known sites/resources


Hi,
I have a general question about the importance of known VCFs (for BQSR and HC) and resource files (for VQSR). I am working on rice, for which the only known sites are the dbSNP VCF files, and these are built on a genome version older than the reference FASTA file I am using.
How does this affect the quality and accuracy of the variants? How important is it to use exactly the same genome build as the one on which the known VCF is based? Is it better to leave out the known sites for some of the steps than to use a file built on a different version of the genome for the same species? In other words, which steps (BQSR, HC, VQSR etc.) can be performed without the known sites/resource file?
If the answers to the above questions are too detailed, can you please point me to any document, if available, which might address this issue?

Thanks,
NB

GATK VQSR tranches


Hi all - I'm stumped and need your help. I'm following the GATK best practices for calling variants with HaplotypeCaller in GVCF mode. One of my samples is NA12878, among 119 other samples in my cohort. For some reason GATK is missing a bunch of variants in this sample that I can clearly see in IGV but that are not listed in the VCF. I discovered that one such variant is being filtered out, the reason being VQSRTrancheSNP99.00to99.90. The genotype is homozygous variant, DP is 243, QUAL is 524742.54 and it's known in dbSNP. I suspect this is happening to other variants.

How do I adjust VQSR, or how tranches are used and how variants get placed in them? I suppose I need to fine-tune my parameters, but I would think something as obvious as this variant would pass filtering.
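
A filter name such as VQSRTrancheSNP99.00to99.90 indicates that the variant's VQSLOD lies between the cutoffs for the 99.0 and 99.9 sensitivity tranches, so whether it passes depends on which tranche is chosen as the filter level. The sketch below looks up the minVQSLod cutoff for a given target sensitivity in a tranches file. `vqslod_cutoff` is a hypothetical helper, and the rows are borrowed from the 17-sample tranches file quoted in an earlier post in this channel, not from this poster's data.

```shell
#!/bin/sh
# vqslod_cutoff FILE SENS: print the minVQSLod (column 6) of the tranche
# whose targetTruthSensitivity (column 1) equals SENS.
vqslod_cutoff() {
    awk -F',' -v s="$2" '
        # Skip comments, title lines, and the column header.
        /^#/ || NF < 6 || $1 !~ /^[0-9.]+$/ { next }
        ($1 + 0) == (s + 0) { print $6; exit }
    ' "$1"
}

tmp=$(mktemp)
cat > "$tmp" <<'EOF'
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,48637,716,2.9527,2.3302,4.8390,VQSRTrancheSNP0.00to90.00,SNP,26182,23563,0.9000
99.00,60114,1531,2.8057,2.3333,1.7766,VQSRTrancheSNP90.00to99.00,SNP,26182,25920,0.9900
99.90,67220,2884,2.7190,1.8222,-10.0009,VQSRTrancheSNP99.00to99.90,SNP,26182,26155,0.9990
100.00,69714,4998,2.6822,1.8300,-1122.0698,VQSRTrancheSNP99.90to100.00,SNP,26182,26182,1.0000
EOF
vqslod_cutoff "$tmp" 99.0    # prints: 1.7766
vqslod_cutoff "$tmp" 99.9    # prints: -10.0009
rm -f "$tmp"
```

In that borrowed table, moving the filter level from 99.0 to 99.9 lowers the cutoff from 1.7766 to -10.0009, which would let variants in the 99.00to99.90 tranche pass at the cost of admitting more false positives.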

HaplotypeCaller, GenotypeGVCFs, and VQSR for alternatives in dbSNP


From my whole-genome (human) BAM files, I want to obtain:
For each variant in dbSNP, the GQ and VQSLOD associated with seeing that variant in my data.

Here's my situation using HaplotypeCaller -ERC GVCF followed by GenotypeGVCFs:
CHROM POS ID REF ALT
chr1 1 . A # my data
chr1 1 . A T # dbSNP
I would like to know the confidence (in terms of GQ and/or PL) of calling A/A, A/T, or T/T. The call of isn't useful to me for the reason explained below.

How can I get something like this to work? Besides needing a GATK-style GVCF file for dbSNP, I'm not sure how GenotypeGVCFs behaves if "tricked" with a fake GVCF not from HaplotypeCaller.

My detailed reason for needing this is below:

For positions of known variation (those in dbSNP), the reference base is arbitrary. For these positions, I need to distinguish between three cases:
1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1.
2. We have sufficient evidence to call position n as homozygous reference (0/0) with confidence scores GQ=x2 and VQSLOD=y2.
3. We do not have sufficient evidence to make any call for position n.

I was planning to use VQSR because the annotations it uses seem useful to distinguish between case 3 and either of 1 and 2. For example, excessive depth suggests a bad alignment, which decreases our confidence in making any call, homozygous reference or not.

Following the best practices pipeline using HaplotypeCaller -ERC GVCF, I get ALTs with associated GQs and PLs, and GT=./.. However, GenotypeGVCFs removes all of these, meaning that whenever the call by HaplotypeCaller was ./. (due to lack of evidence for variation), it isn't carried forward for use in VQSR.

Consequently, this seems to distinguish only between these two cases:
1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1.
2. We do not have sufficient evidence to call position n as a variant (it's either 0/0 or unknown).

This isn't sufficient for my application, because we care deeply about the difference between "definitely homozygous reference" and "we don't know".

Thanks in advance!

Douglas
