Channel: vqsr — GATK-Forum
Viewing all 326 articles

what does this mean in VQSR output vcf files?

I have the following FILTER entries in my VCF file output from VQSR. What does the "VQSRTrancheINDEL99.00to99.90" string mean? Did those variants fail the recalibration?

PASS
VQSRTrancheINDEL99.00to99.90
VQSRTrancheINDEL99.00to99.90
VQSRTrancheINDEL99.00to99.90
PASS
VQSRTrancheINDEL99.00to99.90
PASS
PASS
VQSRTrancheINDEL99.90to100.00
VQSRTrancheINDEL99.90to100.00
VQSRTrancheINDEL99.90to100.00
PASS
VQSRTrancheINDEL99.00to99.90
VQSRTrancheINDEL99.00to99.90
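For context when eyeballing a filtered VCF like the one above: a tranche label in the FILTER column (rather than PASS) indicates the record was filtered at the chosen sensitivity level. A quick way to tally the FILTER column is a few lines of Python; this is a generic sketch with synthetic records, not the poster's data:

```python
from collections import Counter

def tally_filters(vcf_lines):
    """Count FILTER column values (7th tab-separated field) in VCF body lines."""
    return Counter(
        line.split("\t")[6]
        for line in vcf_lines
        if line.strip() and not line.startswith("#")
    )

# Tiny inline example (synthetic records for illustration only):
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tAT\t50\tPASS\t.",
    "1\t200\t.\tAT\tA\t40\tVQSRTrancheINDEL99.00to99.90\t.",
    "1\t300\t.\tG\tGA\t30\tVQSRTrancheINDEL99.90to100.00\t.",
]
print(tally_filters(example))
```

The same tally on a real file shows at a glance how many records fall in each tranche versus PASS.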

Below is the command I used:

java -Xmx6g -jar $CLASSPATH/GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R GATK_ref/hg19.fasta \
-nt 5 \
--input ../GATK/VQSR/parallel_batch/combined_raw.snps_indels.vcf \
-mode INDEL \
--ts_filter_level 99.0 \
-recalFile ../GATK/VQSR/parallel_batch/Indels/exome.indels.vcf.recal \
-tranchesFile ../GATK/VQSR/parallel_batch/Indels/exome.indels.tranches \
-o ../GATK/VQSR/parallel_batch/Indels/exome.indels.filtered.vcf

VariantRecalibration, numBadVariants, and size of the data set

I'm somewhat struggling with the new negative training model in 2.7. Specifically, this paragraph in the FAQ causes me trouble:

Finally, please be advised that while the default recommendation for --numBadVariants is 1000, this value is geared for smaller datasets. This is the number of the worst scoring variants to use when building the model of bad variants. If you have a dataset that's on the large side, you may need to increase this value considerably, especially for SNPs.

And so I keep thinking about how to scale it with my dataset, and I keep wanting to just make it a percentage of the total variants - which is of course the behavior that was removed! In the Version History for 2.7, you say

Because of how relative amounts of good and bad variants tend to scale differently with call set size, we also realized it was a bad idea to have the selection of bad variants be based on a percentage (as it has been until now) and instead switched it to a hard number

Can you comment a little further about how it scales? I'm assuming it's non-linear, and my intuition would be that smaller sets have proportionally more bad variants. Is that what you've seen? Do you have any other observations that could help guide selection of that parameter?
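To frame the question concretely: the old percentage behaviour can still be approximated by hand before calling VariantRecalibrator, by counting the records in the call set and deriving --numBadVariants from a fraction. A minimal sketch (the 3% fraction and the 1000 floor are illustrative assumptions, not recommendations):

```python
def num_bad_from_fraction(total_variants, fraction=0.03, floor=1000):
    """Derive a --numBadVariants value as a fraction of the call set size,
    never dropping below the documented default of 1000. The 3% fraction
    here is an arbitrary placeholder, not a GATK recommendation."""
    return max(floor, int(total_variants * fraction))

# A small call set keeps the default; a large one scales up:
print(num_bad_from_fraction(10_000))     # small set -> floor applies
print(num_bad_from_fraction(1_000_000))  # large set -> fraction applies
```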

Interpreting VQSLOD and Tranche Quality in a Non-Human Model Organism

Hello there! Thanks as always for the lovely tools, I continue to live in them.

  • Been wondering how best to interpret my VQSLOD plots/tranches and subsequent VQSLOD scores.
    Attached are those plots, and a histogram of my VQSLOD scores as they are found across my replicate samples.

Methods Thus Far

We have HiSeq reads of "mutant" and wt fish, three replicates of each. The sequences were captured by size-selected digest, so some have amazing coverage but not all. The mutant fish should contain de novo variants of an almost cancer-like variety (Ti/Tv independent).

As per my interpretation of the best practices, I did an initial calling of the variants (HaplotypeCaller) and filtered them very heavily, keeping only those that could be replicated across all samples. Then I reprocessed and called variants again with that first set as a truth set. I also used the zebrafish dbSNP as "known", though I lowered the Bayesian priors of each from the suggested human ones. The rest of my pipeline follows the best practices fairly closely, GATK version was 2.7-2, and my mapping was with BWA MEM.

My semi-educated guess

The spike in VQSLOD I see for variants found across all six replicates is simply the rediscovery of those in my truth set, plus those with amazing coverage, which is probably fine/good. The part that worries me is the plots and tranches. The plots never really show a section where the "known" set clusters with one set of obviously good variants but not with another. Is that OK, or do that and my inflated VQSLOD values ring of poor practice?

how to run VQSR for individually called samples for rare variant discovery

We are running GATK HaplotypeCaller on ~50 whole-exome samples. We are interested in rare variants, so we ran GATK in single-sample mode instead of the multi-sample mode you recommend; however, we would like to take advantage of VQSR. What would you recommend? Can we run VQSR on the output from GATK single-sample calling?

Additionally, we are likely to run extra batches of new exome samples. Should we wait until we have them all before running them through the GATK pipeline?

Many thanks in advance.

Error Stack trace after running SelectVariants

I just wanted to select variants from a VCF with 42 samples. After 3 hours I got the following error. How can I fix this? Please advise. Thanks.
I had the same problem when I used VQSR. How can I fix it?

INFO 20:28:17,247 HelpFormatter - --------------------------------------------------------------------------------
INFO 20:28:17,250 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51
INFO 20:28:17,250 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 20:28:17,251 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 20:28:17,255 HelpFormatter - Program Args: -T SelectVariants -rf BadCigar -R /groups/body/JDM_RNA_Seq-2012/GATK/bundle-2.3/ucsc.hg19/ucsc.hg19.fasta -V /hms/scratch1/mahyar/Danny/data/Overal-RGSM-42prebamfiles-allsites.vcf -L chr1 -L chr2 -L chr3 -selectType SNP -o /hms/scratch1/mahyar/Danny/data/Filter/extract_SNP_only3chr.vcf
INFO 20:28:17,256 HelpFormatter - Date/Time: 2014/01/20 20:28:17
INFO 20:28:17,256 HelpFormatter - --------------------------------------------------------------------------------
INFO 20:28:17,256 HelpFormatter - --------------------------------------------------------------------------------
INFO 20:28:17,305 ArgumentTypeDescriptor - Dynamically determined type of /hms/scratch1/mahyar/Danny/data/Overal-RGSM-42prebamfiles-allsites.vcf to be VCF
INFO 20:28:18,053 GenomeAnalysisEngine - Strictness is SILENT
INFO 20:28:18,167 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 20:28:18,188 RMDTrackBuilder - Creating Tribble index in memory for file /hms/scratch1/mahyar/Danny/data/Overal-RGSM-42prebamfiles-allsites.vcf
INFO 23:15:08,278 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NegativeArraySizeException
at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:97)
at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:116)
at org.broad.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:84)
at org.broad.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:73)
at net.sf.samtools.util.AbstractIterator.next(AbstractIterator.java:57)
at org.broad.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:46)
at org.broad.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:24)
at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:73)
at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:35)
at org.broad.tribble.AbstractFeatureCodec.decodeLoc(AbstractFeatureCodec.java:40)
at org.broad.tribble.index.IndexFactory$FeatureIterator.readNextFeature(IndexFactory.java:428)
at org.broad.tribble.index.IndexFactory$FeatureIterator.next(IndexFactory.java:390)
at org.broad.tribble.index.IndexFactory.createIndex(IndexFactory.java:288)
at org.broad.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:278)
at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:388)
at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:274)
at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:211)
at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:140)
at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208)
at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:964)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:758)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:284)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.7-4-g6f46d11):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

VQSR and Snpcall strategies for readgroups with different coverage distributions

Hi - I have a question on how best to do VQSR on my samples. One of the readgroups for my individuals is from genomic DNA and has very even coverage (around 10x), while the remaining 4-5 readgroups are from Whole Genome Amplified (WGA) DNA. The WGA readgroups have very uneven coverage, ranging from 0 to over 1000 with a mean of around 30x (see attached image: blue is WGA and turquoise is genomic; the y-axis is depth and the x-axis is sliding windows along a chromosome). So I have WGA and genomic libraries for each individual, and their coverage distributions are very different.

We tested different SNP calling (UnifiedGenotyper) and VQSR strategies, and at the moment we think a strategy where we call and run VQSR on the genomic and WGA libraries separately, then combine them at the end, works best. However, I am interested in what the GATK team would have done in such a case. The reason we are doing it separately is that we think VQSR on the combined libraries would not be wise, since there is such a difference in depth (and strand bias) between the WGA and genomic readgroups.

If there were a way in the VQSR step to incorporate readgroup differences into the algorithm, it could maybe solve such a problem - but as far as I can see there is no such thing (we used the ReadGroupblacklist option when calling the readgroups separately), and for VQSR there is no "include readgroup effects" kind of option. Or does it intrinsically include readgroup information in the machine learning step? By the way, we did run BQSR, so the qualities would have been adjusted for readgroup effects. But there still seems to be a noticeable difference between the VQSR results we get from WGA vs genomic readgroups (for instance, WGA readgroups have consistently lower Hz than genomic readgroup calls - which we think is due to strand bias). From the VQSR plots it is clear that many SNPs are excluded in the WGA readgroups due to strand bias and DP - however, the bias is still visible after VQSR.

Sorry for the elaborate explanation. My question is how the GATK team would have handled SNP calling and VQSR when readgroup depth varies as much as in the attached image.

best way of filtering out common SNPs in the GATK outputted VCF file

In my Picard/GATK pipeline I already include the 1000G gold standard and dbSNP files in my VQSR step, and I am wondering if I should further filter the final VCF files. The two files I use are Mills_and_1000G_gold_standard.indels.hg19.vcf and dbsnp_137.hg19.vcf, downloaded from the GATK resource bundle.

I recently came across the NHLBI exome seq data http://evs.gs.washington.edu/EVS/#tabs-7, and the more complete 1000G variants ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/

These made me wonder if I should use these available VCFs to further filter my VCF files to remove the common SNPs. If so, can I use the "--mask" parameter in VariantFiltration to do the filtration? Example below, copied from the documentation page:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
       -R ref.fasta \
       -T VariantFiltration \
       -o output.vcf \
       --variant input.vcf \
       --filterExpression "AB < 0.2 || MQ0 > 50" \
       --filterName "Nov09filters" \
       --mask mask.vcf \
       --maskName InDel
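Independent of --mask, the same site-level subtraction can be sketched outside GATK: build a set of site keys from the common-SNP VCF and drop matching records. A minimal Python sketch with synthetic records (exact (CHROM, POS, REF, ALT) matching is an assumption made here for illustration; --mask itself works on positional overlap):

```python
def load_sites(vcf_lines):
    """Collect (CHROM, POS, REF, ALT) keys from a VCF used as a mask."""
    sites = set()
    for line in vcf_lines:
        if line.strip() and not line.startswith("#"):
            f = line.split("\t")
            sites.add((f[0], f[1], f[3], f[4]))
    return sites

def drop_common(vcf_lines, common_sites):
    """Keep header lines plus records whose site key is not in the mask."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        f = line.split("\t")
        if (f[0], f[1], f[3], f[4]) not in common_sites:
            kept.append(line)
    return kept

# Synthetic example: the record at 1:100 is in the common set and dropped.
common = load_sites(["1\t100\trs1\tA\tG\t.\t.\t."])
calls = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT\t50\tPASS\t.",
]
kept = drop_common(calls, common)
```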

Picking a VQSR tranche for indels

Hi,

Given that there's no tranche plot generated for indels using VariantRecalibrator, how do we assess which tranche to pick for the next step, ApplyRecalibration? On SNP mode, I'm using tranche plots to evaluate the tradeoff between true and false positive rates at various tranche levels, but that's not possible with indels.
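For what it's worth, the .tranches file that VariantRecalibrator writes contains the numbers the SNP tranche plot is drawn from, so the tradeoff can still be inspected by hand. A hedged sketch, assuming the file is '#'-commented comma-separated text with a header row (the column names below are made up for illustration and may not match your GATK version):

```python
import csv
import io

def read_tranches(text):
    """Parse a VariantRecalibrator-style .tranches file: drop '#' comment
    lines, then read the remaining comma-separated table into dicts."""
    body = "\n".join(
        ln for ln in text.splitlines() if ln and not ln.startswith("#")
    )
    return list(csv.DictReader(io.StringIO(body)))

# Synthetic example; real column names may differ by GATK version.
demo = """# comment line written by the tool
targetTruthSensitivity,numNovel,minVQSLod
99.00,12000,1.52
99.90,15500,-0.73
"""
rows = read_tranches(demo)
```

Comparing novel-call counts between adjacent tranches gives a rough sense of how many extra (likely noisier) calls each sensitivity step admits.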

Thanks!

Grace

VQSR error: NaN LOD value assigned

INFO  17:05:50,124 GenomeAnalysisEngine - Preparing for traversal 
INFO  17:05:50,144 GenomeAnalysisEngine - Done preparing for traversal 
INFO  17:05:50,144 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  17:05:50,145 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
INFO  17:05:50,166 TrainingSet - Found hapmap track:    Known = false   Training = true     Truth = true    Prior = Q15.0 
INFO  17:05:50,166 TrainingSet - Found omni track:  Known = false   Training = true     Truth = false   Prior = Q12.0 
INFO  17:05:50,167 TrainingSet - Found dbsnp track:     Known = true    Training = false    Truth = false   Prior = Q6.0 
INFO  17:06:20,149 ProgressMeter -     1:216404576        2.04e+06   30.0 s       14.0 s      7.0%         7.2 m     6.7 m 
INFO  17:06:50,151 ProgressMeter -     2:223579089        4.70e+06   60.0 s       12.0 s     15.2%         6.6 m     5.6 m 
INFO  17:07:20,159 ProgressMeter -      4:33091662        7.43e+06   90.0 s       12.0 s     23.3%         6.4 m     4.9 m 
INFO  17:07:50,161 ProgressMeter -      5:92527959        1.00e+07  120.0 s       11.0 s     31.4%         6.4 m     4.4 m 
INFO  17:08:20,162 ProgressMeter -       7:1649969        1.30e+07    2.5 m       11.0 s     39.8%         6.3 m     3.8 m 
INFO  17:08:50,168 ProgressMeter -     8:106975025        1.58e+07    3.0 m       11.0 s     48.4%         6.2 m     3.2 m 
INFO  17:09:20,169 ProgressMeter -    10:101433561        1.87e+07    3.5 m       11.0 s     57.4%         6.1 m     2.6 m 
INFO  17:09:50,170 ProgressMeter -     12:99334147        2.16e+07    4.0 m       11.0 s     66.1%         6.1 m     2.1 m 
INFO  17:10:20,171 ProgressMeter -     15:30577012        2.41e+07    4.5 m       11.0 s     75.4%         6.0 m    88.0 s 
INFO  17:10:52,409 ProgressMeter -      18:8763648        2.68e+07    5.0 m       11.0 s     83.5%         6.0 m    59.0 s 
INFO  17:11:22,410 ProgressMeter -     22:31598896        2.97e+07    5.5 m       11.0 s     92.2%         6.0 m    27.0 s 
INFO  17:11:33,135 VariantDataManager - QD:      mean = 17.48    standard deviation = 9.03 
INFO  17:11:33,516 VariantDataManager - HaplotypeScore:      mean = 3.03     standard deviation = 2.62 
INFO  17:11:33,882 VariantDataManager - MQ:      mean = 52.40    standard deviation = 2.98 
INFO  17:11:34,253 VariantDataManager - MQRankSum:   mean = 0.31     standard deviation = 1.02 
INFO  17:11:37,973 VariantDataManager - Training with 1024360 variants after standard deviation thresholding. 
INFO  17:11:37,977 GaussianMixtureModel - Initializing model with 30 k-means iterations... 
INFO  17:11:53,065 ProgressMeter - GL000202.1:10465        3.08e+07    6.0 m       11.0 s     99.8%         6.0 m     0.0 s 
INFO  17:12:09,041 VariantRecalibratorEngine - Finished iteration 0. 
INFO  17:12:23,066 ProgressMeter - GL000202.1:10465        3.08e+07    6.5 m       12.0 s     99.8%         6.5 m     0.0 s 
INFO  17:12:30,492 VariantRecalibratorEngine - Finished iteration 5.    Current change in mixture coefficients = 0.08178 
INFO  17:12:51,054 VariantRecalibratorEngine - Finished iteration 10.   Current change in mixture coefficients = 0.05869 
INFO  17:12:53,072 ProgressMeter - GL000202.1:10465        3.08e+07    7.0 m       13.0 s     99.8%         7.0 m     0.0 s 
INFO  17:13:11,207 VariantRecalibratorEngine - Finished iteration 15.   Current change in mixture coefficients = 0.15237 
INFO  17:13:23,073 ProgressMeter - GL000202.1:10465        3.08e+07    7.5 m       14.0 s     99.8%         7.5 m     0.0 s 
INFO  17:13:31,503 VariantRecalibratorEngine - Finished iteration 20.   Current change in mixture coefficients = 0.13505 
INFO  17:13:51,768 VariantRecalibratorEngine - Finished iteration 25.   Current change in mixture coefficients = 0.05729 
INFO  17:13:53,080 ProgressMeter - GL000202.1:10465        3.08e+07    8.0 m       15.0 s     99.8%         8.0 m     0.0 s 
INFO  17:14:11,372 VariantRecalibratorEngine - Finished iteration 30.   Current change in mixture coefficients = 0.02607 
INFO  17:14:23,081 ProgressMeter - GL000202.1:10465        3.08e+07    8.5 m       16.0 s     99.8%         8.5 m     0.0 s 
INFO  17:14:24,730 VariantRecalibratorEngine - Convergence after 33 iterations! 
INFO  17:14:27,037 VariantRecalibratorEngine - Evaluating full set of 3860460 variants... 
INFO  17:14:51,111 VariantDataManager - Found 0 variants overlapping bad sites training tracks. 
INFO  17:14:55,071 VariantDataManager - Additionally training with worst 1000 scoring variants --> 1000 variants with LOD <= -30.5662. 
INFO  17:14:55,071 GaussianMixtureModel - Initializing model with 30 k-means iterations... 
INFO  17:14:55,082 VariantRecalibratorEngine - Finished iteration 0. 
INFO  17:14:55,095 VariantRecalibratorEngine - Convergence after 4 iterations! 
INFO  17:14:55,096 VariantRecalibratorEngine - Evaluating full set of 3860460 variants... 
INFO  17:15:02,071 GATKRunReport - Uploaded run statistics report to AWS S3 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.7-2-g6bda569): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --numBad 3000, for example).
##### ERROR ------------------------------------------------------------------------------------------

My command is :

java -Xmx4g -jar GenomeAnalysisTK-2.7-2-g6bda569/GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R human_g1k_v37.fasta \
   -input NA12878_snp.vcf \
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
   -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_132.b37.vcf \
   -an QD -an HaplotypeScore -an MQ -an MQRankSum \
   --maxGaussians 4 \
   -mode SNP \
   -recalFile NA12878_recal.vcf \
   -tranchesFile NA12878_tranches \
   -rscriptFile NA12878.plots.R

Initially I didn't use -maxGaussians 4; once an error suggested it, I tried it but still got this error message... And I think numBad is already deprecated, so I don't understand why this error happens. I'm running GATK UnifiedGenotyper on a 1000 Genomes high-coverage BAM file and then using VQSR to filter the SNPs.

VQSR plots

Hi,
I ran VQSR on the VCF file generated by UnifiedGenotyper; 63412 out of 86840 variants passed (the file contains both SNPs and indels, since I ran UnifiedGenotyper with -glm BOTH). I have two questions:

1) The number of PASS SNPs differs when I count it in two ways: first in the original UG output, and then after separating SNPs and indels into two files with an awk script.

grep -v "#" sample1_recalibrated_snps_PASS.vcf | grep -c "PASS"
63412
grep -v "#" sample1_merged_recalibrated_snps_raw_indels.vcf | grep -c "LowQual"
18725

Statistics for the separate SNP file (SNPs and indels were split with an awk script):

Everything else matches; the only problem is that the PASS SNP counts differ, and I don't understand why.

grep -v "^#" sample1_snp.vcf | grep -c "PASS"
63402
grep -v "^#" sample1_snp.vcf | grep -c "LowQual"
18725
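One thing worth checking with counts like these: `grep -c "PASS"` matches the substring anywhere on the line (INFO and sample fields included), which can make the two counting routes disagree. Testing the FILTER column itself avoids that; a small Python sketch with synthetic lines:

```python
def count_filter(vcf_lines, value):
    """Count records whose FILTER column (7th tab-separated field) equals
    `value` exactly, unlike `grep -c`, which matches a substring anywhere."""
    return sum(
        1
        for ln in vcf_lines
        if ln.strip() and not ln.startswith("#") and ln.split("\t")[6] == value
    )

# Synthetic records: the second line would also match `grep -c "PASS"`
# because "PASS" appears in its INFO field, despite FILTER being LowQual.
records = [
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT\t10\tLowQual\tnote=PASSenger",
]
```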

2) I ran VQSR on the SNPs generated by UnifiedGenotyper and have a question about the SNP tranche plot. In my case the tranche plot shows no false positive calls (see the attached plot). How should I interpret having no FPs? It seems surprising.

When I tried to run VQSR on the indels in the same file, it didn't work; I only have 884 indels, which, judging from the VQSR documentation and other people's questions, is too small a set.

Mills reference for indel VQSR

Hi all --

This should be a simple problem -- I cannot find a valid version of the Mills indel reference in the resource bundle, or anywhere else online!

All versions of the reference VCF are stripped of genotypes and do not contain a FORMAT column or any additional annotations.

I am accessing the Broad's public FTP, and none of the Mills VCF files in bundle folders 2.5 or 2.8 contains a full VCF. I understand that there are "sites-only" VCFs, but I can't seem to find anything else.

Can anyone link me to a version that contains the recommended annotations for indel VQSR, or that can be annotated?

VQSR bundle based on b37 and hg19

Hi,

Sorry to bother you guys. Just a few quick questions:

1) I'm attempting to download the bundles for VQSR and I noticed that they are for b37 or hg19. If I performed my initial assemblies and later SNP calls with hg38, will this cause an issue? Should I restart the process using either b37 or hg19?

2) I'm still a bit lost on what is considered "too few variants" for VQSR. As VQSR works best when there are thousands of variants, is this recommendation on a per-sample basis or for an entire project? I'm presently working with sequences from 80 unique samples for a single gene (~100kbp), and HaplotypeCaller detects on average ~300 raw SNPs. Would you recommend I hard filter instead in my case?
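On the hard-filtering option: with only ~300 SNPs, threshold-based filters are the usual fallback. The thresholds below follow the generic GATK hard-filtering suggestions of that era (QD < 2.0, FS > 60.0, MQ < 40.0), but treat them as assumptions to verify against the current docs; the sketch just shows the mechanics:

```python
# Fail conditions from the generic GATK hard-filtering suggestions of
# that era; verify the exact thresholds against current documentation.
SNP_HARD_FILTERS = {
    "QD": lambda v: v < 2.0,   # low quality-by-depth
    "FS": lambda v: v > 60.0,  # high Fisher strand bias
    "MQ": lambda v: v < 40.0,  # low RMS mapping quality
}

def hard_filter(info):
    """Return the names of failed filters for one record's INFO values;
    an empty list means the record would PASS. Missing annotations are
    skipped rather than treated as failures."""
    return [
        name
        for name, fails in SNP_HARD_FILTERS.items()
        if name in info and fails(info[name])
    ]
```

For example, a record with QD=1.5 fails only the QD filter, while one with QD=20, FS=5, MQ=60 passes everything.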

Thanks,

Dave

VQSR on single exome

Hi Geraldine,
Thanks for the webinar! You mentioned that VQSR isn't necessary for a single exome, but would there be any drawback to running it on a single exome? I see that it helps to set up the PASS filter.


VQSR on ~500 genomes

Hi,

I am working on the VQSR step (using GATK 2.8.1) on variants called by UG from ~500 whole genomes of cattle.
I ran VariantRecalibrator as follows:

${JAVA} ${GATK}/GenomeAnalysisTK.jar -T VariantRecalibrator \
-R ${REF} -input ${OUTPUT}/GATK-502-sorted.full.vcf.gz \
-resource:HD,known=false,training=true,truth=true,prior=15.0  HD_bosTau6.vcf \
-resource:JH_F1,known=false,training=true,truth=false,prior=10.0  F1_uni_idra_pp_trusted_only_LMQFS_bosTau6.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0  BosTau6_dbSNP138_NCBI.vcf \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an DP -an HaplotypeScore \
-mode SNP \
-recalFile ${OUTPUT}/gatk_502_sorted_fixed.recal \
-tranchesFile ${OUTPUT}/gatk_502_sorted_fixed.tranches \
-rscriptFile ${OUTPUT}/gatk_502_sorted_fixed.plots.R

HD_bosTau6.vcf : ~770k markers on Illumina bovine high-density chip array

F1_uni_idra_pp_trusted_only_LMQFS_bosTau6.vcf : ~5.4M SNPs

The tranches PDF I got looks really weird; please check the attached file.

Then I tried varying the 'prior' score of the training VCFs, and also supplied an additional VCF from another project as a training dataset, but I still got a similar tranches graph, e.g.:

-resource:HD,known=false,training=true,truth=true,prior=15.0  HD_bosTau6.vcf 
-resource:JH_F1,known=false,training=true,truth=false,prior=12.0  F1_uni_idra_pp_trusted_only_LMQFS_bosTau6.vcf 
-resource:DN,known=false,training=true,truth=false,prior=12.0  HC-Plat-FB.3in3.vcf.gz 
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0  BosTau6_dbSNP138_NCBI.vcf 

HC-Plat-FB.3in3.vcf.gz : ~ 14M markers

It is worth mentioning that I ran the VariantRecalibrator step with the same parameters and training sets on another 50 whole genomes very recently, and it worked fine.
I had also run VariantRecalibrator on the 500 animals before, when I accidentally used an unfiltered VCF called by UG as a training set. Surprisingly, I got a good tranches graph that time, similar to the graph posted in the GATK Best Practices.
Do you have any suggestion for me?

Thanks,

Bug in VariantRecalibrator

Hi there,

So for the SNV model in VariantRecalibrator, I was using QD, MQRankSum, ReadPosRankSum, and FS for a little while, and then decided to add MQ back in since I saw that the Best Practices were updated recently and MQ was back in.

However, when I added MQ back in and it went to train the negative model, it said it was training with 0 variants (the same data set without MQ in the model yielded ~30,000 variants for the negative training model). I have attached a text file with the base command line, followed by the log from the unsuccessful run and then the log from the successful run. The version is 3.1-1 and there are approximately 700 exomes.

Kurt

VQSR VariantRecalibrator Rplot

java -Xmx4g -Djava.io.tmpdir=temp/ -jar GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R hg19.fa \
   -input NA19240.raw.SNPs.vcf \
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.refmt.vcf \
   -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg19.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_138.b37.refmt.vcf \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an DP \
   -mode SNP \
   -recalFile NA19240.raw.SNPs.recal \
   -tranchesFile NA19240.raw.SNPs.tranches \
   -rscriptFile NA19240.snp.plots.R

However, no NA19240.snp.plots.R.pdf is generated, and I didn't find any error.
When I try to run NA19240.snp.plots.R in R with source('NA19240.snp.plots.R'), there is an error:
Error: Use 'theme' instead. (Defunct; last used in version 0.9.1)

How can I fix it? Thanks!!

VQSR on exome of small specific population

Hello,
I asked this question at the workshop in Brussels, and I would like to post it here:
I'm working on an exome analysis of a trio and would like to run VQSR filtering on the data. Since this is an exome project there are not many variants, and therefore, as I understand it, VQSR is not accurate. You suggest adding more data from 1000 Genomes or other published datasets.
The families I'm working on belong to a very small and specific population, and I'm afraid that adding published data will introduce a lot of noise.
What do you think: should I add more published data, change parameters such as maxGaussians, or do hard filtering?

Thanks,
Maya

VQSR filter

Hi,

I have exome sequencing data on 90 samples, and my lab uses the VQSR filter to remove low-quality variants. I was wondering if I should also perform a genotype-level filter by DP/GQ after this VQSR filtering step. Is there a recommended protocol, or some metrics I can look at to determine whether such a step is required?
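For context, a common ad hoc approach (not an official protocol) is to set individual genotypes to missing when DP or GQ is low; the 10x and GQ20 cutoffs below are frequently used but chosen here purely for illustration. A minimal sketch of the mechanics on a single sample field:

```python
def mask_genotype(sample_field, fmt, min_dp=10, min_gq=20):
    """Set a sample's GT to ./. when its DP or GQ falls below the cutoffs.
    Missing or '.' values are treated as failing the threshold."""
    keys = fmt.split(":")
    vals = sample_field.split(":")
    info = dict(zip(keys, vals))
    dp = int(info["DP"]) if info.get("DP", ".").isdigit() else 0
    gq = int(info["GQ"]) if info.get("GQ", ".").isdigit() else 0
    if dp < min_dp or gq < min_gq:
        vals[keys.index("GT")] = "./."
    return ":".join(vals)

# A low-depth call is masked; a confident call is left untouched:
print(mask_genotype("0/1:5:30", "GT:DP:GQ"))
print(mask_genotype("0/1:30:99", "GT:DP:GQ"))
```

Masking genotypes (rather than dropping whole sites) keeps the record for samples where the call is well supported.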

Thanks,
Shweta
