Channel: vqsr — GATK-Forum

Single Sample VQSR

Hi all!

I've got a question concerning VQSR.

The situation is as follows:
- I've got more than 100 single-sample VCFs
- Unfortunately I won't be able to re-call the VCFs
- Merging the files into a single multi-sample VCF is, in my opinion, a bad idea due to the loss of the information stored in the INFO field
- Creating multi-sample VCFs with the help of 1000G would require re-calling or merging, so this is also not an option.

Therefore, more or less just to see what happens, I specified multiple inputs for the VariantRecalibrator walker and was able to produce a recal and a tranches file. However, it's probably still a bad idea to use the recal file for recalibration, since it now contains multiple entries for the same variant (most likely because the same variant occurs in multiple single-sample VCFs?):

chr1 871334 . N . . END=871334;POSITIVE_TRAIN_SITE;VQSLOD=1.9214;culprit=MQRankSum
chr1 871334 . N . . END=871334;POSITIVE_TRAIN_SITE;VQSLOD=2.0305;culprit=MQ

I guess that during ApplyRecalibration it's not possible to decide which entry is the correct one for a variant in single-sample VCF X1. However, this would be crucial, since the entries show different VQSLOD values.
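To get a feel for how widespread the duplicate entries are, one could scan the recal file for sites that carry more than one distinct VQSLOD. The following is a minimal Python sketch and not part of GATK. The function name `find_conflicting_sites` is made up for illustration, and the code assumes that INFO is the last whitespace-separated column, as in the two recal lines quoted above.

```python
from collections import defaultdict


def find_conflicting_sites(recal_lines):
    """Return {(chrom, pos): [vqslod, ...]} for sites with more than one distinct VQSLOD."""
    scores = defaultdict(list)
    for line in recal_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split()
        chrom, pos, info = fields[0], int(fields[1]), fields[-1]
        for item in info.split(";"):
            if item.startswith("VQSLOD="):
                scores[(chrom, pos)].append(float(item.split("=", 1)[1]))
    # keep only sites whose entries disagree on the score
    return {site: vals for site, vals in scores.items() if len(set(vals)) > 1}


# The two entries quoted in the post
recal = [
    "chr1 871334 . N . . END=871334;POSITIVE_TRAIN_SITE;VQSLOD=1.9214;culprit=MQRankSum",
    "chr1 871334 . N . . END=871334;POSITIVE_TRAIN_SITE;VQSLOD=2.0305;culprit=MQ",
]
conflicts = find_conflicting_sites(recal)
for (chrom, pos), vals in conflicts.items():
    print(chrom, pos, vals)
```

For a real file, pass an open file handle instead of the list. The sketch only reports the conflicts; it does not decide which score belongs to which sample.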

So in my opinion, it's probably not possible to use VQSR in my specific case. However, since I would really like to use it, I thought maybe one of you knows a way to use it despite all the problems.

Thanks a lot!


VQSR PASS "Variants" with AC=0, AF=0, is this normal behavior?

I am using GATK 3.4-0 / VQSR to evaluate variants called in ~300 germline genomes of varying coverage (8x-40x). When I do so, many "variants" pass using even a stringent tranche, despite not having a QUAL value or having AC=0 / AF=0. Additionally, some PASS variants have data fields GT:AD:DP, and others have GT:AD:DP:GQ:PL. Why does this occur?

For instance, VQSR Tranche SNP 99.00 to 99.90: -4.4822 <= x < -0.1962

Example:
chr21 1419 . T G . PASS AC=0;AF=0.00;AN=524;BaseQRankSum=-1.733e+00;DP=5194;MQ=31.17;MQ0=0;MQRankSum=0.00;NCC=1;ReadPosRankSum=1.73;SNPEFF_EFFECT=INTERGENIC;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=MODIFIER;VQSLOD=1.67;culprit=DP GT:AD:DP 0/0:2,0:2 0/0:14,0:14 0/0:25,0:25 0/0:1,0:1 0/0:20,0:20 0/0:36,0:36 ./.:8,2 0/0:4,0:4 0/0:4,0:4 0/0:12,0:12 0/0:38,0:38 0/0:3,0:3 0/0:39,0:39 0/0:39,0:39 0/0:14,0:14 0/0:22,0:22 0/0:4,0:4 0/0:22,0:22 0/0:17,0:17 0/0:25,0:25 0/0:32,0:32 0/0:3,0:3 0/0:32,0:32 0/0:30,0:30 0/0:7,0:7 0/0:76,0:76 0/0:28,0:28 0/0:14,0:14 0/0:36,0:36 0/0:24,0:24 0/0:24,0:24 0/0:17,0:17 0/0:7,0:7 0/0:37,0:37 0/0:23,0:23 0/0:23,0:23 0/0:20,0:20 0/0:4,0:4 0/0:4,0:4 0/0:31,0:31 0/0:3,0:3 0/0:24,0:24 0/0:9,0:9 0/0:13,0:13 0/0:5,0:5 0/0:38,0:38 0/0:5,0:5 0/0:23,0:23 0/0:16,0:16 0/0:11,0:11 0/0:32,0:32 0/0:38,0:38 0/0:56,0:56 0/0:28,0:28 0/0:32,0:32 0/0:46,0:46 0/0:18,0:18 0/0:6,0:6 0/0:29,0:29 0/0:7,0:7 0/0:50,0:50 0/0:14,0:14 0/0:33,0:33 0/0:19,0:19 0/0:33,0:33 0/0:28,0:28 0/0:26,0:26 0/0:6,0:6 0/0:19,0:19 0/0:19,0:19 0/0:26,0:26 0/0:23,0:23 0/0:16,0:16 0/0:26,0:26 0/0:22,0:22 0/0:20,0:20 0/0:18,0:18 0/0:18,0:18 0/0:29,0:29 0/0:1,0:1 0/0:34,0:34 0/0:67,0:67 0/0:22,0:22 0/0:16,0:16 0/0:37,0:37 0/0:19,0:19 0/0:5,0:5 0/0:23,0:23 0/0:26,0:26 0/0:24,0:24 0/0:31,0:31 0/0:27,0:27 0/0:23,0:23 0/0:33,0:33 0/0:6,0:6 0/0:42,0:42 0/0:9,0:9 0/0:7,0:7 0/0:34,0:34 0/0:16,0:16 0/0:28,0:28 0/0:5,0:5 0/0:11,0:11 0/0:25,0:25 0/0:39,0:39 0/0:22,1:23 0/0:28,0:28 0/0:15,0:15 0/0:30,0:30 0/0:19,0:19 0/0:5,0:5 0/0:2,0:2 0/0:2,0:2 0/0:8,0:8 0/0:3,0:3 0/0:16,0:16 0/0:31,0:31 0/0:5,0:5 0/0:26,0:26 0/0:20,0:20 0/0:14,0:14 0/0:23,0:23 0/0:49,0:49 0/0:14,0:14 0/0:23,0:23 0/0:48,0:48 0/0:52,1:53 0/0:5,0:5 0/0:6,0:6 0/0:31,0:31 0/0:44,0:44 0/0:7,0:7 0/0:10,0:10 0/0:18,0:18 0/0:14,1:15 0/0:11,0:11 0/0:16,0:16 0/0:17,0:17 0/0:22,0:22 0/0:24,1:25 0/0:15,0:15 0/0:21,0:21 0/0:11,0:11 0/0:12,0:12 0/0:5,1:6 0/0:51,0:51 0/0:40,0:40 0/0:63,0:63 0/0:49,0:49 0/0:73,0:73 0/0:90,0:90 0/0:38,0:38 
0/0:56,0:56 0/0:43,0:43 0/0:39,0:39 0/0:45,0:45 0/0:41,0:41 0/0:58,0:58 0/0:22,0:22 0/0:25,0:25 0/0:28,0:28 0/0:30,0:30 0/0:17,0:17 0/0:5,0:5 0/0:2,0:2 0/0:6,0:6 0/0:6,0:6 0/0:16,0:16 0/0:10,0:10 0/0:4,0:4 0/0:11,0:11 0/0:5,0:5 0/0:3,0:3 0/0:2,0:2 0/0:2,0:2 0/0:10,0:10 0/0:22,0:22 0/0:15,0:15 0/0:13,0:13 0/0:22,0:22 0/0:10,0:10 0/0:17,0:17 0/0:15,0:15 0/0:14,0:14 0/0:33,0:33 0/0:14,0:14 0/0:22,0:22 0/0:21,0:21 0/0:22,1:23 0/0:20,0:20 0/0:18,0:18 0/0:14,0:14 0/0:29,0:29 0/0:12,0:12 0/0:21,0:21 0/0:13,0:13 0/0:23,0:23 0/0:21,0:21 0/0:20,0:20 0/0:13,0:13 0/0:9,0:9 0/0:13,0:13 0/0:14,0:14 0/0:23,0:23 0/0:19,0:19 0/0:14,0:14 0/0:19,0:19 0/0:22,0:22 0/0:10,0:10 0/0:15,0:15 0/0:26,0:26 0/0:30,0:30 0/0:16,0:16 0/0:9,0:9 0/0:12,0:12 0/0:12,0:12 0/0:11,0:11 0/0:17,0:17 0/0:10,0:10 0/0:12,0:12 0/0:16,0:16 0/0:9,0:9 0/0:9,0:9 0/0:11,0:11 0/0:14,0:14 0/0:3,0:3 0/0:12,0:12 0/0:5,0:5 0/0:4,0:4 0/0:9,0:9 0/0:4,0:4 0/0:19,0:19 0/0:21,0:21 0/0:6,0:6 0/0:7,0:7 0/0:10,0:10 0/0:13,0:13 0/0:11,0:11 0/0:2,0:2 0/0:9,0:9 0/0:9,0:9 0/0:5,0:5 0/0:6,0:6 0/0:32,0:32 0/0:7,0:7 0/0:12,0:12 0/0:8,0:8 0/0:2,0:2 0/0:19,0:19 0/0:9,0:9 0/0:3,0:3 0/0:8,0:8 0/0:19,0:19 0/0:7,0:7 0/0:9,0:9 0/0:14,0:14 0/0:7,0:7 0/0:8,0:8 0/0:5,0:5 0/0:11,0:11 0/0:14,0:14 0/0:8,1:9 0/0:12,0:12

chr21 1432 . C T 1609.66 PASS AC=5;AF=9.542e-03;AN=524;BaseQRankSum=3.28;DP=6194;FS=0.000;GQ_MEAN=60.76;GQ_STDDEV=44.35;InbreedingCoeff=-0.0114;MLEAC=5;MLEAF=9.542e-03;MQ=39.30;MQ0=0;MQRankSum=0.780;NCC=1;QD=13.64;ReadPosRankSum=-5.450e-01;SNPEFF_EFFECT=INTERGENIC;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=MODIFIER;SOR=0.627;VQSLOD=0.834;culprit=FS GT:AD:DP:GQ:PL 0/0:2,0:2:6:0,6,49 0/0:35,0:35:99:0,102,1530 0/0:39,0:39:93:0,93,1395 0/0:7,0:7:18:0,18,270 0/0:24,0:24:72:0,72,637 0/0:40,0:40:99:0,108,1620 0/0:2,0:2:6:0,6,47 0/0:7,0:7:21:0,21,184 0/0:6,0:6:18:0,18,160 0/0:17,1:18:23:0,23,376 0/0:24,0:24:66:0,66,990 0/0:9,0:9:21:0,21,315 0/0:51,0:51:99:0,120,1800 0/0:39,0:39:99:0,108,1620 0/0:9,0:9:27:0,27,240 0/0:18,0:18:48:0,48,720 0/0:7,0:7:18:0,18,270 0/0:24,0:24:69:0,69,1035 0/0:31,0:31:78:0,78,1170 0/1:18,18:.:99:492,0,409 0/0:30,0:30:84:0,84,1260 0/0:16,0:16:48:0,48,437 0/0:25,0:25:66:0,66,990 0/0:27,0:27:63:0,63,945 0/0:11,0:11:33:0,33,284 0/0:82,0:82:99:0,120,1800 0/0:63,0:63:99:0,120,1800 0/1:2,7:.:19:188,0,19 0/0:45,0:45:99:0,120,1800 0/0:35,0:35:90:0,90,1350 0/0:47,0:47:99:0,120,1800 0/0:16,0:16:48:0,48,451 0/0:19,0:19:45:0,45,675 0/0:47,0:47:99:0,120,1800 0/0:30,0:30:72:0,72,1080 0/0:22,0:22:54:0,54,810 0/0:7,0:7:18:0,18,270 0/0:2,0:2:6:0,6,63 0/0:5,0:5:12:0,12,180 0/0:66,0:66:99:0,120,1800 0/0:6,0:6:15:0,15,225 0/0:28,0:28:75:0,75,1125 0/0:13,1:14:11:0,11,313 0/0:12,0:12:33:0,33,495 0/0:16,0:16:45:0,45,675 0/0:58,0:58:99:0,120,1800 0/0:24,0:24:66:0,66,990 0/0:27,0:27:78:0,78,1170 0/0:25,0:25:66:0,66,990 0/0:25,0:25:75:0,75,660 0/0:41,0:41:99:0,114,1710 0/0:24,0:24:60:0,60,900 0/0:48,0:48:99:0,120,1800 0/0:44,0:44:99:0,120,1800 0/0:30,0:30:81:0,81,1215 0/0:48,0:48:99:0,120,1800 0/0:23,0:23:69:0,69,605 0/0:3,0:3:9:0,9,75 0/0:33,0:33:84:0,84,1260 0/0:8,0:8:21:0,21,315 0/0:75,0:75:99:0,120,1800 0/0:31,0:31:87:0,87,1305 0/0:39,0:39:99:0,111,1665 0/0:15,1:16:20:0,20,356 0/0:61,0:61:99:0,120,1800 0/0:28,0:28:75:0,75,1125 0/0:38,0:38:99:0,102,1530 
0/0:11,0:11:30:0,30,450 0/0:15,0:15:45:0,45,449 0/0:24,0:24:72:0,72,633 0/0:29,0:29:84:0,84,1260 0/0:27,0:27:72:0,72,1080 0/0:19,0:19:54:0,54,810 0/0:24,0:24:72:0,72,634 0/0:21,0:21:57:0,57,855 0/0:18,0:18:51:0,51,765 0/0:14,0:14:39:0,39,585 0/0:20,0:20:48:0,48,720 0/0:27,0:27:72:0,72,1080 0/0:2,0:2:3:0,3,45 0/0:32,0:32:81:0,81,1215 0/0:65,0:65:99:0,120,1800 0/0:22,0:22:66:0,66,581 0/0:15,0:15:39:0,39,585 0/0:56,0:56:99:0,120,1800 0/1:13,17:.:99:474,0,273 0/0:8,0:8:21:0,21,315 0/0:23,1:24:60:0,60,580 0/0:29,0:29:75:0,75,1125 0/0:36,0:36:96:0,96,1440 0/0:31,0:31:90:0,90,1350 0/0:29,0:29:78:0,78,1170 0/0:26,0:26:69:0,69,1035 0/0:33,0:33:87:0,87,1305 0/0:13,0:13:30:0,30,450 0/0:64,0:64:99:0,120,1800 0/0:8,0:8:24:0,24,201 0/0:12,0:12:36:0,36,342 0/0:38,0:38:93:0,93,1395 0/0:13,0:13:30:0,30,450 0/0:38,0:38:99:0,105,1575 0/0:6,0:6:15:0,15,225 0/0:6,0:6:18:0,18,141 0/0:26,0:26:75:0,75,1125 0/0:63,0:63:99:0,120,1800 0/0:9,0:9:24:0,24,360 0/0:22,0:22:63:0,63,945 0/0:17,0:17:45:0,45,675 0/0:54,0:54:99:0,120,1800 0/0:18,0:18:45:0,45,675 0/0:9,0:9:24:0,24,360 0/0:2,0:2:6:0,6,63 0/0:5,0:5:12:0,12,180 0/0:11,0:11:30:0,30,450 0/0:4,0:4:9:0,9,135 0/0:11,0:11:30:0,30,450 0/0:33,0:33:87:0,87,1305 0/0:4,0:4:12:0,12,100 0/0:23,0:23:60:0,60,900 0/0:29,0:29:78:0,78,1170 0/0:18,0:18:45:0,45,675 0/1:14,15:.:99:392,0,276 0/0:65,0:65:99:0,120,1800 0/0:31,0:31:78:0,78,1170 0/0:44,0:44:99:0,117,1755 0/0:93,0:93:99:0,120,1800 0/0:75,0:75:99:0,120,1800 0/0:6,0:6:18:0,18,155 0/0:7,0:7:15:0,15,225 0/0:33,0:33:93:0,93,1395 0/0:45,0:45:99:0,120,1800 0/0:17,0:17:45:0,45,675 0/0:14,0:14:42:0,42,377 0/0:28,0:28:78:0,78,1170 0/0:14,0:14:42:0,42,392 0/0:15,0:15:45:0,45,384 0/0:14,0:14:39:0,39,585 0/0:24,0:24:66:0,66,99 0/0:18,0:18:51:0,51,765 0/0:17,0:17:42:0,42,630 0/0:24,0:24:63:0,63,945 0/0:25,0:25:72:0,72,1080 0/0:16,0:16:48:0,48,413 0/0:14,0:14:36:0,36,54 0/0:6,0:6:15:0,15,225 0/0:56,0:56:99:0,120,1800 0/0:47,1:48:99:0,120,1800 0/0:68,0:68:99:0,120,1800 0/0:62,1:63:99:0,120,1800 
0/0:64,0:64:99:0,120,1800 0/0:108,1:109:99:0,120,1800 0/0:60,2:62:99:0,120,1800 0/0:76,0:76:99:0,120,1800 0/0:38,0:38:99:0,108,1620 0/0:45,0:45:99:0,120,1800 0/0:39,1:40:87:0,87,1009 0/0:48,0:48:99:0,120,1800 0/0:33,0:33:90:0,90,1350 0/0:29,0:29:75:0,75,1125 0/0:33,0:33:81:0,81,1215 0/0:29,0:29:87:0,87,759 0/0:28,0:28:69:0,69,1035 0/0:22,0:22:60:0,60,900 0/0:7,0:7:21:0,21,183 0/0:6,0:6:15:0,15,225 0/0:7,0:7:15:0,15,225 0/0:15,0:15:36:0,36,540 0/0:9,1:10:4:0,4,195 0/0:7,0:7:21:0,21,175 0/0:11,0:11:21:0,21,315 0/0:4,0:4:9:0,9,135 0/0:9,0:9:27:0,27,239 0/0:5,0:5:12:0,12,180 0/0:7,0:7:21:0,21,197 0/0:7,0:7:18:0,18,270 0/0:13,0:13:30:0,30,450 0/0:18,0:18:51:0,51,765 0/0:18,0:18:51:0,51,765 0/0:17,0:17:51:0,51,442 0/0:21,0:21:54:0,54,810 0/0:19,0:19:51:0,51,765 0/0:22,0:22:57:0,57,855 0/0:16,0:16:45:0,45,675 0/0:22,0:22:66:0,66,594 0/0:30,0:30:75:0,75,1125 0/0:23,0:23:63:0,63,945 0/0:8,0:8:24:0,24,237 0/0:21,0:21:57:0,57,855 0/0:14,0:14:42:0,42,389 0/0:14,0:14:39:0,39,585 0/0:13,0:13:36:0,36,540 0/0:15,0:15:45:0,45,371 0/0:11,0:11:33:0,33,266 0/0:24,0:24:69:0,69,1035 0/0:21,0:21:57:0,57,855 0/0:16,0:16:42:0,42,630 0/0:22,0:22:66:0,66,579 0/0:19,0:19:54:0,54,810 0/0:19,0:19:54:0,54,810 0/0:22,0:22:63:0,63,945 0/0:17,0:17:45:0,45,675 0/0:20,0:20:60:0,60,534 0/0:12,0:12:36:0,36,323 0/0:16,0:16:42:0,42,630 0/0:18,0:18:45:0,45,675 0/0:20,0:20:54:0,54,810 0/0:23,0:23:63:0,63,945 0/0:7,0:7:21:0,21,188 0/0:18,0:18:51:0,51,765 0/0:9,0:9:27:0,27,241 0/0:16,0:16:48:0,48,439 0/0:21,0:21:51:0,51,765 0/0:13,0:13:30:0,30,450 0/0:11,0:11:33:0,33,301 0/0:15,0:15:42:0,42,630 0/0:5,0:5:12:0,12,180 0/0:10,0:10:27:0,27,405 0/0:22,0:22:57:0,57,855 0/0:10,0:10:30:0,30,268 0/0:9,0:9:21:0,21,315 0/0:18,0:18:42:0,42,630 0/0:9,0:9:24:0,24,360 0/0:10,0:10:27:0,27,405 0/0:12,0:12:33:0,33,495 0/0:21,0:21:57:0,57,855 0/0:7,0:7:18:0,18,270 0/0:23,0:23:60:0,60,900 0/0:16,0:16:48:0,48,433 0/0:18,0:18:48:0,48,720 0/0:9,0:9:27:0,27,241 0/0:11,0:11:33:0,33,269 0/0:22,0:22:57:0,57,855 0/0:22,0:22:60:0,60,900 
0/1:7,7:.:99:185,0,123 0/0:15,0:15:42:0,42,630 0/0:16,0:16:39:0,39,585 0/0:21,0:21:54:0,54,810 0/0:13,0:13:36:0,36,540 0/0:14,0:14:39:0,39,585 0/0:14,0:14:42:0,42,380 0/0:12,0:12:33:0,33,495 0/0:8,0:8:24:0,24,199 0/0:10,0:10:27:0,27,405 0/0:36,0:36:93:0,93,1395 0/0:8,0:8:24:0,24,170 0/0:10,0:10:30:0,30,248 0/0:13,0:13:39:0,39,305 ./.:0,0:0 0/0:24,0:24:69:0,69,1035 0/0:17,0:17:42:0,42,630 0/0:12,0:12:36:0,36,298 0/0:32,0:32:90:0,90,1350 0/0:22,0:22:66:0,66,535 0/0:9,0:9:27:0,27,216 0/0:17,0:17:42:0,42,630 0/0:17,0:17:39:0,39,585 0/0:11,0:11:30:0,30,450 0/0:18,0:18:54:0,54,417 0/0:11,0:11:27:0,27,405 0/0:17,1:18:40:0,40,404 0/0:22,0:22:63:0,63,945 0/0:6,0:6:18:0,18,162 0/0:14,1:15:34:0,34,351
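To count how many PASS records look like the first example (no QUAL, AC=0), a small script can help before digging into why they occur. This is a minimal Python sketch under stated assumptions: the input is a tab-delimited VCF, the helper names `parse_info` and `suspicious_pass_records` are hypothetical, and the two test records are trimmed versions of the records above with a single sample column.

```python
def parse_info(info):
    """Turn 'AC=0;AF=0.00;DB' into {'AC': '0', 'AF': '0.00', 'DB': True}."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def suspicious_pass_records(vcf_lines):
    """Yield (chrom, pos, reasons) for PASS records with a missing QUAL or AC of zero."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        chrom, pos, qual, flt, info = f[0], int(f[1]), f[5], f[6], parse_info(f[7])
        if flt != "PASS":
            continue
        reasons = []
        if qual == ".":
            reasons.append("no QUAL")
        ac = info.get("AC")
        # AC may hold one count per ALT allele, e.g. "0,3"
        if isinstance(ac, str) and all(a == "0" for a in ac.split(",")):
            reasons.append("AC=0")
        if reasons:
            yield chrom, pos, ",".join(reasons)


# Trimmed versions of the two records quoted in the post
records = [
    "chr21\t1419\t.\tT\tG\t.\tPASS\tAC=0;AF=0.00;AN=524;VQSLOD=1.67;culprit=DP\tGT:AD:DP\t0/0:2,0:2",
    "chr21\t1432\t.\tC\tT\t1609.66\tPASS\tAC=5;AF=9.542e-03;AN=524;VQSLOD=0.834;culprit=FS\tGT:AD:DP:GQ:PL\t0/0:2,0:2:6:0,6,49",
]
flagged = list(suspicious_pass_records(records))
print(flagged)
```

Only the first record is flagged here. The sketch reports such sites; it does not explain where they come from.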

VQSLOD

We are doing WES on hundreds of samples and the average sequencing depth is 60X. I used a sensitivity of 99.9% for the PASS filter in VQSR. Attached is the histogram of VQSLOD for around 900,000 SNPs with PASS filter and call rate > 95%. I wonder if you can help me with two questions?
1. Is it normal to have half of the SNPs with VQSLOD < 0?
2. Why are there so few SNPs around VQSLOD = 0?

"No data found" error came up when performing VQSR on SNPs

Hi,

I am doing VQSR with -mode SNP, and it came up with an error saying "No data found". Here is my command line:

/path/to/java -Xmx7g -Djava.io.tmpdir=pwd/tmp \
-jar /path/to/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R /path/to/Data/bundle_2.8_hg19/ucsc.hg19.fasta \
-input /path/to/VQSR_0722/All8sample.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 /path/to/Data/bundle_2.8_hg19/hapmap_3.3.hg19.vcf \
-resource:omni,known=false,training=true,truth=false,prior=12.0 /path/to/Data/bundle_2.8_hg19/1000G_omni2.5.hg19.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /ifshk1/BC_MD/GROUP/Workflow/Data/bundle_2.8_hg19/dbsnp_138.hg19.vcf \
-an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum \
-mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
-recalFile /path/to/VQSR_0722/All8sample.snp.VQSR.recal \
-tranchesFile /path/to/VQSR_0722/All8sample.snp.VQSR.tranches \
-rscriptFile /path/to/VQSR_0722/All8sample.snp.VQSR.plot.R

I noticed a line in the log file saying "INFO 08:08:09,495 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.". This is followed by the error message given below:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:399)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:143)
at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.

I previously got the "No data found" error when performing VQSR with -mode INDEL, and I know it was because too few indel variants overlapped with the training data. But this time the mode is SNP. I counted the SNPs in my dataset (awk 'length ($4)==1 && length ($5)==1' /path/to/VQSR_0722/All8sample.vcf | wc -l) and there are 184,540. I wonder if the cause is the same as for the INDEL VQSR error?
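Since the total SNP count alone does not show how many of those SNPs are also present in the training resources, a complementary check is to count the overlap directly. The snippet below is a minimal Python sketch, not a GATK feature. The names `snp_sites` and `training_overlap` and the toy records are hypothetical; sites are matched on the literal contig name and position, so a naming mismatch such as `chr1` versus `1` would show up as zero overlap.

```python
def snp_sites(vcf_lines):
    """Collect (chrom, pos) for records with single-base REF and ALT.

    Same length test as the awk one-liner, but header lines and
    records whose ALT is '.' are skipped as well.
    """
    sites = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        if len(f[3]) == 1 and len(f[4]) == 1 and f[4] != ".":
            sites.add((f[0], int(f[1])))
    return sites


def training_overlap(callset_lines, resource_lines):
    """Return (number of SNP sites called, number also present in the resource)."""
    calls = snp_sites(callset_lines)
    return len(calls), len(calls & snp_sites(resource_lines))


# Toy records for illustration only
callset = [
    "chr1\t100\t.\tA\tG\t50\t.\t.",
    "chr1\t200\t.\tC\tT\t50\t.\t.",
    "chr1\t300\t.\tG\tGA\t50\t.\t.",   # indel, ignored
]
resource = [
    "chr1\t200\trs1\tC\tT\t.\t.\t.",
    "chr1\t400\trs2\tA\tC\t.\t.\t.",
]
total, shared = training_overlap(callset, resource)
print(total, shared)
```

With real data, pass open file handles for the callset and for each resource marked training=true. A very small shared count would be consistent with the training-overlap explanation, although this check cannot confirm it.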

Thank you!

Emma

Is it necessary to perform additional quality filter to remove low quality reads and barcode contami

Hi all,

I went through the whole variant calling pipeline on my whole exome sequencing data. Now I have three questions.

Q1. Is it necessary to perform an additional quality filter to remove low-quality reads and barcode contamination before mapping? Since deduplication and BQSR happen downstream, can I assume that the effects of low-quality bases and barcode contamination will be eliminated in those steps?

Q2. Is joint calling better than calling variants individually? We aim to find pathological mutations by comparing SNPs between the affected and the normal individuals in one family. For each family, we have data sets from 3-4 individuals. I marked each individual with a different @RG tag. In my first trial, I just used the basic command, calling SNPs one sample at a time. I learned that VCF mode accepts multiple bam files: I can type -I No1.bam -I No2.bam -I No3.bam -I .... But gVCF mode only accepts one bam file at a time, so I should merge multiple bams using 'printreads' before using 'HaplotypeCaller'. My confusion is that 'BaseRecalibrator' only accepts one bam file and outputs one BQSR table at a time. So should I 'cat' all the tables and use the result as -BQSR for 'printreads'? Which will be better: still using VCF mode with multiple bam files at a time, or merging multiple bam files in advance and doing gVCF calling?

Q3. Should I use hard filters instead of VQSR? Though we are working on whole exome data, we are analyzing fewer than 30 samples at a time. I saw in one of your answers that the minimum sample number should reach 30 to fit the Gaussian model. Though no error was reported when I ran VQSR in my first trial, the Ti/Tv value in my tranches files came out bad and the model plots looked different from your example in the Best Practices. So I think maybe I should just use hard filters then?

Does VQSR filter out known variants with low score?

Does VQSR filter out known variants with low score or preserve them?

How does the "-aggregate" argument in VariantRecalibrator compare to more samples genotyped together

Dear GATK-team,

I've tried to search for the answer to this question on the guidelines and forums pages, but I haven't been able to figure it out. I apologize if I'm missing something that should be obvious from the documentation.

So, I'm familiar with the current best practices for DNA-seq variant discovery with HC, call GVCFs and VQSR, and the requirement to have ample data for building the model in VQSR. To get enough data, one might add in extra variants, which you recommend doing in the CALLING stage.

I have a "ploidy 20"-dataset of several hundred samples where calling for practical computational purposes needs to be done in batches to avoid memory crash. But I'd nevertheless like to use all the variants for optimal VQSR. It looks like this might be done with the --aggregate argument in VariantRecalibrator by adding in raw VCFs from all batches in that stage. Would this really differ significantly from a workflow where all samples were called together? Why is the "--aggregate" option never mentioned in your advice on how to achieve a VQSR-worthy dataset?

Thanks for a great resource and website
Best regards
Lasse

General variant detection pipeline

I'm a bit uncertain as to the optimal pipeline for calling variants. I've sequenced a population sample of ~200 at high coverage ~30X, with no prior information on nucleotide variation.

The most rigorous pipeline would seem to be:
1. Call variants with UG on 'raw' (realigned) bams.
2. Extract out high-confidence variants (high QUAL, high DP, not near indels or repeats, high MAF)
3. Perform BQSR using the high-confidence variants.
4. Call variants with HaplotypeCaller on recalibrated bams.
5. Perform VQSR using high-confidence variants.
6. Any other hard filters.

Is this excessive? Does using HaplotypeCaller negate the use of *QSR? Is it worthwhile performing VQSR if BQSR hasn't been done? Otherwise I'm just running HaplotypeCaller on un-recalibrated bams, and then hard-filtering.


DP parameter VQSR for Targeted Exome Sequencing

variants filtering VQSR

Hi,

I ran VQSR with 30 samples and a tranche filter level of 99.0 for both SNPs and INDELs. Around 82% of my variant calls pass the filter. Is there any standard that can be used to evaluate how good the filtering result is?

Besides, if I split the gVCFs by chromosome and run joint genotyping (GenotypeGVCFs) per chromosome, will that affect the result compared to running on the whole genome?

Thanks,
jf

Is a variant with a VQSR tag in FILTER a false positive?

Hi.
I wonder whether, after VQSR, the variants that have a VQSR tag in the FILTER column should be filtered out, including VQSRTrancheINDEL99.00 to 99.90, VQSRTrancheINDEL99.90 to 100.00+, VQSRTrancheINDEL99.90 to 100.00, VQSRTrancheSNP99.90 to 100.00, and VQSRTrancheSNP99.90 to 100.00+. Does this mean they are false positive variants?
And what does 100.00+ mean?
Can I treat only the "PASS" variants in the FILTER column as true positive variants?
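As a rough illustration of how a VQSLOD score could map onto these FILTER labels, here is a simplified Python sketch. It reflects one reading of the labelling scheme and is not taken from GATK's code, so treat it as an assumption. The function `filter_label` is hypothetical. The 99.0 and 99.9 cutoffs are borrowed from the tranche line quoted in an earlier post in this feed (-4.4822 <= x < -0.1962), while the -40.0 cutoff for the 100.0 tranche is invented for the example.

```python
def filter_label(vqslod, tranches, mode="SNP"):
    """Map a VQSLOD score to a FILTER label.

    tranches: list of (truth_sensitivity, min_vqslod), ordered from the
    PASS level (most stringent) to the most permissive tranche.
    """
    for i, (sens, min_lod) in enumerate(tranches):
        if vqslod >= min_lod:
            if i == 0:
                return "PASS"
            return f"VQSRTranche{mode}{tranches[i - 1][0]:.2f}to{sens:.2f}"
    # Assumption: scores below every cutoff get the last tranche name plus "+"
    return f"VQSRTranche{mode}{tranches[-2][0]:.2f}to{tranches[-1][0]:.2f}+"


# (sensitivity, minimum VQSLOD); the last cutoff is a made-up value
TRANCHES = [(99.0, -0.1962), (99.9, -4.4822), (100.0, -40.0)]
labels = [filter_label(x, TRANCHES) for x in (1.67, -1.0, -10.0, -50.0)]
for x, label in zip((1.67, -1.0, -10.0, -50.0), labels):
    print(x, label)
```

Under this reading, a tranche label says how far down the score ranking a variant sits relative to the truth set. It is a statement about sensitivity, not proof that the variant is a false positive.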

information about GATK "Filter" field

Dear doctor,
I'm analyzing the VCF files obtained with your GATK variant calling tool. I'm new to this kind of analysis, and I would like to kindly ask you for some additional information about filtering the VCF file according to the "Filter" field. In detail, I read that variants above the defined VQSLOD threshold pass the filter, so their FILTER field contains PASS, while variants below the threshold are filtered out; however, they are still written to the output file, with the name of the tranche they belong to in the FILTER field. I also read about the tranches, which correspond to different levels of sensitivity and accuracy of the variants. In my VCF file, I have 5 groups of variants according to the "Filter" field:
- LowQual
- PASS
- VQSRTrancheINDEL99.00to99.90
- VQSRTrancheSNP99.00to99.90
- VQSRTrancheSNP99.90to100.00

I can't fully understand the meaning of the tranches, and the probability of false positives for each one. I would like to kindly ask you how I can interpret the three categories defined as "VQSRTranche", and if you suggest to remove or not these variants for subsequent analyses, and consider only the "PASS" filter.
Thank you in advance for your time and kind attention.
Best regards,

Paola Orsini

VQSR with custom annotations?

Is it possible to run VQSR with custom annotations? I added some additional columns (not produced by GATK) to the INFO field of my VCF file, and included them with the "-an" flag to VariantRecalibrator. The VariantRecalibratorEngine reports convergence, but then I get this:

INFO  12:55:25,499 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
##### ERROR stack trace
java.lang.IllegalArgumentException: No data found.
        at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
        at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:408)
        at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:156)
        at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
        at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
        at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
        at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

If I run VariantRecalibrator without the custom annotations but everything else the same, it works. I'm using GATK v3.4-46-gbc02625.

Developing truth sets for VQSR

I am working with macaque WGS and WES data and I'm trying to implement VQSR effectively. In the NHP genomics community there aren't large databases of validated true variants, so we're attempting to create our own sets using high-quality amplicon and genotyping data that exist within our own lab. My question is: how many variants does VQSR need to build an effective Gaussian mixture model? Do you have any advice on how to develop a reasonably sized truth set in the absence of databases like dbSNP, HapMap, etc.?

Thank you,
Trent

When VQSR run using Whole genome sequencing data.

Hi.
I ran GATK (version 3.4-46) on whole genome sequencing data from 60 samples.
Using bwa, I aligned my WGS data to the human reference genome (GRCh37), including autosomes, X, Y, MT, and GL_****.
I ran HaplotypeCaller and then, using GenotypeGVCFs, extracted a VCF for chr1 - chr22, chrX, chrY.
I wonder what "WGS" means in the GATK Best Practices recommendations.
When GATK (VQSR, HC, et al.) is used, to get the best result, should the WGS data consist of autosomes, X, Y, MT and GL_*****?
Or could it contain only the autosomes, excluding X and Y?


Strange VQSR plots after version 3.5 release

I ran GATK's cohort genotyping pipeline on 5000 human samples with Illumina WGS ~1.3x data, up through GenotypeGVCFs (and CatVariants to combine chunks) using v3.4-46. Next I ran VariantRecalibrator (initially just chr1) using recommended settings with both v3.4-46 and v3.5. Here is my command for both versions:

java -Xmx40g \
-jar GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R hs37m.fa \
-input gatk.hc.combined.genotyped.chr1-22.vcf.gz \
-recalFile snps.recal \
-tranchesFile snps.tranches \
-rscriptFile recalibrate_SNP_plots.R \
--target_titv 2.15 \
-nt 24 \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf.gz \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf.gz \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf.gz \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an InbreedingCoeff \
-mode SNP \
-L 1 \
-tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 98.5 -tranche 90.0 \
--maxGaussians 6 \
-log VariantRecalibrator.snps.log

The attached tranches plots (snps.tranches.v3.5.pdf) generated w/ v3.5 look strange because:

1) The tranches are out of order on the bar plot (e.g., 99.5 is before 99)
2) The fill coloring doesn't make sense for tranches 99 and 98.5 - there are orange stripes over the blue bar
3) The scatter plot's connecting lines go in both directions

The plots for v3.4-46 look more normal (snps.tranches.v3.4-46.pdf), though I'm still trying to figure out how to get closer to the expected 2.15 Ti/Tv ratio. Oddly, the Ti/Tv ratios differ slightly between v3.4-46 and v3.5 even though the same data and settings were used.

I suspected the behavior w/ v3.5 may be a possible bug in VariantRecalibrator, which is why I'm posting here. Please let me know if you need any more information.
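For comparing the two versions independently of the tranche plots, the Ti/Tv ratio of any set of SNPs can be recomputed by hand. The following minimal Python sketch shows the standard transition/transversion count; it is not how VariantRecalibrator produces its plots, and the function name `titv_ratio` and the toy input are made up for illustration.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snps):
    """snps: iterable of (ref, alt) pairs. Returns Ti/Tv, or None without transversions."""
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue  # not a SNP
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None


# Toy input: 4 transitions and 2 transversions
snps = [("A", "G"), ("C", "T"), ("T", "C"), ("G", "A"), ("A", "C"), ("G", "T")]
ratio = titv_ratio(snps)
print(ratio)
```

Applying this to the PASS SNPs from each version's output would show whether the small Ti/Tv difference is in the calls themselves or only in the plotting.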

My best,

Chris

VQSR Annotations

I'm curious about the experience of the community at large with VQSR, and specifically with which sets of annotations people have found to work well. The GATK team's recommendations are valuable, but my impression is that they have fairly homogenous data types - I'd like to know if anyone has found it useful to deviate from their recommendations.

For instance, I no longer include InbreedingCoefficient with my exome runs. This was spurred by a case where previously validated variants were getting discarded by VQSR. It turned out that these particular variants were homozygous alternate in the diseased samples and homozygous reference in the controls, yielding an InbreedingCoefficient very close to 1. We decided that the all-homozygous case was far more likely to be genuinely interesting than a sequencing/variant calling artifact, so we removed the annotation from VQSR. In order to catch the all-heterozygous case (which is more likely to be an error), we add a VariantFiltration pass for 'InbreedingCoefficient < -0.8' following ApplyRecalibration.

In my case, I think InbreedingCoefficient isn't as useful because my UG/VQSR cohorts tend to be smaller and less diverse than what the GATK team typically runs (and to be honest, I'm still not sure we're doing the best thing). Has anyone else found it useful to modify these annotations? It would be helpful if we could build a more complete picture of these metrics in a diverse set of experiments.
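The two extreme cases described above can be reproduced with the textbook estimate F = 1 - H_obs / H_exp, where H_exp = 2pq. This is a simplified Python sketch based on hard genotype calls. As far as I know GATK derives its annotation from genotype likelihoods, so the numbers will not match exactly, and the function name `inbreeding_coefficient` is hypothetical.

```python
def inbreeding_coefficient(genotypes):
    """genotypes: alt-allele count per diploid sample (0, 1 or 2).

    Returns F = 1 - H_obs / H_exp, or None if the site is monomorphic.
    """
    n = len(genotypes)
    if n == 0:
        return None
    p = sum(genotypes) / (2 * n)        # alt allele frequency
    h_exp = 2 * p * (1 - p)             # expected heterozygosity under HWE
    if h_exp == 0:
        return None
    h_obs = sum(1 for g in genotypes if g == 1) / n
    return 1 - h_obs / h_exp


# Cases all hom-alt, controls all hom-ref: no heterozygotes at all
cases_vs_controls = [2] * 10 + [0] * 10
# Every sample heterozygous: more hets than expected
all_het = [1] * 20

f_hom = inbreeding_coefficient(cases_vs_controls)
f_het = inbreeding_coefficient(all_het)
print(f_hom, f_het)
```

The all-homozygous cohort gives F = 1 and the all-heterozygous cohort gives F = -1, which matches the reasoning for dropping the annotation from the model and keeping a hard filter on strongly negative values.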

VQSR INDEL output error

Hi,
I am new to GATK and have been trying to figure out a strange error that I haven't been able to resolve for days.

Process so far.
1. Run UnifiedGenotyper per chr using -L option on ~ 130 samples
2. Merge all output vcf files into one. (using tabix to gz and index each vcf file, then use vcf-concat to merge all chr* files)
3. Use a perl script to sort merged vcf file based on the reference file order. i.e (chr1, 2, 3...M)
4. Split Merged.sorted.vcf file into INDEL and SNV files.
5. Run VQSR on each file (SNV and INDEL).

Error that I get:
During ApplyRecalibration for INDELs I get an error in chr9 stating that a record at coordinate A comes after a record at coordinate B (with A < B; A and B are different values each time). This always happens in chr9. I checked my input Merged.sorted.indel.vcf file around coordinates A and B and the file is in order. I checked the recal file and it is also in order. So I can't figure out where the error is coming from. The strange thing is that the error is reported when GATK is creating the output file, not during the computation/applying of the recalibration.

Has anyone encountered such a situation before? Or does anyone have ideas I should try to resolve the error? I don't get any errors with SNVs, only with INDELs.

Exact error message:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Unable to merge temporary Tribble output file.
at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.mergeExistingOutput(HierarchicalMicroScheduler.java:259)
at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:103)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:248)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)
Caused by: org.broad.tribble.TribbleException$MalformedFeatureFile: We saw a record with a start of chr9:33020249 after a record with a start of chr9:34987121, for input source: /data2/bsi/secondary/multisample/Merged.variant.filter.INDEL_2.vcf
at org.broad.tribble.index.DynamicIndexCreator.addFeature(DynamicIndexCreator.java:164)
at org.broadinstitute.sting.utils.codecs.vcf.IndexingVCFWriter.add(IndexingVCFWriter.java:118)
at org.broadinstitute.sting.utils.codecs.vcf.StandardVCFWriter.add(StandardVCFWriter.java:163)
at org.broadinstitute.sting.gatk.io.storage.VCFWriterStorage.mergeInto(VCFWriterStorage.java:120)
at org.broadinstitute.sting.gatk.io.storage.VCFWriterStorage.mergeInto(VCFWriterStorage.java:26)
at org.broadinstitute.sting.gatk.executive.OutputMergeTask.merge(OutputMergeTask.java:48)
at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.mergeExistingOutput(HierarchicalMicroScheduler.java:253)
... 6 more

ERROR ------------------------------------------------------------------------------------------

Exact command:

/usr/java/latest/bin/java -Xmx6g -XX:-UseGCOverheadLimit -Xms512m -jar /projects/apps/alignment/GenomeAnalysisTK/latest/GenomeAnalysisTK.jar -R /data2/reference/sequence/human/ncbi/37.1/allchr.fa -et NO_ET -K /projects/apps/alignment/GenomeAnalysisTK/latest/Hossain.Asif_mayo.edu.key -mode INDEL -T ApplyRecalibration -nt 4 -input /data2/secondary/multisample/Merged.variant.INDEL.vcf.temp -recalFile /data2/secondary/multisample/temp/Merged.variant.INDEL.recal -tranchesFile /data2/secondary/multisample/temp/Merged.variant.INDEL.tranches -o /data2/secondary/multisample/Merged.variant.filter.INDEL_2.vcf

Version of GATK : 1.7 and 1.6.7
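Checking the order around two coordinates by eye can miss a problem elsewhere, so a full scan of the file is a useful sanity check. Below is a minimal Python sketch that reports the first record that sorts before its predecessor. The function name `first_out_of_order` is made up, and the contig order must be supplied by the caller (for example from the reference's .fai or .dict file); the toy records reuse the two chr9 coordinates from the error message.

```python
def first_out_of_order(vcf_lines, contig_order):
    """Return (line_number, previous_site, offending_site) or None if sorted.

    Raises KeyError for a contig that is missing from contig_order.
    """
    rank = {name: i for i, name in enumerate(contig_order)}
    prev_key = prev_site = None
    for lineno, line in enumerate(vcf_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t", 2)[:2]
        site = (chrom, int(pos))
        key = (rank[chrom], site[1])
        if prev_key is not None and key < prev_key:
            return lineno, prev_site, site
        prev_key, prev_site = key, site
    return None


# Toy records using the coordinates from the error message
lines = [
    "##fileformat=VCFv4.1",
    "chr9\t33020000\t.\tA\tAT\t.\t.\t.",
    "chr9\t34987121\t.\tG\tGA\t.\t.\t.",
    "chr9\t33020249\t.\tC\tCT\t.\t.\t.",
]
problem = first_out_of_order(lines, ["chr1", "chr9", "chrM"])
print(problem)
```

If the input and recal files both pass such a scan, the ordering problem probably lies elsewhere, for example in the merge of temporary output files named in the stack trace, which this sketch cannot diagnose.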

Does setting the -out_mode flag in UnifiedGenotyper differently have a different effect when doing VQSR?

Hello,

Does VQSR behave differently when the -out_mode flag in UnifiedGenotyper is set to EMIT_VARIANTS_ONLY as compared to EMIT_ALL_CONFIDENT_SITES? I think by using EMIT_ALL_CONFIDENT_SITES we might give VQSR more information to train the model, but I may be wrong. Can someone please help me with this? Thanks.

cheers,
Rahul

VQSR Training

When performing VQSR, the variants in the data set overlap with the training set. Are all of the overlapping variants used in training, or are they downsampled?
