Hi,
I want to calculate statistics of missing data per each site in my vcf file.
Using vcftools --missing-site
gives wrong stats for several sites.
Is there is any other way to calculate it?
Thank you!
I have 36 samples and here is an example of the vcftools --missing-site
output for four positions below
chr1 10616 rs376342519 CCGCCGTTGCAAAGGCGCGCCG C 194.49 PASS AC=3;AF=0.083;AN=36;DB;DP=132;ExcessHet=0.1902;FS=0;InbreedingCoeff=0.4307;MLEAC=5;MLEAF=0.139;MQ=38.12;MQRankSum=-0.842;QD=24.31;SOR=1.329;VQSLOD=-6.944;culprit=MQ GT:AD:DP:GQ 0/0:2,0:2:6 ./.:0,0:0:. ./.:0,0:0:. ./.:0,0:0:. 0/0:5,0:5:15 0/0:2,0:2:3 ./.:0,0:0:. 1/1:0,3:3:10 ./.:0,0:0:. ./.:0,0:0:. 0/0:9,0:9:24 0/0:11,0:11:27 ./.:0,0:0:. ./.:0,0:0:. 0/0:9,0:9:15 0/0:19,0:19:54 0/0:4,0:4:12 0/0:9,0:9:2 ./.:0,0:0:. 0/0:9,0:9:9 0/0:8,0:8:18 0/0:8,0:8:24 0/0:6,0:6:15 ./.:0,0:0:. ./.:0,0:0:. 0/0:2,0:2:3 0/0:6,0:6:6 ./.:0,0:0:. ./.:0,0:0:. ./.:0,0:0:. 0/0:8,0:8:24 ./.:0,0:0:. 0/1:3,2:5:76 ./.:0,0:0:. ./.:0,0:0:. ./.:2,0:2:.
chr2 120680534 rs74654922 A ATGTGTGTG 12887.7 PASS AC=7;AF=0.175;AN=40;DB;DP=751;ExcessHet=3.0103;FS=0;InbreedingCoeff=0.5714;MLEAC=12;MLEAF=0.3;MQ=58.52;NEGATIVE_TRAIN_SITE;QD=29.16;SOR=0.883;VQSLOD=-1.646;culprit=MQ GT:AD:DP:GQ 1|1:0,10:12:30 0|0:0,1:17:48 .|.:0,2:16:. 0|0:0,1:21:60 0/0:0,1:6:16 ./.:38,0:38:. 0|0:0,0:24:72 ./.:21,0:21:. 0/0:0,0:13:39 .|.:0,1:23:. 0/0:0,2:11:48 0|0:0,1:22:63 .|.:0,0:18:. ./.:20,0:20:. ./.:15,0:15:. 1/0:0,12:30:99 0|0:0,0:12:36 1|1:0,19:20:57 .|.:0,1:25:. ./.:22,0:22:. 0|0:0,0:19:57 0|0:0,0:13:39 .|.:0,1:16:. 0|0:0,0:26:78 ./.:26,0:26:. 0|0:0,0:29:87 0|0:0,1:29:84 1|1:0,19:21:57 .|.:0,0:23:. 0|0:0,2:23:63 ./.:14,0:14:. 0|0:0,0:20:60 ./.:11,0:11:. 0|0:0,0:15:45 ./.:32,0:32:. ./.:35,0:35:.
chr22 50808269 . AG A 105.55 PASS AC=2;AF=0.059;AN=34;DP=1217;ExcessHet=0.1296;FS=0;InbreedingCoeff=0.1633;MLEAC=7;MLEAF=0.206;MQ=36.17;NEGATIVE_TRAIN_SITE;QD=30.35;SOR=2.584;VQSLOD=-3.491;culprit=MQ GT:AD:DP:GQ ./.:22,0:22:. 0/0:38,0:38:0 ./.:24,0:24:. ./.:20,0:20:. ./.:26,0:26:. ./.:43,0:43:. ./.:26,0:26:. ./.:19,0:19:. 0/0:30,0:30:0 ./.:26,0:26:. ./.:27,0:27:. 0/0:43,0:43:0 0/0:22,0:22:0 ./.:31,0:31:. ./.:28,0:28:. 0/0:34,0:34:0 ./.:26,0:26:. 1/1:0,2:7:7 0/0:51,0:51:0 0/0:33,0:33:0 ./.:21,0:21:. ./.:9,0:9:. 0/0:22,0:22:0 0/0:28,0:28:0 0/0:41,0:41:0 0/0:61,0:61:0 ./.:54,0:54:. 0/0:43,0:43:0 ./.:36,0:36:. 0/0:52,0:52:0 ./.:17,0:17:. 0/0:51,0:51:0 ./.:19,0:19:. 0/0:63,0:63:0 ./.:34,0:34:. 0/0:43,0:43:0
chr18 67740193 . C CTTTTTTTTTTTTTT 8423.06 PASS AC=9;AF=0.196;AN=46;DP=616;ExcessHet=0;FS=0;InbreedingCoeff=0.618;MLEAC=10;MLEAF=0.217;MQ=57.96;NEGATIVE_TRAIN_SITE;QD=26.16;SOR=0.956;VQSLOD=-2.121;culprit=MQ GT:AD:DP:GQ .|.:0,4:15:. .|.:0,5:15:. 1/1:0,2:5:42 .|.:0,2:14:. 1/1:0,5:6:0 0/0:29,0:29:0 0/0:15,0:15:6 0/0:0,1:6:14 0/0:13,0:13:0 .|.:0,4:12:. 0|0:0,6:26:60 .|.:0,3:11:. .|.:0,1:6:. 0/0:0,1:4:20 1|1:0,20:25:60 0/0:18,0:18:0 .|.:0,4:15:. .|.:0,4:18:. 0/0:29,0:29:12 0/0:0,5:15:90 0/0:0,2:6:18 0/0:0,1:6:38 .|.:0,4:15:. 0|0:0,1:17:46 0/0:0,3:9:44 .|.:0,6:12:. 0|0:0,1:18:51 0/0:21,0:21:6 .|.:0,8:16:. .|.:0,5:17:. .|.:0,4:14:. 1/1:0,1:1:0 0|0:0,2:27:75 1/0:0,1:5:38 0/0:24,0:24:15 0/0:16,0:16:9
And here is the output of vcftools --missing-site
CHR POS N_DATA N_GENOTYPE_FILTERED N_MISS F_MISS
chr1 10616 72 0 36 0.5
chr2 120680534 66 0 26 0.393939
chr22 50808269 72 0 38 0.527778
chr18 67740193 59 0 13 0.220339
When in reality it should be this
chr1 10616 72 0 36 0.5
chr2 120680534 72 0 32 0.44
chr22 50808269 72 0 38 0.527778
chr18 67740193 72 0 26 0.361
Why do you think it is wrong with
vcftools --missing-site
exactly ?Made edits in the question with examples
Thanks for elaborating. Looks like there is a bug, and the tool is counting only half of the missing allele for unphased genotypes (indicated with
.|.
). It is counting properly for the phased genotype (./.
) however. I don't know if there is another tool to do this, but you could try to update vcftools in case the issue got fixed in a newer version. It should also be relatively easy to write a small script that gives you the proper answer.Yes, the thing that it does not give a right calculation even for unphased data, so DO NOT use vcftools for missing site calculations :)
I might consider awk instead
Crossposted on stackoverflow.