I've generated a bunch (70) of VCF files containing SNP data for individual humans (diploid, one sample per file). I've also merged all 70 samples into a single BCF file.
The issue is; when I ran bcftools stats
I got some unusual (I think impossible) Allele Frequency data (based on what I see directly in the file).
Heres a line from both the individual sample, and the compined samples. These are indicative of the entire file.
Sample: Individual, diploid (human)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample-63
1 12383 . G A 36.9773 . DP=76; VDB=0.320743; SGB=-0.693145; RPB=0.870155; MQB=0.979967; MQSB=0.839692; BQB=0.935797; MQ0F=0.0526316; ICB=1; HOB=0.5; AC=1; AN=2; DP4=19,14,21,20; MQ=4 GT:PL 0/1:67,0,5
Sample: 70 Individuals, diploid (human)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample-1 Sample-2 Sample-3 Sample-4 Sample-5 Sample-6 Sample-7 Sample-8 Sample-9 Sample-10 Sample-11 Sample-12 Sample-13 Sample-14 Sample-15 Sample-16 Sample-17 Sample-18 Sample-19 Sample-20 Sample-21 Sample-22 Sample-23 Sample-24 Sample-25 Sample-26 Sample-27 Sample-28 Sample-29 Sample-30 Sample-31 Sample-32 Sample-33 Sample-34 Sample-35 Sample-36 Sample-37 Sample-38 Sample-39 Sample-40 Sample-41 Sample-42 Sample-43 Sample-44 Sample-45 Sample-46 Sample-47 Sample-48 Sample-49 Sample-50 Sample-51 Sample-52 Sample-53 Sample-54 Sample-55 Sample-56 Sample-57 Sample-58 Sample-59 Sample-60 Sample-61 Sample-62 Sample-63 Sample-64 Sample-65 Sample-66 Sample-67 Sample-68 Sample-69 Sample-70
1 12383 . G A 115 PASS VDB=0.350935; SGB=-0.693147;RPB=0.996445; MQB=0.597159; MQSB=0.155977; BQB=0.983101; MQ0F=0.00813008; ICB=1; HOB=0.5; MQ=8; DP=2869; DP4=569,817,659,762; AN=56; AC=31 GT:PL 0/1:110,0,39 0/1:58,0,7 ./.:. ./.:. ./.:. 0/1:84,0,78 ./.:. 0/1:95,0,0 ./.:. ./.:. ./.:. 0/1:148,0,23 ./.:. ./.:. ./.:. 0/1:88,0,12 0/1:77,0,92 0/1:93,0,12 0/1:74,0,67 0/1:88,0,13 ./.:. ./.:. 0/1:70,0,49 0/1:69,0,61 1/1:128,14,0 ./.:. ./.:. ./.:. ./.:. 0/1:72,0,12 ./.:. 0/1:68,0,15 0/1:88,0,46 ./.:. ./.:. ./.:. 0/1:113,0,60 ./.:. ./.:. ./.:. 1/1:93,1,0 ./.:. ./.:. 0/1:82,0,80 ./.:. ./.:. ./.:. ./.:. ./.:. ./.:. ./.:. ./.:. ./.:. 0/1:97,0,42 ./.:. 0/1:53,0,16 ./.:. 0/1:75,0,28 0/1:85,0,47 ./.:. 0/1:111,0,121 1/1:99,20,0 0/1:67,0,5 0/1:62,0,0 ./.:. 0/1:72,0,55 ./.:. ./.:. ./.:. ./.:.
The important thing of note here is; both the individual file, and the 70-samples file have been filtered for SNPs only (so, the only GT data is either 0/X
or X/X
...but no 0/0
). There are the ./.:.
data which (based on my understanding) means the individual file was REF/REF for that point - and since only SNPs are being captured (NOT 0/0)...there is no data
for that sample at POS=12383. Also note the AN=56
and the AC=31
8in the combined file line).
The definition of AN
(as I understand it, as confirmed by the lines in MY files) is that AN is the sum of all the IDENTIFIED alleles in a line, regardless of whether they were REF or ALT. So 0/1 (AN=2)
and 1/1 (AN=2)
. AC
is all the IDENTIFIED REF in a line. So 0/1 (AC=1)
and 1/1 (AC=2)
. AF
(Allele frequency) is AC/AN
.
So, (again, as I understand it) in the individual file The AN will ALWAYS be 2 (since REF/REF or 0/0 were all filtered out). And the AC can ONLY be 1 or 2 (not three unless trisomy, and not 0 because filtered).
Also notice in the combined file, the ./.:.
refers to a sample that was REF/REF (and thus not included in the file). But if you look at the AC
and AN
in the combined line, the ./.:.
are NOT included in those counts.
QUESTIONS:
(one) Since, in the individual file(s), AN=2 and AC > 0
- it is impossible for the AF to be less than 0.5 (Q1: correct?)
(two) Since ALL the data in the combined file comes from similarly filtered individual files, the same is true for the combined file (AF cannot be less than 0.5) (Q2: Correct?).
It IS possible for the AN of a single line (70 samples) to be as high as 140 (70 samples x 2 chromosomes)...however (since the filter excludes 0/0) then the minimum AC would be 70. No matter how you sliced it, ```AF >= 0.5````
(three) So, a quick glance at the output of bcftools stats
tells me something's wrong, because theyre claiming I have a number of SNPs at AF=0.1
, AF=0.11
, AF=0.12
The question is; based on the above, where/how is it calculating that AF
? Am I doing it wrong, is bcftools
doing it wrong (based on the AN
and AC
in the lines themselves), or is the stats portion doing something else?
# AF, Stats by non-reference allele frequency:
# AF [2]id [3]allele frequency [4]number of SNPs [5]number of transitions [6]number of transversions [7]number of indels [8]repeat-consistent [9]repeat-inconsistent [10]not applicable
AF 1 0.100000 374 137 237 0 0 0 0
AF 1 0.110000 652 241 411 0 0 0 0
AF 1 0.120000 612 224 388 0 0 0 0
AF 1 0.130000 467 157 310 0 0 0 0
AF 1 0.140000 734 272 462 0 0 0 0
AF 1 0.150000 407 140 267 0 0 0 0
AF 1 0.160000 787 274 513 0 0 0 0
AF 1 0.170000 340 115 225 0 0 0 0
AF 1 0.180000 526 152 374 0 0 0 0
AF 1 0.190000 880 334 546 0 0 0 0
AF 1 0.200000 313 115 198 0 0 0 0
AF 1 0.210000 451 156 295 0 0 0 0
(...snip...)
AF 1 0.480000 1020 396 624 0 0 0 0
AF 1 0.490000 8252448 5590474 2661974 0 0 0 0
AF 1 0.500000 96973 65778 31195 0 0 0 0
AF 1 0.510000 316516 214884 101632 0 0 0 0
AF 1 0.520000 103985 68975 35010 0 0 0 0
AF 1 0.530000 67025 43946 23079 0 0 0 0
AF 1 0.520000 103985 68975 35010 0 0 0 0
AF 1 0.530000 67025 43946 23079 0 0 0 0
AF 1 0.540000 50260 32635 17625 0 0 0 0
AF 1 0.550000 50590 32823 17767 0 0 0 0
AF 1 0.560000 31551 20233 11318 0 0 0 0
AF 1 0.570000 31220 19894 11326 0 0 0 0
AF 1 0.580000 13029 8194 4835 0 0 0 0
AF 1 0.590000 26025 16457 9568 0 0 0 0
AF 1 0.600000 12164 7604 4560 0 0 0 0
(...snip...)
AF 1 0.970000 529 318 211 0 0 0 0
AF 1 0.980000 73 40 33 0 0 0 0
AF 1 0.990000 169643 100968 68675 0 0 0 0
I had typed out an answer but Kevin beat me (and did a better job). I think it's probably still worth providing a link to the VCF documentation for future reference on what the fields and terms within the VCF file refer to.
Hey Russ, my apologies. I think and work quite rapidly. Please feel free to provide your answer if you still wish. It's good to get all opinions/ideas out there.
No need to apologize! You provided a nice succinct answer, my additions would be redundant :) Cheers.