Consider this minimal VCF file which contains a multi-allelic call made by a somatic variant caller on a human tumor sample:
##fileformat=VCFv4.3
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">
##FILTER=<ID=multiallelic,Description="Site filtered because too many alt alleles pass tumor LOD">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT foo
chr10 43613843 . GC TC,CC,TT . clustered_events;multiallelic DP=5180 GT:AD:AF:DP:F1R2:F2R1:SB 0/1/2/3:0,5154,11,15:0.994,0.00229,0.00306:5180:0,2507,4,8:0,2576,6,5:0,0,2585,2595
So, we have a site with three ALT alleles and a genotype of 0/1/2/3
.
The VCF specification has this to say about the genotype field:
GT (String): Genotype, encoded as allele values separated by either of/or|. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on.
As I understand the VCF specification, given the ALT field TC,CC,TT
, the genotype of 0/1/2/3
is telling me that the reference sequence is absent but there are heterozygous calls for each of the three variants in ALT. This makes sense because this is a tumor sample, so presumably what we have here is three populations of cells, some with the GC=>TC change, some with GC=>CC and some with GC=>TT. Given the very high support for the GC=>TC change, I would assume this is actually a germline variant and there are small populations of tumor cells in my sample with either the GC=>CC or the GC=>TT variant. Alternatively, and quite likely, the latter two variants are sequencing artefacts.
Regardless, I need to split these into one variant per line. As I understand the specification, after splitting, I should be left with three variants and these specific genotypes (I am showing only the genotype here, for simplicity):
chr10 43613843 . GC TC GT 1/0
chr10 43613843 . GC CC GT 1/0
chr10 43613843 . GC TT GT 1/0
Is this actually correct? Unfortunately, standard tools like bcftools
produce something different:
$ bcftools norm -m -any foo.vcf 2>/dev/null | grep chr10 | awk '{print $1,$2,$3,$4,$5,$9,$10}'
chr10 43613843 . GC TC GT:AD:AF:DP:F1R2:F2R1:SB 0/1/0/0:0,5154:0.994:5180:0,2507:0,2576:0,0,2585,2595
chr10 43613843 . GC CC GT:AD:AF:DP:F1R2:F2R1:SB 0/0/1/0:0,11:0.00229:5180:0,4:0,6:0,0,2585,2595
chr10 43613843 . GC TT GT:AD:AF:DP:F1R2:F2R1:SB 0/0/0/1:0,15:0.00306:5180:0,8:0,5:0,0,2585,2595
Here, we have sites with a single ALT variant but which have multiple genotypes and that doesn't make sense to me. How can we have a 0/0/0/1
genotype and only one ALT allele? Is bcftools
assuming this is a triploid genome? Is this a bug in bcftools
or am I misunderstanding the VCF specification? What is the correct way to display the genotype after splitting this kind of multi-allelic call?
=====
This question has also been asked on Bioinformatics Stack Exchange.