I am trying to understand the per sample genotype data given in the following file from 1000 genomes:
ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
I start by explaining what I see, and then I have two questions at the end.
What I see is a FORMAT column with the entry:
GT:DS:GL
with GT and GL defined as in the VCF definition, with DS being a custom format defined as:
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage from MaCH/Thunder">
One of the several things that I am not understanding is how the genotype call is being made in some cases.
For example, there are many entries in which the GL values are -0.48,-0.48,-0.48
My understanding is that then the read data does not allow a call between the reference or the variant allele at such a bialletic.
However, the DS reading seems to be being used. Specifically, looking across the whole file, I run the following awk script:
BEGIN { FS="\t" }
$1=="#CHROM"{ for (i=10;i<=NF;i++) {
subject[i]=$i
}
next
}
{ for (i=10;i<=NF;i++) {
if ($i ~ /:-0.48,-0.48,-0.48/) {
if ( done[$i] == 0 ) {
done[$i] = 1
print $i , NR, subject[i]
}
}
}
}
This is executed:
gunzip -c ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz |
awk -f gl.awk
Shows DS values from 0.000 to 2.000 with granularity of 0.050. As far as I can tell this value, and some other data, seems to allow a genotype call to be made.
What I see, is that if the DS value lies between 0.000 and 0.500 then the phased genotype is called as 0|0, e,g.
0|0:0.150:-0.48,-0.48,-0.48 39 HG00140
If the DS value lies between 0.550 and 1.000 then the genotype is called as one of 0|0, 1|0, or 0|1 e.g
1|0:0.850:-0.48,-0.48,-0.48 34 HG01168
0|0:0.850:-0.48,-0.48,-0.48 34 HG00242
0|1:0.850:-0.48,-0.48,-0.48 34 HG00143
If the DS value is between 1.050 and 1.500 then the genotype is called as one of 1|1, 1|0, or 0|1 e.g.
0|1:1.200:-0.48,-0.48,-0.48 31 NA19108
1|1:1.200:-0.48,-0.48,-0.48 32 HG01069
1|0:1.200:-0.48,-0.48,-0.48 34 NA12383
If the DS value is between 1.550 and 2.000 then the genotype is called as 1|1.
My two questions are:
When the DS lies between 0.5 and 1.5 and it does not indicate definitively either 0|0 or 1|1 on what basis is the genotype call made, and by which piece of software?
What is this DS value anyway, and how does it allow the 0|0 and 1|1 calls to be made? "Genotype dosage from MaCH/Thunder" is somewhat cryptic.
This is correct, as the 1000genomes is low coverage all positions do not share equal coverage in all individuals so imputation and haplotype inference is needed to give every individual a genotype are every position
Dosage isn't necessarily 0, 1 or 2 but can be a continum as it shows how likely the genotype is on the basis of the imputation evidencehttp://www.1000genomes.org/faq/what-does-genotype-dosage-mean-phase1-integrated-call-set
Thanks, this does answer the first of my two questions. The calls appear to be made on the basis of haplotype information