Hello,
I'm interested in comparing the Genome in a Bottle (GIAB) genotypes for NA12878 to those of her parents (NA12891 and NA12892).
I downloaded GIAB's NA12878 vcf from here.
After a lot of searching around, I found this page from Broad describing a vcf containing the genotypes for the trio.
And I downloaded the variants for NA12891 and NA12892 from here.
In total, there are ~3.3 million variants in the GIAB vcf. I compared the alternate alleles and genotypes in the GIAB vcf with the corresponding values in her parents and found that ~27% of the positions had parental genotypes that didn't make sense.
For example, a position in the daughter is genotyped as 1/1, but the father is 0/1 and the mother is 0/0. The daughter cannot be 1/1 with these parents, because the mother has no alternate allele to pass on.
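To make the rule above concrete, here is a minimal sketch of a per-site Mendelian consistency check. The function names are made up for illustration, and it assumes diploid genotypes and that the allele indices (0, 1, ...) refer to the same REF/ALT alleles in all three samples, which has to be verified separately when the calls come from different VCF files.

```python
from itertools import product

def parse_gt(gt):
    """Split a VCF GT string such as '0/1' or '1|0' into a list of allele strings."""
    return gt.replace("|", "/").split("/")

def is_mendelian_consistent(child, father, mother):
    """Return True if the child's genotype can be built from one paternal and
    one maternal allele, False if it cannot, and None if any call is missing.

    Assumption: all three genotypes are diploid and use the same allele numbering.
    """
    c, f, m = parse_gt(child), parse_gt(father), parse_gt(mother)
    if "." in c + f + m:
        return None  # at least one missing allele, so the site cannot be judged
    target = sorted(c)
    # Try every (paternal allele, maternal allele) pair, ignoring phase.
    return any(sorted([a, b]) == target for a, b in product(f, m))

# The case from the post: daughter 1/1, father 0/1, mother 0/0
print(is_mendelian_consistent("1/1", "0/1", "0/0"))
```

A None result is worth counting separately from False, since a missing parental call is a different problem from a genuine Mendelian violation.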
I'm aware that the GIAB vcf has gone through a lot more curation than those of her parents, so perhaps that accounts for the discrepancy?
I'm pretty sure I'm using the correct files, but if anyone thinks otherwise, please let me know.
Thank you.
You need to restrict your analysis to the high-confidence regions provided in the NIST BED file.
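As a rough sketch of what that restriction looks like in code, the snippet below tests whether a VCF position falls inside a BED interval. The function names are hypothetical. It assumes the BED intervals do not overlap and that chromosome names match between the two files (e.g. "1" vs "chr1"). BED coordinates are 0-based and half-open while VCF POS is 1-based, so the position is shifted by one before the lookup.

```python
from bisect import bisect_right
from collections import defaultdict

def load_bed(lines):
    """Build per-chromosome sorted lists of (start, end) tuples from BED lines."""
    regions = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header lines
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    for chrom in regions:
        regions[chrom].sort()
    return regions

def in_confident_region(regions, chrom, pos):
    """Return True if 1-based VCF position `pos` lies inside a BED interval.

    Assumption: intervals on each chromosome are non-overlapping.
    """
    intervals = regions.get(chrom, [])
    zero_based = pos - 1  # convert VCF POS to BED coordinates
    # Index of the last interval whose start is <= zero_based
    i = bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]

# Toy intervals, not real GIAB regions
regions = load_bed(["chr1\t100\t200", "chr1\t300\t400"])
print(in_confident_region(regions, "chr1", 150))
```

In practice you would pass an open file handle for the BED file to load_bed and keep only the trio sites for which in_confident_region is True.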
According to the README from Genome in a Bottle, the VCF contains high-confidence heterozygous and homozygous variant calls, which implies that those variants lie within the high-confidence regions. Any position that is in the confident BED file but absent from the VCF can be confidently treated as homozygous reference.
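The rule in the README amounts to a three-way decision for any position in NA12878, which the sketch below spells out. The function name and the dictionary layout are assumptions made for illustration; they are not part of any GIAB tooling.

```python
def giab_genotype(vcf_calls, chrom, pos, is_confident):
    """Resolve the NA12878 genotype at a site following the README rule.

    vcf_calls    : dict mapping (chrom, pos) to a GT string from the GIAB VCF
    is_confident : True if the site lies inside the high-confidence BED regions
    """
    key = (chrom, pos)
    if key in vcf_calls:
        return vcf_calls[key]  # confident variant call from the VCF
    if is_confident:
        return "0/0"           # confident region, no variant: homozygous reference
    return None                # outside confident regions: genotype unknown

# Toy data, not real calls
calls = {("1", 1000): "0/1"}
print(giab_genotype(calls, "1", 1000, True))
print(giab_genotype(calls, "1", 2000, True))
print(giab_genotype(calls, "1", 3000, False))
```

Sites that come back as None are the ones to drop before comparing against the parents.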
ftp://ftp-trace.ncbi.nih.gov/giab/ftp/release/NA12878_HG001/latest/README.GIAB.v0.2.txt
As a quick sanity check, I previously used bedtools to confirm that there were zero positions in the VCF that were not in the confident BED file.