I used tabix to download the data and vcftools to convert it to a matrix of 0, 1 and 2, with -1 for a missing genotype.
I'm interested in SNPs in the exons of a particular gene. The problem is that, of the 13,000 SNPs for which there is data, a huge number have missing values: genotypes are missing for nearly 5,000 of them.
Does anyone have any idea why there are so many missing genotypes in this data?
The 20100804 data has missing genotypes because of the way it was created.
The set itself was a naive 2-of-4 intersection of 4 input call sets. Only 2 of these 4 sets had genotypes associated with them: Broad and UMich. The Broad genotype set was phased and used LD information, so it was felt to be better. Any SNP with a Broad genotype got that genotype, a SNP with only a UMich genotype got the UMich genotype, and any SNP called only by NCBI and Boston College didn't get any genotype.
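The priority rule described above can be sketched in a few lines of Python. This is only an illustration of the order of preference (Broad first, then UMich, otherwise no genotype). The function name `assign_genotype` and the use of `None` to mean "no genotype from this call set" are assumptions made for the example, not part of the actual pipeline.

```python
from typing import Optional


def assign_genotype(broad: Optional[str], umich: Optional[str]) -> Optional[str]:
    """Pick a genotype for one sample at one site.

    Hypothetical helper: `broad` and `umich` are the genotypes from the two
    call sets that carried genotypes, or None if that set had none here.
    """
    if broad is not None:
        # Broad genotypes were phased and LD-aware, so they take priority.
        return broad
    if umich is not None:
        # Fall back to UMich when Broad has nothing for this SNP.
        return umich
    # SNP called only by the two sets without genotypes: stays missing.
    return None


print(assign_genotype("0|1", "0/1"))
print(assign_genotype(None, "1/1"))
print(assign_genotype(None, None))
```

The third case is the one that produces sites with no genotype for anyone in the merged set.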
We have just released a new data set, which has much more complete phased genotypes for a larger number of individuals, but we don't have population-level allele frequencies yet.
Hi Laura, any idea when the mitochondrial genotypes will be released for the interim phase1 release (or a newer one, of course)? I'm asking because I found out there's a lot of missing data in the first release, so let's say we only have "true" access to 3% of the whole genotype dataset for MT.
Secondly, how should I interpret the biallelic genotype calls for the MT, given that it is haploid (or, if you want to see it otherwise, N-ploid)?
We recently realised that the person who generated the MT genotypes for the pilot used . where they should have used 0, so knowing that makes that data set much more useful. Heterozygous genotypes represent mitochondrial heteroplasmy. We hope to have new MT genotypes before the end of the year, but I can't give a better timeline than that.
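As a rough sketch of how that could be applied to the pilot MT genotypes: the snippet below recodes `.` alleles as `0` and flags calls whose alleles differ as heteroplasmic. It assumes VCF-style GT strings such as `./.` or `0/1`, and the answer does not say whether `.` was used per allele or per genotype. The function names are made up for the example, and the recoding should only be applied to the pilot MT data, where `.` is known to mean reference.

```python
import re


def recode_pilot_mt_genotype(gt: str) -> str:
    """Replace '.' alleles with '0' (reference), keeping the separator.

    Assumption: `gt` is a VCF-style GT string containing only allele
    indices, '.', and the separators '/' or '|'.
    """
    return gt.replace(".", "0")


def is_heteroplasmic(gt: str) -> bool:
    """Treat a call with two different alleles as heteroplasmy."""
    alleles = re.split(r"[/|]", gt)
    return len(set(alleles)) > 1


for raw in ["./.", "1/1", "0/1", "1/."]:
    fixed = recode_pilot_mt_genotype(raw)
    print(raw, fixed, is_heteroplasmic(fixed))
```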
Hi there, I kind of have the same question. I have downloaded data for a particular gene and there are huge amounts of missing data!
The missing data is not confined to individual sites (i.e. it is not the case that everyone is missing data for a given SNP). The missing data is totally haphazard. E.g. SNP 1 has data for African Americans and Europeans, whilst SNP 2 has data for Yoruba and Asians but lacks African Americans.
Don't post a new question in the answer section. Ask your question separately as a new question. Feel free to link to this question as an example. This post will be deleted; we will leave it here a bit so that you can see this comment.
Are the genotypes missing for all samples at a given site, or are you saying that of the 13,000 SNPs, 5,000 have at least one missing genotype?
Hi, in most cases the genotypes seem to be missing in all samples at a given site.
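To tell the two situations apart on a 0/1/2/-1 matrix, something like the following sketch could be used. It assumes one row per sample and one column per SNP (transpose the matrix if yours is laid out the other way), and the function name `summarise_missing` and the toy data are invented for the example.

```python
def summarise_missing(matrix, missing=-1):
    """Count sites that are missing in all samples, in some, or in none.

    Assumption: `matrix` is a list of rows, one row per sample and one
    column per SNP, with `missing` marking an absent genotype.
    """
    n_samples = len(matrix)
    n_sites = len(matrix[0]) if matrix else 0
    all_missing = 0
    partly_missing = 0
    for j in range(n_sites):
        n_miss = sum(1 for row in matrix if row[j] == missing)
        if n_miss == n_samples:
            all_missing += 1      # nobody has a genotype at this site
        elif n_miss > 0:
            partly_missing += 1   # only some samples lack a genotype
    return {
        "all_missing": all_missing,
        "partly_missing": partly_missing,
        "complete": n_sites - all_missing - partly_missing,
    }


# Toy matrix: 3 samples x 4 SNPs.
toy = [
    [0, -1, 2, 1],
    [1, -1, -1, 0],
    [2, -1, 0, 0],
]
print(summarise_missing(toy))
```

A large `all_missing` count would be consistent with sites that never received genotypes in the merged set, whereas a large `partly_missing` count would point to per-sample or per-population gaps.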