Why So Many Missing Genotypes In 1000 Genomes Data?
13.4 years ago
Paul ▴ 760

Hi,

I downloaded what I think is the latest set of SNP variant calls for the European samples from the 1000 Genomes Project here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/EUR.2of4intersection_allele_freq.20100804.genotypes.vcf.gz

I used tabix to download the data and vcftools to convert it to a matrix of 0, 1 and 2, with -1 for a missing genotype.
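For anyone wanting to reproduce that encoding by hand, here is a minimal Python sketch. It is not vcftools' own code; the function names are made up for illustration, and it assumes the genotype (GT) is the first sub-field of each sample column, that samples start at the tenth column, and that any non-reference allele counts as 1.

```python
def encode_genotype(sample_field):
    """Return the non-reference allele count (0, 1 or 2), or -1 if missing.

    Assumes GT is the first colon-separated sub-field, e.g. "0|1:35".
    """
    gt = sample_field.split(":")[0]
    alleles = gt.replace("|", "/").split("/")
    # A "." in any allele position is treated as a missing genotype.
    if any(allele == "." for allele in alleles):
        return -1
    return sum(1 for allele in alleles if allele != "0")


def encode_vcf_line(line):
    """Return (chrom, pos, encoded genotypes) for one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    # Columns 0-8 are CHROM..FORMAT; sample columns start at index 9.
    return fields[0], int(fields[1]), [encode_genotype(s) for s in fields[9:]]
```

Header lines (those starting with "#") are not handled and would need to be skipped before calling `encode_vcf_line`.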

I'm interested in SNPs in the exons of a particular gene. The thing is, of the 13,000 SNPs for which there is data, a huge number have missing values: genotypes are missing for nearly 5,000.

Does anyone have any idea why there are so many missing genotypes in this data?

Thanks!

genome snp • 4.4k views

Are the genotypes missing for all samples at a given site, or are you saying that of the 13,000 SNPs, 5,000 have at least one missing genotype?


Hi, in most cases the genotypes seem to be missing in all samples at a given site.
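One way to check this distinction is to count, per site, how many samples are missing. A minimal sketch, assuming the genotypes are already in a list of rows with one row per site and -1 for missing (if your matrix has one row per individual, transpose it first); `summarise_missingness` is a hypothetical helper name:

```python
def summarise_missingness(site_rows):
    """Count sites that are missing in all, some, or no samples.

    site_rows: list of non-empty lists, one per site, holding 0/1/2
    genotype codes with -1 for a missing genotype.
    """
    summary = {"all_missing": 0, "partly_missing": 0, "complete": 0}
    for row in site_rows:
        n_missing = sum(1 for g in row if g == -1)
        if n_missing == len(row):
            summary["all_missing"] += 1
        elif n_missing > 0:
            summary["partly_missing"] += 1
        else:
            summary["complete"] += 1
    return summary
```

A large `all_missing` count points to sites that never received genotypes at all, rather than to low coverage in individual samples.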

13.4 years ago
Laura ★ 1.8k

The 20100804 data has missing genotypes because of the way it was created.

The set itself was a naive 2-of-4 intersection of four input call sets. Only two of these four sets had genotypes associated with them: Broad and UMich. The Broad genotype set was phased and used LD information, so it was felt to be better. Any SNP with a Broad genotype got that genotype, a SNP with only a UMich genotype got the UMich genotype, and any SNP called only by NCBI and Boston College didn't get any genotype.
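The rule described above can be sketched in a few lines of Python. This is only an illustration of the description, not the pipeline that built the release: the function name, the call set labels and the reading of "2 of 4" as "called by at least two of the four call sets" are all assumptions.

```python
def merge_site(callers, broad_gt=None, umich_gt=None):
    """Return the genotype string a site would carry, or None if dropped.

    callers: set of call set names that reported the SNP
             (labels such as "Broad" or "NCBI" are illustrative).
    broad_gt / umich_gt: genotype string from that call set, if any.
    """
    # Assumed meaning of "2 of 4": keep sites seen by two or more call sets.
    if len(callers) < 2:
        return None
    # Broad genotypes take priority because they were phased with LD info.
    if broad_gt is not None:
        return broad_gt
    if umich_gt is not None:
        return umich_gt
    # Called only by sets without genotypes: the site is kept, genotype missing.
    return "./."
```

Under this sketch, a site supported only by NCBI and Boston College passes the intersection but comes out as "./." for every sample, which matches the pattern of sites missing in all samples.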

We have just released a new data set which has much more complete phased genotypes for a larger number of individuals, but we don't have population-level allele frequencies yet:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20101123/interim_phase1_release/

thanks


Hi Laura, any idea when the mitochondrial genotypes will be released for the interim phase 1 release (or a newer one, of course)? I'm asking because I found that there's a lot of missing data in the first release, so effectively we only have "true" access to 3% of the whole genotype dataset for MT. Secondly, how should I interpret the biallelic genotype calls for the MT, given that it is haploid (or, if you prefer, N-ploid)?

Thanks in advance for your time


We recently realised that the person who generated the MT genotypes for the pilot used "." where they should have used "0", so that makes that data set much more useful. Heterozygous genotypes represent mitochondrial heteroplasmy. We hope to have new MT genotypes before the end of the year, but I can't give a better timeline than that.
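If you want to apply that correction yourself, a minimal Python sketch is below. It assumes the fix applies only to the pilot MT genotypes, where "." stands in for the reference allele; in other files "." really does mean missing and should be left alone. `recode_mt_genotype` is a hypothetical helper name.

```python
import re


def recode_mt_genotype(sample_field):
    """Replace "." alleles with "0" in the GT part of a sample field.

    Only the first colon-separated sub-field (assumed to be GT) is
    changed; any other sub-fields are passed through untouched.
    """
    gt, sep, rest = sample_field.partition(":")
    # Split on "/" or "|" but keep the separators so phasing is preserved.
    parts = re.split(r"([/|])", gt)
    recoded = "".join("0" if part == "." else part for part in parts)
    return recoded + sep + rest
```

For example, `recode_mt_genotype("./1")` gives `"0/1"`, which under the reply above would be read as heteroplasmy at that site.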

13.3 years ago
User 7433 ▴ 170

Hi there, I kind of have the same question. I have downloaded data for a particular gene, and there are huge amounts of missing data!

The missing data is not at the level of individual sites (i.e. it is not that everyone is missing data for a given SNP). It is totally haphazard. E.g. SNP 1 has data for African Americans and Europeans, whilst SNP 2 has data for Yoruba and Asians but lacks African Americans.

Can anyone explain why this is?

Thanks x


Don't post a new question in the answer section. Ask your question separately as a new question. Feel free to link to this question as an example. This post will be deleted; we will leave it here a bit so that you can see this comment.
