Extract Individual Genotypes From 1000 Genomes When SNP Not In VCF
11.8 years ago

I would like individual-level genotypes for a SNP that appears in the 1000 Genomes browser and in dbSNP. When I pull the VCF for the region using the Data Slicer, I get calls for SNPs around my mystery SNP but not for the mystery SNP itself. Downloading the VCF and searching it with tabix gives the same result. It's not clear to me why this SNP doesn't have individual calls. Is the most reasonable way forward to pull the region around the SNP from the source BAM files and call genotypes with mpileup? If so, is there a better way to do this than manually scripting it out? Thanks.
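For anyone repeating the check described above, here is a minimal Python sketch of looking up one position in VCF text and pulling the per-sample GT values. It assumes the input is uncompressed VCF lines that include the #CHROM header line (for example, text saved from a `tabix -h` query). The function name `find_genotypes` and the example record are hypothetical and are not real 1000 Genomes data.

```python
def find_genotypes(vcf_lines, chrom, pos):
    """Return {sample: GT} for the record at chrom:pos.

    Returns None when no record exists at that position, and an empty
    dict when the record exists but carries no per-sample genotypes
    (for example, a sites-only VCF).
    """
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names start at column 10
            continue
        if fields[0] != chrom or int(fields[1]) != pos:
            continue
        if len(fields) < 10:
            return {}  # site is listed, but without genotype columns
        keys = fields[8].split(":")
        if "GT" not in keys:
            return {}
        gt_index = keys.index("GT")
        return {
            sample: column.split(":")[gt_index]
            for sample, column in zip(samples, fields[9:])
        }
    return None


# Hypothetical two-sample example, not real 1000 Genomes data.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_A\tSAMPLE_B",
    "1\t1000\trs_hypothetical\tA\tG\t100\tPASS\t.\tGT:DS\t0|1:1.0\t0|0:0.0",
]

print(find_genotypes(example, "1", 1000))
print(find_genotypes(example, "1", 1001))
```

A result of None corresponds to the situation in the question: neighbouring sites have records, but the position of interest has none in the file.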

1000genomes • 4.9k views

The most likely explanation is that your SNP is in dbSNP but not in 1000 Genomes. dbSNP contains many false-positive SNPs, so it may have been left out because it was not found in the 1000 Genomes data.


An equally likely cause is that 1000g missed it. SNP calling involves many filters, and we know that even common SNPs occasionally get filtered out.


The unfiltered input call sets for phase1 can be found at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/input_call_sets/

If your site is in this set and we filtered it out, we believe it is a false positive.

If your site isn't in this set, it might be real but rare, or it might be a false positive. You would have to assess the quality and source of your data to make that decision.

11.8 years ago
lh3 33k

My suggestion is to check the individual call sets first, which are available here. If you cannot find the SNP in any of them, it is likely to be a false one. A caveat is that most of these call sets do not provide accurate genotypes. When you can get genotype likelihoods in GL or PL, you can use Beagle to impute genotypes.


Thanks for the suggestion and link. Would that be the data in 20110512wgVQSRv2GLbeagle_genotypes?


First check whether the SNP is called in any of the call sets from 20110302_phase1_wg_snps. All the call sets are filtered, but if every call set has filtered out your SNP, it is likely to be a false one. Once you confirm the presence of the SNP, you can check 20110512. If it is not there, you will need to extract GLs from a 20110302 call set and run Beagle yourself.
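As a rough illustration of the "extract GLs" step, the sketch below reads the GL or PL entry from one sample column of a VCF record and expresses it as log10 genotype likelihoods. It relies on the VCF definitions of those fields: GL is already log10-scaled, and PL is phred-scaled, so log10 L = -PL/10. The function names are hypothetical, the example values are made up, and the normalization step assumes a flat prior over genotypes purely for illustration. It is not what Beagle does internally.

```python
def genotype_log10_likelihoods(format_field, sample_field):
    """Return log10 genotype likelihoods from GL or PL, or None if absent."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))

    def parse(raw, convert):
        parts = raw.split(",")
        if any(p in (".", "") for p in parts):
            return None  # missing value in the list
        return [convert(p) for p in parts]

    if "GL" in values:
        # GL is stored as log10 likelihoods, so no conversion is needed.
        return parse(values["GL"], float)
    if "PL" in values:
        # PL is phred-scaled: log10(L) = -PL / 10.
        return parse(values["PL"], lambda p: -float(p) / 10.0)
    return None


def normalize(log10_likelihoods):
    """Turn log10 likelihoods into probabilities, assuming a flat prior."""
    best = max(log10_likelihoods)
    scaled = [10 ** (x - best) for x in log10_likelihoods]  # avoids underflow
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical sample columns, not real 1000 Genomes data.
from_gl = genotype_log10_likelihoods("GT:GL", "0/1:-0.5,-0.1,-3.0")
from_pl = genotype_log10_likelihoods("GT:PL", "0/1:30,0,50")
print(from_gl, normalize(from_gl))
print(from_pl, normalize(from_pl))
```

For a biallelic site the three values are ordered as the likelihoods of genotypes 0/0, 0/1 and 1/1.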


Thanks again for the help. The SNP is not in those call sets. I now see that following the link back to 1000g from dbSNP puts the SNP on a track called "dbSNP submissions not present in 1000 Genomes". Presumably that means the SNP entered the 1000g browser from dbSNP, but there's no evidence for it in the 1000g data.

11.8 years ago
Adam ★ 1.0k

Another possibility is that this SNP was called in the early phases of 1000G but removed in later phases as calling methods improved. Some of those old SNPs might still be in older versions of dbSNP, which could cause confusion.
