Question

Errors in dbSNP

4

13.7 years ago

Pi ▴ 520

Greetings

I am trying to get a handle on the quality of submissions to dbSNP. The validation statuses are listed as:

1. multiple independent submissions;
2. frequency or genotype data;
3. submitter confirmation;
4. observation of all alleles in at least two chromosomes;
5. genotyped by HapMap;
6. sequenced in the 1000 Genomes Project.

Points 1, 5 and 6 seem fairly reliable, but I am interested in points 2 and 4.

How accurate does the genotype data need to be for point 2? For example, we have carried out sequencing work and found potential SNPs, only to find on resequencing that they were erroneous. The data came from pooled DNA, but we did have high-quality counts for both alleles with good coverage.

Edit: DQ has pointed out (thank you) that most genotypes are confirmed by Sanger sequencing. Can I assume that the genotype and allele frequencies in dbSNP are based on genotypes confirmed by a method such as Sanger sequencing?

Thank you for your time

dbsnp error • 2.8k views

Updated 13.6 years ago by David Quigley 11k • written 13.7 years ago by Pi ▴ 520

score 4 · Answer 1 · 2011-04-20

13.7 years ago

David Quigley 11k

The gold standard, and the simplest method, for validating one or a few candidate SNPs is Sanger sequencing. There is a long paragraph in the Wikipedia entry for dbSNP about data quality, with some citations you may find helpful. The NCBI has information about the validation statuses. See also this table. From NCBI:

"Validation by HapMap" in dbSNP simply means that a SNP was genotyped in HapMap (phase 1 & 2 over 270 samples, phase 3 over 1115 samples (not in dbSNP yet))... You should therefore look at "Validation by HapMap" in conjunction with "Validation by Frequency" to verify that the SNP’s minor allele has been observed at least twice

and

Validation by Frequency includes both population frequency data AND genotype data. In fact, the number of SNPs that have genotype data is bigger than the number of SNPs with only population frequency data. We compute frequency based on genotype data.

etc...
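To make the two quoted points concrete, here is a minimal Python sketch of how allele frequencies can be derived from genotype data, and how one might check that the minor allele has been observed at least twice. This is only an illustration, not how dbSNP actually does it: the function names, the `"A/G"` genotype string format, and the assumption of a biallelic diploid SNP are all mine.

```python
from collections import Counter


def allele_counts(genotypes):
    """Count alleles across diploid genotypes written like 'A/G' (assumed format)."""
    counts = Counter()
    for gt in genotypes:
        counts.update(gt.split("/"))
    return counts


def allele_frequencies(genotypes):
    """Allele frequency = allele count / total number of chromosomes observed."""
    counts = allele_counts(genotypes)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def minor_allele_seen_twice(genotypes):
    """True if at least two alleles are present and the rarest was seen >= 2 times.

    Assumes a biallelic SNP, so the rarest allele is the minor allele.
    """
    counts = allele_counts(genotypes)
    if len(counts) < 2:
        return False  # monomorphic in this sample: no minor allele observed
    return min(counts.values()) >= 2


if __name__ == "__main__":
    sample = ["A/A", "A/G", "A/A", "G/G"]  # made-up genotypes for illustration
    print(allele_frequencies(sample))
    print(minor_allele_seen_twice(sample))
```

Note that a check like this only counts observations. It says nothing about whether the underlying genotype calls were themselves correct, which is really the heart of your question.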

I'm not an expert here myself, but this may point you in the right direction and make these codes a bit less cryptic.

0

Thanks for your answer. I am assuming, then, that when they say an allele has to have been seen in at least two chromosomes, they mean by a technique as reliable as Sanger sequencing. I have looked at NGS data for individuals who appear to be heterozygous but turn out to be homozygous for either the reference allele or the novel allele, so that data in itself isn't reliable. I couldn't find anything that said which detection methods count as 'reliable enough'.

13.7 years ago by Pi ▴ 520