Errors in dbSNP
13.6 years ago
Pi ▴ 520

Greetings

I am trying to get a handle on the quality of submissions to dbSNP. The validation statuses are listed as:

  1. multiple independent submissions;
  2. frequency or genotype data;
  3. submitter confirmation;
  4. observation of all alleles in at least two chromosomes;
  5. genotyped by HapMap;
  6. sequenced in the 1000 Genomes Project

Points 1, 5 and 6 seem fairly reliable, but I am interested in points 2 and 4.

How accurate does the genotype data need to be for point 2? For example, we carried out sequencing work and found potential SNPs, only to find on resequencing that they were erroneous. This data was from pooled DNA, but we did have high-quality counts for both alleles with good coverage.

Edit: DQ has pointed out (thank you) that most genotypes are confirmed by Sanger sequencing. Can I assume that the genotype and allele frequencies in dbSNP are based on genotypes confirmed by a method such as Sanger sequencing?

Thank you for your time

dbsnp error • 2.8k views
13.6 years ago

The gold standard, and the simplest method for validating one or a few candidate SNPs, is Sanger sequencing. The Wikipedia entry for dbSNP has a long paragraph about data quality, with some citations you may find helpful. The NCBI has information about the validation statuses. See also this table. From NCBI:

"Validation by HapMap" in dbSNP simply means that a SNP was genotyped in HapMap (phase 1 & 2 over 270 samples, phase 3 over 1115 samples (not in dbSNP yet)...You should therefore look at "Validation by HapMap" in conjunction with "Validation by Frequency" to verify that the SNP’s minor allele has been observed at least twice

and

Validation by Frequency includes both population frequency data AND genotype data. In fact, the number of SNPs that have genotype data is bigger then the number of SNPs with only population frequency data. We compute frequency based on genotype data.

etc...

I'm not an expert here myself, but this may point you in the right direction and make these codes a bit less cryptic.
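To make the "Validation by Frequency" idea concrete, here is a minimal sketch of how an allele frequency can be derived from genotype data, and how one might check that the minor allele was observed at least twice. It is only an illustration: the `"A/G"` genotype string format and the function names `allele_counts` and `minor_allele_summary` are assumptions made for this example, not dbSNP's actual pipeline or file format.

```python
from collections import Counter


def allele_counts(genotypes):
    """Count alleles over diploid genotypes written as strings like "A/G".

    The "A/G" format is an assumption for this sketch.
    """
    counts = Counter()
    for genotype in genotypes:
        for allele in genotype.split("/"):
            counts[allele] += 1
    return counts


def minor_allele_summary(genotypes):
    """Summarise the minor (second most common) allele for one SNP."""
    counts = allele_counts(genotypes)
    total = sum(counts.values())  # number of chromosomes sampled
    ranked = counts.most_common()  # most frequent allele first
    if len(ranked) < 2:
        # Only one allele seen: the site is monomorphic in this sample.
        return {"minor_allele": None, "minor_count": 0,
                "maf": 0.0, "seen_twice": False}
    minor_allele, minor_count = ranked[1]
    return {
        "minor_allele": minor_allele,
        "minor_count": minor_count,
        "maf": minor_count / total,
        # Hypothetical check mirroring "minor allele observed at least twice".
        "seen_twice": minor_count >= 2,
    }


# Four individuals, so eight chromosomes in total.
example = minor_allele_summary(["A/A", "A/G", "A/A", "G/G"])
print(example)
```

Note that a check like this only counts reported genotypes; it says nothing about whether the underlying genotype calls were themselves confirmed, which is the question being asked here.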


Thanks for your answer. I am assuming, then, that when they say an allele has to have been seen in at least two chromosomes, they mean by a technique as reliable as Sanger sequencing. I've looked at NGS data for individuals who look like they could be heterozygous but turn out to be homozygous for either the reference allele or the novel allele, so that data in itself isn't reliable. I couldn't find anything that said what detection method was 'reliable enough'.
