Question

Obtaining dbSNP Identifiers From UCSC SNP131 Table.

4

14.3 years ago

Daniel ▴ 50

I am currently using the UCSC SNP database to obtain dbSNP identifiers for a given location (equivalent to the chromEnd field in the table). The description of the SNP131 table states that the table contains:

Polymorphism data from dbSnp database or genotyping arrays.

I was wondering if there is any way to differentiate between the different sources. As far as I have seen in the documentation, earlier releases of the UCSC tables contained a source field [1]. Is there any way to filter by source in build 131?

In fact, I am only interested in dbSNP identifiers that come from GRCh37, but I am afraid that for this I need to either parse the XML data or rebuild the database files from NCBI's dbSNP. Is this assumption correct, or is there some way to do this with the much more convenient UCSC MySQL dumps?

I don't have much experience with either of those sources, so I appreciate any pointer in the right direction. Cheers.

[1] http://genome.ucsc.edu/goldenPath/gbdDescriptionsOld.html#Snp

dbsnp ucsc • 5.5k views

ADD COMMENT • link updated 10.8 years ago by Biostar 20 • written 14.3 years ago by Daniel ▴ 50

Ram · Answer 1 · 2010-09-29

5

14.3 years ago

Pierre Lindenbaum 164k

I'm missing something here: you said you want the SNPs for dbSNP 131 on GRCh37, that is to say the UCSC genome hg19. This table is available from the UCSC 'download' area or via MySQL:

mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg19 -e 'select name,chrom,chromStart,chromEnd from snp131 limit 10'
+------------+-------+------------+----------+
| name       | chrom | chromStart | chromEnd |
+------------+-------+------------+----------+
| rs56289060 | chr1  |      10433 |    10433 |
| rs55998931 | chr1  |      10491 |    10492 |
| rs62636508 | chr1  |      10518 |    10519 |
| rs58108140 | chr1  |      10582 |    10583 |
| rs10218492 | chr1  |      10827 |    10828 |
| rs10218493 | chr1  |      10903 |    10904 |
| rs10218527 | chr1  |      10926 |    10927 |
| rs28853987 | chr1  |      10937 |    10938 |
| rs79537094 | chr1  |      11001 |    11002 |
| rs28484712 | chr1  |      11013 |    11014 |
+------------+-------+------------+----------+
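
For the lookup in the question (identifiers at a known position), the same table can be filtered on chrom and chromEnd. What follows is only a minimal sketch under stated assumptions: an in-memory SQLite table stands in for a local copy of snp131, it holds just three rows copied from the output above, and the helper name `rsids_at` is hypothetical.

```python
import sqlite3

# Toy stand-in for a local snp131 table: only the four columns queried above,
# filled with three rows copied from the output shown above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snp131 (name TEXT, chrom TEXT, chromStart INTEGER, chromEnd INTEGER)"
)
conn.executemany(
    "INSERT INTO snp131 VALUES (?, ?, ?, ?)",
    [
        ("rs55998931", "chr1", 10491, 10492),
        ("rs62636508", "chr1", 10518, 10519),
        ("rs58108140", "chr1", 10582, 10583),
    ],
)


def rsids_at(conn, chrom, chrom_end):
    """Hypothetical helper: identifiers whose chromEnd equals the given position."""
    rows = conn.execute(
        "SELECT name FROM snp131 WHERE chrom = ? AND chromEnd = ? ORDER BY name",
        (chrom, chrom_end),
    )
    return [name for (name,) in rows]


print(rsids_at(conn, "chr1", 10519))  # prints ['rs62636508']
```

Filtering on chrom as well as chromEnd matters because a bare chromEnd value can occur on several chromosomes. The same WHERE clause should carry over to the mysql command line shown above.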

ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 14.3 years ago by Pierre Lindenbaum 164k

2

There are several reasons a single dbSNP identifier might map to several locations like the example you cited: http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpsnpfaq&part=Build.The_dbSNP_Mapping_Process#Build.why_does_the_chromosome_report_sho.

It can map to different chromosomes if it's in a highly repetitive or duplicated region.

Of course now you have my curiosity piqued as to why this one specifically.

ADD REPLY • link updated 5.3 years ago by Ram 44k • written 14.3 years ago by Treylathe ▴ 950

1

When a new SNP is inserted into a database, it is believed that its flanking sequences will be sufficient to map it to a single place on the genome. Later, as the whole genome is sequenced, it is found that the region was not as unique as was thought, and the SNP can be mapped several times (genomic duplications, pseudogenes).

ADD REPLY • link 14.3 years ago by Pierre Lindenbaum 164k

0

Pierre, thank you for the answer. I am using a copy of UCSC's SNP131 dump locally on my machine. My confusion with this dataset started when I realized that many dbSNP identifiers map to the very same location (e.g. chromEnd='165360'). Similarly, a single dbSNP identifier often has more than one location (e.g. name = 'rs74957741'). I don't understand how this can happen. I suppose I automatically assumed that what I perceive as inconsistencies in the dbSNP identifiers was due to there being many different sources. I hope this makes more sense.
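
Both observations can be checked with GROUP BY queries. This is a rough sketch on toy data, not the real table: an in-memory SQLite table stands in for the local dump, every coordinate is made up, and `rsA` / `rsB` are placeholder identifiers. Only the name rs74957741 comes from this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snp131 (name TEXT, chrom TEXT, chromStart INTEGER, chromEnd INTEGER)"
)
conn.executemany(
    "INSERT INTO snp131 VALUES (?, ?, ?, ?)",
    [
        # One identifier at several places (coordinates are made up).
        ("rs74957741", "chrX", 1000, 1001),
        ("rs74957741", "chrY", 2000, 2001),
        ("rs74957741", "chr9", 3000, 3001),
        # Two placeholder identifiers sharing one site (made up).
        ("rsA", "chr1", 165359, 165360),
        ("rsB", "chr1", 165359, 165360),
    ],
)

# Identifiers that occur at more than one location.
multi_mapped = conn.execute(
    "SELECT name, COUNT(*) FROM snp131 GROUP BY name HAVING COUNT(*) > 1"
).fetchall()

# Locations that carry more than one identifier.
shared_sites = conn.execute(
    "SELECT chrom, chromEnd, COUNT(DISTINCT name) FROM snp131 "
    "GROUP BY chrom, chromStart, chromEnd HAVING COUNT(DISTINCT name) > 1"
).fetchall()

print(multi_mapped)
print(shared_sites)
```

Grouping sites by chrom, chromStart and chromEnd together, rather than by chromEnd alone, avoids lumping unrelated positions on different chromosomes into one group.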

ADD REPLY • link 14.3 years ago by Daniel ▴ 50

0

Of course this won't explain all the anomalies, but the one example you give, rs74957741, does indeed seem to map to a repetitive/duplicated region. I BLATted about 1000 bp of flanking region from one location (on Y), and the three top BLAT hits included the other two locations (X and 9), with 100% and 97.9% identity.

ADD REPLY • link 14.3 years ago by Treylathe ▴ 950

0

Thanks for the answer. So the dbSNP identifier of a SNP is defined by the variation plus its flanking sequences? If these look the same, the location does not matter?

ADD REPLY • link 14.3 years ago by Daniel ▴ 50

0

Pierre: Thanks. Initially, I assumed that when an additional mapping is found, it would get another dbSNP identifier. Anyway, can you explain to me why one site can have multiple identifiers?

ADD REPLY • link 14.3 years ago by Daniel ▴ 50

0

http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpsnpfaq&part=Build.Merging_RefSNP_Numbers_and_RefSNP#Build.we_observed_two_different_snps_at

"If you mean “can two SNPs map to the same contig and position”, then yes, it is possible. If the two SNPs map to same contig location, but have different variation classes (e.g. a true SNP like “A/G”, and an in/del SNP like “-/A”), we will not cluster them in the future. If the two SNPs have the same variation class (e.g. both are true single base substitutions), then we will merge them in a subsequent build. (3/3/05)"

ADD REPLY • link updated 5.3 years ago by Ram 44k • written 14.3 years ago by Pierre Lindenbaum 164k

0

True, so technically there shouldn't be true SNPs that map to the same site, right?

When I run a query on UCSC's SNP131 to search for all sites that have more than a single true SNP (class='single') mapped to them, the query returns 1.5 million entries out of about 20 million true SNPs in the data set.

Is this just an inconsistency between UCSC and dbSNP? I read that dbSNP often merges entries only after a new build is introduced, but 1.5 million seems a bit much.

Or am I getting something wrong here ?
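
A query of that kind might look like the sketch below. It is only an illustration under stated assumptions: an in-memory SQLite table stands in for the local dump, the rows are made up, and `rsA` to `rsD` are placeholder identifiers. The column name `class` and the value 'single' are taken from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snp131 "
    "(name TEXT, chrom TEXT, chromStart INTEGER, chromEnd INTEGER, class TEXT)"
)
conn.executemany(
    "INSERT INTO snp131 VALUES (?, ?, ?, ?, ?)",
    [
        # Site with two single-base SNPs: should be reported.
        ("rsA", "chr1", 100, 101, "single"),
        ("rsB", "chr1", 100, 101, "single"),
        # Site with a single-base SNP and an insertion: should not be reported.
        ("rsC", "chr1", 200, 201, "single"),
        ("rsD", "chr1", 200, 201, "insertion"),
    ],
)

# Sites carrying more than one distinct identifier of class 'single'.
sites = conn.execute(
    "SELECT chrom, chromStart, chromEnd, COUNT(DISTINCT name) "
    "FROM snp131 WHERE class = 'single' "
    "GROUP BY chrom, chromStart, chromEnd "
    "HAVING COUNT(DISTINCT name) > 1"
).fetchall()

n_sites = len(sites)                      # number of affected sites
n_rows = sum(count for *_, count in sites)  # number of SNP rows at those sites

print(sites, n_sites, n_rows)
```

Whether a total refers to sites or to SNP rows depends on which of the two counts is reported, so it is worth checking which one a figure such as 1.5 million represents.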

ADD REPLY • link 14.3 years ago by Daniel ▴ 50

0

Anyway, I will mark your answer as accepted since the thread has moved away from the original question.

ADD REPLY • link 14.3 years ago by Daniel ▴ 50

score 2 · Answer 2 · 2010-09-29

2

14.3 years ago

Treylathe ▴ 950

GRCh37 is hg19 as Pierre mentions. So, I would assume that using the table browser with the hg19 build, dbSNP 131, you'd get only dbSNP identifiers from that build. Unless I'm missing something in the question.

In the Table Browser interface, you could simply choose the genome, the Feb. 2009 (GRCh37/hg19) assembly, then the SNPs (131) track and the snp131 table, and then "all fields from selected table" (or "selected fields") to get the identifiers. Or, for a direct SQL approach, use Pierre's suggestion.

Also, a word of caution: that description page at UCSC is no longer maintained:

NOTE: This page is no longer maintained. For complete up-to-date table descriptions, use the "describe table schema" button in the Table Browser.

It'd be nice to see a page generated from the table schema to recreate this page (I like it for browsing and searching, which I can't do with the table schema buttons). Just be cautious using the page, as the descriptions are getting old.

ADD COMMENT • link 14.3 years ago by Treylathe ▴ 950

0

I just used the old description page to point out that previous versions had a source field, which the most current version is missing. The name of the page, 'gbdDescriptionsOld', already implied to me that this page is a bit older. Thanks anyway for confirming that the information there is obsolete.

ADD REPLY • link 14.3 years ago by Daniel ▴ 50

0

Figured you did, but I know I miss details like that sometimes :).. and for other readers in case they did.

And a good place to put my wish for an updated page of descriptions! :D

ADD REPLY • link 14.3 years ago by Treylathe ▴ 950