I am currently using the UCSC SNP database to obtain dbSNP identifiers for a given location (equivalent to the chromEnd field in the table). The description of the SNP131 table states that the table contains:
Polymorphism data from dbSnp database or genotyping arrays.
I was wondering if there is any way to differentiate between the different sources. As far as I can tell from the documentation, earlier releases of the UCSC tables contained a source field [1]. Is there any way to filter by source in build 131?
In fact, I am only interested in dbSNP identifiers that come from GRCh37, but I am afraid that for this I need to either parse the XML data or rebuild the database files from NCBI's dbSNP. Is this assumption correct, or is there some way to do this with the much more convenient UCSC MySQL dumps?
I don't have much experience with either of those sources, so I would appreciate any pointers in the right direction. Cheers.
[1] http://genome.ucsc.edu/goldenPath/gbdDescriptionsOld.html#Snp
There are several reasons a single dbSNP identifier might map to several locations like the example you cited: http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpsnpfaq&part=Build.The_dbSNP_Mapping_Process#Build.why_does_the_chromosome_report_sho.
It can map to different chromosomes if it's in a highly repetitive or duplicated region.
Of course, now you have my curiosity piqued as to why this one specifically.
When a new SNP is inserted into the database, its flanking sequences are assumed to be sufficient to map it to a single place on the genome. Later, as the whole genome is sequenced, it is found that the region was not as unique as was thought, and the SNP can be mapped several times (genomic duplications, pseudogenes).
Pierre, thank you for the answer. I am using a copy of UCSC's SNP131 dump locally on my machine. My confusion with this dataset started when I realized that many dbSNP identifiers map to the very same location (e.g. chromEnd='165360'). Similarly, a single dbSNP identifier often has more than one location (e.g. name = 'rs74957741'). I don't understand how this can happen. I suppose I automatically assumed that what I perceive as inconsistencies in the dbSNP identifiers is due to the many different sources. I hope this makes more sense.
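To see how common the "one identifier, several locations" case is in a local copy, a grouped query over the name column is probably the quickest check. The snippet below is only a minimal sketch: it uses Python's built-in sqlite3 with a tiny in-memory table as a stand-in for the MySQL dump, the rows and the rs_demo_* names are made up for illustration, and only the chrom, chromStart, chromEnd, name and class columns are modelled. The SELECT itself should carry over to MySQL largely unchanged, although MySQL will want an alias on the subquery.

```python
import sqlite3

# Toy stand-in for the snp131 table. The rows are invented for illustration
# and do not correspond to real dbSNP records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snp131 ("
    "chrom TEXT, chromStart INTEGER, chromEnd INTEGER, "
    'name TEXT, "class" TEXT)'
)
rows = [
    ("chrX", 100, 101, "rs_demo_1", "single"),
    ("chrY", 200, 201, "rs_demo_1", "single"),
    ("chr9", 300, 301, "rs_demo_1", "single"),
    ("chr1", 400, 401, "rs_demo_2", "single"),
]
conn.executemany("INSERT INTO snp131 VALUES (?, ?, ?, ?, ?)", rows)

# Identifiers that occur at more than one distinct genomic location.
multi_mapped = conn.execute(
    """
    SELECT name, COUNT(*) AS n_locations
    FROM (SELECT DISTINCT name, chrom, chromStart, chromEnd FROM snp131)
    GROUP BY name
    HAVING COUNT(*) > 1
    ORDER BY name
    """
).fetchall()

print(multi_mapped)  # [('rs_demo_1', 3)]
```

Identifiers returned by a query like this are candidates for the repetitive or duplicated regions mentioned in the answers above.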
Of course this won't answer it for all the anomalies, but the one example you give, rs74957741, does indeed seem to map to a repetitive/duplicated region. I blatted about 1000 bp of flanking region from one location (on Y), and the three top BLAT hits were that location itself and the other two locations (X and 9), with 100% and 97.9% identity.
Thanks for the answer. So the dbSNP identifier of a SNP is defined by the variation plus the flanking sequences? If it looks the same, the location does not matter?
Pierre: Thanks. Initially, I assumed that when an additional mapping is found, it would get another dbSNP identifier. Anyway, can you explain to me why one site can have multiple identifiers?
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpsnpfaq&part=Build.Merging_RefSNP_Numbers_and_RefSNP#Build.we_observed_two_different_snps_at
If you mean “can two SNPs map to the same contig and position”, then yes, it is possible. If the two SNPs map to same contig location, but have different variation classes (e.g. a true SNP like “A/G”, and an in/del SNP like “-/A”), we will not cluster them in the future. If the two SNPs have the same variation class (e.g. both are true single base substitutions), then we will merge them in a subsequent build. (3/3/05)
True, so technically there shouldn't be true SNPs that map to the same site, right?
When I run a query on UCSC's SNP131 to search for all sites that have more than a single true SNP (class='single') mapped to them, the query returns 1.5 million entries out of about 20 million true SNPs in the data set.
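One thing worth ruling out with a count like this is how a "site" is defined in the GROUP BY. A position is only unique within a chromosome, so grouping on chromEnd alone lumps together unrelated SNPs that happen to share a coordinate on different chromosomes. The sketch below compares the two groupings on a made-up table. It is an illustration under assumptions, not the query that was actually run: sqlite3 stands in for MySQL, the rows and rs_demo_* names are invented, and shared_sites is a hypothetical helper.

```python
import sqlite3

# Toy stand-in for snp131. The rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snp131 ("
    "chrom TEXT, chromStart INTEGER, chromEnd INTEGER, "
    'name TEXT, "class" TEXT)'
)
rows = [
    ("chr1", 999, 1000, "rs_demo_1", "single"),
    ("chr1", 999, 1000, "rs_demo_2", "single"),     # same site, same class
    ("chr2", 1999, 2000, "rs_demo_3", "single"),
    ("chr5", 1999, 2000, "rs_demo_4", "single"),    # same chromEnd, other chromosome
    ("chr3", 499, 500, "rs_demo_5", "single"),
    ("chr3", 499, 500, "rs_demo_6", "insertion"),   # same site, different class
]
conn.executemany("INSERT INTO snp131 VALUES (?, ?, ?, ?, ?)", rows)


def shared_sites(group_cols):
    """Count groups holding more than one distinct class='single' identifier.

    group_cols is a hypothetical parameter: the comma-separated column list
    that defines what counts as one site.
    """
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT 1 FROM snp131
            WHERE "class" = 'single'
            GROUP BY {group_cols}
            HAVING COUNT(DISTINCT name) > 1
        )
    """
    return conn.execute(sql).fetchone()[0]


by_position_only = shared_sites("chromEnd")
by_chrom_and_position = shared_sites("chrom, chromStart, chromEnd")

print(by_position_only)       # 2
print(by_chrom_and_position)  # 1
```

On the toy data only the chr1 site is a genuine case of two true SNPs at one location; the chr2/chr5 pair is counted only when the chromosome is left out of the grouping.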
Is this just an inconsistency between UCSC and dbSNP? I read that dbSNP often merges entries only after a new build is introduced, but 1.5 million seems a bit much.
Or am I getting something wrong here?
Anyway, I will mark your answer as accepted since the thread has moved away from the original question.