I have a huge dataset of SNPs for which I am trying to get hg19 locations based only on the rsID. Right now I do that by downloading the latest version of dbSNP, turning it into an SQLite database, and running one huge, long-running query. This works, although it takes forever, but it fails for every SNP that has been merged, e.g. rs111199278.
I need some way to get the positions of these old rsIDs in batch, possibly by converting them to the latest rsID in dbSNP and then looking them up with my slow database query.
Is there a good way to do this efficiently? I think I will end up needing to look up about 100K-500K rsIDs (I don't know yet because my big location lookup hasn't finished).
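For reference, here is a minimal sketch of what a batched, indexed lookup against such an SQLite database could look like. The table name `snp`, its columns (`rsid`, `chrom`, `pos`), and the helper `lookup_positions` are assumptions made up for illustration, not the actual schema described above, and the toy rows are placeholders rather than real coordinates.

```python
import sqlite3

def lookup_positions(conn, rsids, batch_size=500):
    """Return {rsid: (chrom, pos)} for rsIDs present in the hypothetical `snp` table."""
    found = {}
    rsids = list(rsids)
    for start in range(0, len(rsids), batch_size):
        batch = rsids[start:start + batch_size]
        # One "?" per ID keeps the query parameterised; 500 stays under
        # SQLite's traditional default limit of 999 bound variables.
        placeholders = ",".join("?" * len(batch))
        query = f"SELECT rsid, chrom, pos FROM snp WHERE rsid IN ({placeholders})"
        for rsid, chrom, pos in conn.execute(query, batch):
            found[rsid] = (chrom, pos)
    return found

# Toy in-memory database standing in for the real dbSNP-derived one (placeholder rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snp (rsid TEXT, chrom TEXT, pos INTEGER)")
conn.executemany(
    "INSERT INTO snp VALUES (?, ?, ?)",
    [("rs1", "1", 100), ("rs2", "2", 200)],
)
# An index on rsid lets each lookup avoid a full table scan.
conn.execute("CREATE INDEX IF NOT EXISTS idx_snp_rsid ON snp (rsid)")

wanted = ["rs1", "rs2", "rs3"]
positions = lookup_positions(conn, wanted)
# IDs that come back empty are the candidates for a merged-rsID lookup.
missing = [rsid for rsid in wanted if rsid not in positions]
```

The IDs left in `missing` are the ones that would need to be translated to their current rsID before a second pass.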
Maybe a
that gives you "merged into rs#" might be of help?
Yeah, that would work in principle, but I would need to put a delay of at least half a second between requests to avoid making NCBI angry, which would put the run time at around half a day. Still, it is a pretty good idea if there isn't a more efficient way to do it somewhere. I was kind of hoping for some kind of lookup table or queryable database though.
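As a rough check of that estimate, here is the arithmetic for one request per rsID with a fixed delay. The function name `polling_hours` is just an illustrative helper, and the calculation ignores the time each request itself takes.

```python
def polling_hours(n_ids, delay_s=0.5):
    """Hours spent waiting if every ID needs one request followed by delay_s seconds."""
    return n_ids * delay_s / 3600

low = polling_hours(100_000)   # lower end of the expected range, about 14 hours
high = polling_hours(500_000)  # upper end of the expected range, about 69 hours
print(f"{low:.1f} h to {high:.1f} h")
```

So "half a day" holds for roughly 100K IDs; at 500K the delay alone would add up to nearly three days.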
This seems to me to be an important topic, and I wonder if anyone can comment on the process by which NCBI keeps this resource up to date.
I found the change stamp for this merge id resource at
https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/database/organism_data/
specifically
https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/database/organism_data/RsMergeArch.bcp.gz
The timestamp is
RsMergeArch.bcp.gz 2018-02-07 12:09 146M
So that table is 3 years old, but note that, according to the directory path, it is associated with b151 (dbSNP) and GRCh38 (hg38), so perhaps it is "up to date" even though it is 3 years old.
Has the merging process ended? If not, is there a more recent resource to use?
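For anyone who wants to use that table as the lookup table asked for above, here is a minimal sketch of loading it into a dictionary. It assumes the file is tab-delimited with the old ID (rsHigh) in column 1, the ID it was merged into (rsLow) in column 2, and the current ID (rsCurrent) in column 7, all without the "rs" prefix. That layout is my recollection of the legacy dbSNP schema and should be checked against the schema files before relying on it. The function names and the sample row are made up for illustration.

```python
import gzip

def load_merge_map(lines):
    """Map each retired rsID to its current rsID from RsMergeArch-style rows."""
    merged = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip rows that do not match the assumed layout
        rs_high, rs_low, rs_current = fields[0], fields[1], fields[6]
        # Prefer rsCurrent (assumed to resolve chained merges); fall back to rsLow.
        merged["rs" + rs_high] = "rs" + (rs_current or rs_low)
    return merged

def load_merge_map_from_file(path):
    """Read a gzip-compressed RsMergeArch file, e.g. RsMergeArch.bcp.gz."""
    with gzip.open(path, "rt") as handle:
        return load_merge_map(handle)

# Made-up row in the assumed layout, not real dbSNP data.
sample = ["900002\t900001\t130\t0\t2009-01-01\t2009-01-01\t900001\t0\t\n"]
merge_map = load_merge_map(sample)
current = merge_map.get("rs900002", "rs900002")  # unmerged IDs map to themselves
```

With the map in memory, translating 100K-500K rsIDs is a dictionary lookup per ID and needs no network requests.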
I think you should ask NCBI support about this by putting a ticket in via their help desk. Please post their response here once you hear back from them.
Thank you for this tip. Here is the answer from NLM:
"Those files are from the legacy snp build process, which was decommissioned in 2018.
The SNP build has been using SPDI-based asserted location workflow since build 152. That system no longer produces relevant file on this. The new build system only produces json and vcf: https://ftp.ncbi.nlm.nih.gov/snp/latest_release/"
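If the JSON release is the replacement, a merge table can probably be rebuilt from it. The sketch below assumes that the release contains a line-delimited JSON file of merged records (I believe it is named refsnp-merged.json.bz2, but please verify in the directory listing) and that each record has a `refsnp_id` field and a `merged_snapshot_data.merged_into` list. Those field names are my assumption about the format and are not confirmed by the NLM reply. The function names and sample record are made up for illustration.

```python
import bz2
import json

def parse_merged_records(lines):
    """Map each merged rsID to the first rsID it was merged into."""
    merged = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Field names below are assumptions about the dbSNP JSON layout.
        targets = record.get("merged_snapshot_data", {}).get("merged_into", [])
        if targets:
            merged["rs" + str(record["refsnp_id"])] = "rs" + str(targets[0])
    return merged

def parse_merged_file(path):
    """Stream a bz2-compressed, line-delimited JSON file without unpacking it first."""
    with bz2.open(path, "rt") as handle:
        return parse_merged_records(handle)

# Made-up record in the assumed layout, not real dbSNP data.
sample = ['{"refsnp_id": "900002", "merged_snapshot_data": {"merged_into": ["900001"]}}']
merge_map = parse_merged_records(sample)
```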