Undoubtedly, UCSC's snp132 table is the most convenient resource for looking up rsIDs for a list of variants. I have tried using NCBI's dbSNP135 resources, but they are either less convenient to use or do not have all the data in one place: b135_SNPChrPosOnRef_37_3.bcp.gz doesn't provide ref/alt/length information, while 00-All.vcf.gz is harder to work with than simple tab-separated files or a database.
Hence the question: how does UCSC prepare the snp132 table from dbSNP data?
I've looked at the source they published, but it doesn't seem to deal with data preparation (this seems to be the case at least for dbSNP).
Without this knowledge (or, rather, a tool-chain), I'm left with these options:
- use the slightly outdated snp132 (easiest)
- parse NCBI's 00-All.vcf.gz into a database table (a little more effort than above)
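For the second option, here is a minimal sketch of what loading the VCF into a database table could look like, assuming SQLite is acceptable and only the first five VCF columns (CHROM, POS, ID, REF, ALT) are needed. The table name snp, the column names, and the function load_vcf_lines are my own hypothetical choices, not anything UCSC or NCBI uses.

```python
import gzip
import sqlite3


def parse_vcf_lines(lines):
    """Yield (chrom, pos, rsid, ref, alt) from VCF text lines, skipping headers."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed record, skipped in this sketch
        chrom, pos, rsid, ref, alt = fields[:5]
        # ALT may hold several comma-separated alleles; it is stored as-is here.
        yield chrom, int(pos), rsid, ref, alt


def load_vcf_lines(conn, lines):
    """Create the (hypothetical) snp table and bulk-insert the parsed records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snp "
        "(chrom TEXT, pos INTEGER, rsid TEXT, ref TEXT, alt TEXT)"
    )
    conn.executemany("INSERT INTO snp VALUES (?, ?, ?, ?, ?)", parse_vcf_lines(lines))
    # Index for position lookups, which is the typical rsID query.
    conn.execute("CREATE INDEX IF NOT EXISTS snp_chrom_pos ON snp (chrom, pos)")
    conn.commit()


def load_vcf_file(db_path, vcf_path):
    """Load a gzipped VCF such as 00-All.vcf.gz into an SQLite file."""
    conn = sqlite3.connect(db_path)
    with gzip.open(vcf_path, "rt") as handle:
        load_vcf_lines(conn, handle)
    return conn
```

Looking up an rsID then becomes a plain query such as SELECT rsid, ref, alt FROM snp WHERE chrom = ? AND pos = ?. Note that pos stays 1-based here, as in the VCF.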
All I'm really missing in the b135_SNPChrPosOnRef_37_3.bcp.gz file are ref and alt (from which I can infer the length).
Alternatively, I'd love to use some command-line utility to convert NCBI's VCF to a simpler BED-like format (I had no success doing that with vcf-to-tab from vcftools).
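In the absence of a ready-made utility, a conversion like this can be sketched in a few lines of Python. This is only a rough illustration under my own assumptions: the output columns (chrom, start, end, rsID, ref, alt) are a hypothetical BED-like layout, not the snp132 schema, and the end coordinate is derived from the length of REF. The function names are made up for the example.

```python
import gzip


def vcf_to_bed_rows(lines):
    """Yield (chrom, start, end, rsid, ref, alt) tuples from VCF text lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed record, skipped in this sketch
        chrom, pos, rsid, ref, alt = fields[:5]
        start = int(pos) - 1    # VCF POS is 1-based, BED start is 0-based
        end = start + len(ref)  # BED end is exclusive, so this spans REF
        yield chrom, start, end, rsid, ref, alt


def convert(vcf_path, bed_path):
    """Write a tab-separated, BED-like file from a gzipped VCF."""
    with gzip.open(vcf_path, "rt") as src, open(bed_path, "w") as dst:
        for row in vcf_to_bed_rows(src):
            dst.write("\t".join(map(str, row)) + "\n")
```

Calling convert("00-All.vcf.gz", "dbsnp135.bed") would then produce a flat file that sorts and joins easily with standard command-line tools (the output filename is just an example). Multi-allelic ALT values are kept as a comma-separated string rather than split into separate rows.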
Edit: as a matter of fact, UCSC's public code repository does have the snpNcbiToUcsc.c source code, in src/hg/snp/snpLoad.
Thanks, I've updated the question with the [overlooked] path to snpNcbiToUcsc.c