If you want to do things locally, you can make a searchable resource you can query as often you like.
1) Get SNPs and write them into a text file sorted by SNP ID.
For hg19
, for instance, here is a way to get all SNPs into a form queryable with a binary search:
$ LC_ALL=C
$ wget -qO- https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz \
| gunzip -c \
| vcf2bed --sort-tmpdir=${PWD} - \
| awk -v FS="\t" -v OFS="\t" '{ print $4, $1, $2, $3 }' \
| sort -k1,1 \
> hg19.dbSNPAllV151.sortedByName.txt
This will take a while to build and will take considerable disk space, compared with the compressed option.
You may also want additional columns, rather than the position and ID alone. Use head
to inspect the output of awk
and decide what you want to keep and discard. Read the documentation for vcf2bed
to be sure that you understand how one-indexed VCF records are turned into zero-indexed BED positions.
However you do this step, you should ultimately have the rsID in the first column, and you want the variants sorted lexicographically by rsID.
2) Install pts-line-bisect:
$ git clone https://github.com/pts/pts-line-bisect.git
$ cd pts-line-bisect && make && cd ..
3) Run a query:
$ rs_of_interest=rs2814778
$ ./pts-line-bisect/pts_lbsearch -p hg19.dbSNPAllV151.sortedByName.txt ${rs_of_interest} \
| head -1 \
| awk -v FS="\t" -v OFS="\t" '{ print $2, $3, $4, $1 }'
Step 3 can be put into a script so that you can re-run it with your SNP of interest on demand.
For instance:
#!/bin/bash
pts_lbsearch_bin=./pts-line-bisect/pts_lbsearch
sorted_snp_txt=hg19.dbSNPAllV151.sortedByName.txt
${pts_lbsearch_bin} -p ${sorted_snp_txt} $1 | head -1 | awk -v FS="\t" -v OFS="\t" '{ print $2, $3, $4, $1 }'
Then:
$ ./search_snps.sh rs2814778 > rs2814778.bed
Use scripting to loop through a list of SNPs, to get however many positions are needed. Pipe all results to sort-bed
to get a sorted BED file.
Does this mean this question is solved? Please separate your post into question and answer, post the answer below.
Still very low. Wait for some other faster solution.
Obviously, all the index method is target to genomic region, rather than rsID, therefore, if you search by rsID, it is always very very slow from gnomad.genomes.r2.1.sites.vcf.bgz or All_20180423.vcf.bgz. I think the only solution maybe build an index version to rsID.