If you want a fast (binary) search with a locally-stored dataset, where you know exactly what SNPs and gene annotations are being queried, here's a way to set up files to do your own querying, which scales up very well when you have a lot of SNPs.
1) Get annotations of interest and write them into a sorted BED file.
For example, here's a way to get Gencode v26 gene annotations:
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.basic.annotation.gff3.gz \
| gunzip -c \
| convert2bed --input=gff - \
> gencode.v26.bed
2) Get SNPs and write them into a text file sorted by SNP ID.
For hg38
, for instance:
$ LC_ALL=C
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/snp147.txt.gz \
| gunzip -c \
| awk -v OFS="\t" '{ print $5,$2,$3,($3+1) }' \
| sort -k1,1 \
> hg38.snp147.sortedByName.txt
3) Install pts-line-bisect:
$ git clone https://github.com/pts/pts-line-bisect.git
$ cd pts-line-bisect && make && cd ..
4) Run a query:
$ rs_of_interest=rs2814778
$ ./pts-line-bisect/pts_lbsearch -p hg38.snp147.sortedByName.txt ${rs_of_interest} \
| head -1 \
| cut -f2- \
| bedmap --echo --echo-map-id-uniq --range 1000 - gencode.v26.bed \
> answer.bed
The file answer.bed
contains the genomic interval of the SNP of interest, along with the IDs of any Gencode v26 gene annotations within 1000 nt upstream and downstream of the SNP.
To repeat step 4 within a Python script, you can use subprocess.check_output()
:
...
pts_lbsearch = os.path.join(project_dir, "pts-line-bisect", "pts_lbsearch")
if not os.path.exists(pts_lbsearch):
print "Content-type:application/json\r\n\r\n"
print json.dumps({'error' : 'could not locate pts_lbsearch [%s]' % (pts_lbsearch)})
sys.exit(errno.ENOENT)
if term.startswith('rs'):
query_fn = os.path.join(annotations_dir, "hg38.snp147.sortedByName.txt")
cmd = "%s -p %s %s" % (pts_lbsearch, query_fn, term)
try:
res = subprocess.check_output(cmd, shell=True)
if res and len(res) > 0:
(chr, start, stop) = res.strip().split('\n')[0].split('\t')
else:
pass
except subprocess.CalledProcessError as err:
sys.stderr.write("Query failed!\n")
sys.exit(errno.EIO)
#
# map 'chr', 'start', and 'stop' variables to annotations with
# another subprocess call (not shown -- use example above)
#
...
Did you try searching on NCBI dbSNP?
In addition to Ram's suggestion, if you know flanking sequences, just do NCBI BLAST search.