how to find genes based on SNP rsIDs
1
0
Entering edit mode
6.9 years ago

I have 8 SNPs meeting sigificance in a GWAS and need to find which genes they are associated with. I have searched by rsIDs using NCBI gene though can't find any association and it doesn't appear there are any associations in the literature. How can I look these up? I have searched them in the UCSC genome browser, though am not sure where to look to find associated genes. Please help

rsID GWAS SNP • 4.0k views
ADD COMMENT
0
Entering edit mode

Did you try searching on NCBI dbSNP?

ADD REPLY
0
Entering edit mode

In addition to Ram's suggestion, if you know flanking sequences, just do NCBI BLAST search.

ADD REPLY
0
Entering edit mode
6.9 years ago

If you want a fast (binary) search with a locally-stored dataset, where you know exactly what SNPs and gene annotations are being queried, here's a way to set up files to do your own querying, which scales up very well when you have a lot of SNPs.

1) Get annotations of interest and write them into a sorted BED file.

For example, here's a way to get Gencode v26 gene annotations:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.basic.annotation.gff3.gz \
    | gunzip -c \
    | convert2bed --input=gff - \
    > gencode.v26.bed

2) Get SNPs and write them into a text file sorted by SNP ID.

For hg38, for instance:

$ LC_ALL=C
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/snp147.txt.gz \
   | gunzip -c \
   | awk -v OFS="\t" '{ print $5,$2,$3,($3+1) }' \
   | sort -k1,1 \
   > hg38.snp147.sortedByName.txt

3) Install pts-line-bisect:

$ git clone https://github.com/pts/pts-line-bisect.git
$ cd pts-line-bisect && make && cd ..

4) Run a query:

$ rs_of_interest=rs2814778
$ ./pts-line-bisect/pts_lbsearch -p hg38.snp147.sortedByName.txt ${rs_of_interest} \
   | head -1 \
   | cut -f2- \
   | bedmap --echo --echo-map-id-uniq --range 1000 - gencode.v26.bed \
   > answer.bed

The file answer.bed contains the genomic interval of the SNP of interest, along with the IDs of any Gencode v26 gene annotations within 1000 nt upstream and downstream of the SNP.

To repeat step 4 within a Python script, you can use subprocess.check_output():

...
pts_lbsearch = os.path.join(project_dir, "pts-line-bisect", "pts_lbsearch")
if not os.path.exists(pts_lbsearch):
  print "Content-type:application/json\r\n\r\n"
  print json.dumps({'error' : 'could not locate pts_lbsearch [%s]' % (pts_lbsearch)})
  sys.exit(errno.ENOENT)

if term.startswith('rs'):
  query_fn = os.path.join(annotations_dir, "hg38.snp147.sortedByName.txt")
  cmd = "%s -p %s %s" % (pts_lbsearch, query_fn, term)
  try:
    res = subprocess.check_output(cmd, shell=True)
    if res and len(res) > 0:
      (chr, start, stop) = res.strip().split('\n')[0].split('\t')
    else:
      pass
  except subprocess.CalledProcessError as err:
    sys.stderr.write("Query failed!\n")
    sys.exit(errno.EIO)

#
# map 'chr', 'start', and 'stop' variables to annotations with 
# another subprocess call (not shown -- use example above)
#
...
ADD COMMENT

Login before adding your answer.

Traffic: 2674 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6