Question

How to search dbSNP using a list of SNPs and retrieve Gene name (hgnc symbol if existing, otherwise just whatever is in there)

1

Entering edit mode

3.6 years ago

f-rasmussen ▴ 10

I have a list of 500.000 SNPs from which I want to obtain the gene name. I try to search with biomaRt

library(data.table)
library(biomaRt)

rs <- fread("SNPs.txt")
ensembl_version = "https://dec2016.archive.ensembl.org"
ensembl <- useMart("ENSEMBL_MART_SNP", dataset = "hsapiens_snp")


getBM(attributes=c("refsnp_id", "associated_gene"), filters="snp_filter", values=rs, mart=ensembl, uniqueRows=TRUE)

However many of the SNPs return NA or simply nothing. Show here:

      refsnp_id associated_gene
1      rs425277           PRKCZ
2     rs1571149                
3     rs1240707                
4     rs1240708                
5      rs873927                
6      rs880051           SSU72
7      rs904589                
8      rs908742                
9      rs909823                
10     rs925905                
11       rs7290                
12       rs7407                
13    rs1878745                
14    rs2296716           SSU72
15    rs2298217                
16    rs2459994

When I search some of the rsIDs which did not produce a gene name on dbSNP, they are in fact associated with a gene name in the database. My question is then, how can I connect biomaRt to dbSNP and retrieve the correct gene names for all the SNPs in the list 'SNPs.txt'?

biomaRt R • 4.3k views

ADD COMMENT • link updated 2.9 years ago by Mahan ▴ 70 • written 3.6 years ago by f-rasmussen ▴ 10

score 0 · Answer 1 · 2021-03-31

Using EntrezDirect:

$ more id
rs425277
rs1571149
rs1240707
rs1240708
rs873927
rs880051
rs904589

$ for i in `cat ./id`; do esearch -db snp -query "${i} AND human [orgn]" | esummary | xtract -pattern DocumentSummary -element SNP_ID,NAME | uniq ; done
425277  PRKCZ
1571149 TMEM240
1240707 CCNL2   MRPL20-AS1
1240708 CCNL2   MRPL20-AS1
873927  LINC01770   LOC107985729
880051  SSU72
904589  ANKRD65 LOC105378585

GenoMax · Answer 2 · 2021-03-31

0

Entering edit mode

3.6 years ago

Alex Reynolds 36k

Via Python and the biothings_client library:

#!/usr/bin/env python

from biothings_client import get_client
import sys

variants = [
    'rs425277',
    'rs1571149',
    'rs1240707',
    'rs1240708',
    'rs873927',
    'rs880051',
    'rs1878745',
    'rs2296716',
    'rs2298217',
    'rs2459994'
]
vm = {x : None for x in variants}
vc = get_client('variant')
rs = vc.querymany(variants, scopes='dbsnp.rsid', fields='all', verbose=False)
for r in rs:
    q = r['query']
    if 'dbsnp' in r:
        d = r['dbsnp']
        if 'gene' in d:
            g = d['gene']
            if 'symbol' in g and not vm[q]:
                vm[q] = g['symbol']
for v in variants:
    sys.stdout.write('{}\t{}\n'.format(v, vm[v]))

Sample output:

$ ./get_hgnc_symbols_for_variants.py
rs425277    PRKCZ
rs1571149   TMEM240
rs1240707   None
rs1240708   None
rs873927    None
rs880051    SSU72
rs1878745   PRKCZ
rs2296716   SSU72
rs2298217   None
rs2459994   PRKCZ

There are lots of other attributes you could inspect and retrieve, by editing the for r in rs loop.

ADD COMMENT • link 3.6 years ago by Alex Reynolds 36k

0

Entering edit mode

Hi Alex! Thanks for this information. I have a list of SNPs in a .csv file along with other columns. Can you please tell me how I can supply those SNPs in the csv file in place of the following format?

variants = [ 'rs425277', 'rs1571149', 'rs1240707', 'rs1240708', 'rs873927', 'rs880051', 'rs1878745', 'rs2296716', 'rs2298217', 'rs2459994' ]

Thanks in advance

ADD REPLY • link 3.3 years ago by Mahan ▴ 70

1

Entering edit mode

The exact code will depend on your CSV file:

import io
import pandas as pd

csv_as_string = '''a,b,c,d,e,f'''

variants = pd.read_csv(io.StringIO(csv_as_string), header=None, skipinitialspace=True)
# chop up variants, as needed

ADD REPLY • link 3.3 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks Alex

ADD REPLY • link 3.2 years ago by Mahan ▴ 70

0

Entering edit mode

Hi Alex, I had used this methods to get gene names earlier. But when I tried it again, I'm getting the following error:

Variants = SNPs['SNP'].tolist()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

C:\ProgramData\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'SNP'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-6-271f5f94c88d> in <module>
----> 1 Variants = SNPs['SNP'].tolist()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'SNP'

Can you please suggest how I can address this issue? TIA

ADD REPLY • link updated 2.9 years ago by GenoMax 147k • written 2.9 years ago by Mahan ▴ 70

1

Entering edit mode

The KeyError you are getting is telling you there is no column (or "key") called SNP in the dataframe SNPs.

You may need to print out the list of column headers to find out what columns you do have, perhaps via print(SNPs.head()) or similar.