Map a list of gene names in UniProt
2.9 years ago
Space_Life ▴ 50

Hi all, I have a long list of gene names (>7000) that come from more than 50 organisms. I am not concerned with which organism each gene comes from, but I need the most accurate GO terms, or at least the UniProtKB entries, for these genes so that I can categorize them. UniProt's ID mapping requires the organism name, which makes the process very time-consuming and produces overlapping data. My aim is to categorize these genes by the GO terms and pathways they are associated with. I would appreciate any suggestions. Thank you.

Mapping UniProt ID • 1.2k views

What is the source of your gene names: NCBI/Entrez, Ensembl, or something else? Knowing which ID type is used for your genes may help in providing examples or suggestions.

That said, you can programmatically query both NCBI and Ensembl via their APIs:
https://rest.ensembl.org/
https://www.ncbi.nlm.nih.gov/home/develop/api/


Thank you for replying. The genes are from UniProt. I tried its REST API but it just gets stuck with no output.

import urllib.parse
import urllib.request

# UniProt ID-mapping endpoint used for this request
url = 'https://www.uniprot.org/uploadlists/'

# Map gene names to UniProtKB accessions for a single taxon
params = {
    'from': 'GENENAME',
    'to': 'ACC',
    'format': 'tab',
    'columns': 'id,entry_name,reviewed',
    'query': 'tsf, pyrH, frr, ispH, btuD_1, rpoC, rpoB, rplL, rplJ, rplA, rplK',
    'taxon': '391774'
}

# Encode the parameters as form data and send them as a POST request
data = urllib.parse.urlencode(params)
data = data.encode('utf-8')
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as f:
    response = f.read()
print(response.decode('utf-8'))

I just tested this: if I remove 'taxon' from the code, there is no output at all. Thank you.


First, I think there is some confusion. Your gene names are not accession IDs, which is what I was asking about in my first question. What you have are generic gene symbols and the associated taxonomy, which should be enough to get the information you want from UniProt.

I'm not sure why your query above did not return any results, as I am not familiar with Python's urllib functions, but I suspect that your URL is improperly formatted. For example, the following URL will produce the results you want for gene pyrH from taxonomy 391774:

https://www.uniprot.org/uniprot/?query=gene:pyrH+and+taxonomy:391774&columns=id,entry_name,reviewed&format=tab

Make sure that your code produces a URL formatted as shown above. To query for multiple genes, either loop over the genes with this URL or combine them as in the example below:

https://www.uniprot.org/uniprot/?query=gene:(pyrH+or+rpoC+or+tsf)+and+taxonomy:391774&columns=id,entry_name,reviewed&format=tab
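If it helps, here is a rough Python sketch of how URLs in that format could be assembled from a gene list. The function names build_query_url and chunked, and the idea of splitting a long list into batches, are my own assumptions for illustration and not part of any UniProt client. The endpoint is simply the one quoted in this thread and may not match what UniProt currently serves, so treat this as a starting point only.

```python
from urllib.parse import quote

# Endpoint as quoted in this thread (assumption: it may have changed since).
BASE_URL = "https://www.uniprot.org/uniprot/"


def build_query_url(genes, taxon, columns=("id", "entry_name", "reviewed")):
    """Build a query URL in the gene:(A+or+B)+and+taxonomy:N format shown above.

    Hypothetical helper, not part of any UniProt library.
    """
    # Percent-encode each gene symbol so unusual characters cannot break the URL.
    gene_clause = "+or+".join(quote(gene, safe="") for gene in genes)
    if len(genes) > 1:
        gene_clause = f"({gene_clause})"
    query = f"gene:{gene_clause}+and+taxonomy:{taxon}"
    return f"{BASE_URL}?query={query}&columns={','.join(columns)}&format=tab"


def chunked(items, size):
    """Yield successive batches of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    genes = ["tsf", "pyrH", "frr", "ispH", "btuD_1", "rpoC"]
    # Batching keeps each URL short; the batch size of 3 is arbitrary.
    for batch in chunked(genes, 3):
        print(build_query_url(batch, 391774))
```

Each printed URL can be pasted into a browser to check it before wiring it into urllib.request.urlopen. With genes from 50+ organisms, you would group the gene names by taxonomy ID first and call the builder once per group.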
