Here's a way to use mygene.info to script queries. Say you have this script called mapEnsemblGeneToUniprot.py:
#!/usr/bin/env python
import sys
import mygene

mg = mygene.MyGeneInfo()

# Read one Ensembl gene ID per line from standard input
genes = []
for line in sys.stdin:
    genes.append(line.strip())

results = mg.querymany(genes, scopes='ensembl.gene', fields=['uniprot'], species='human', verbose=False)

header_written = False
for r in results:
    # Skip IDs for which no UniProt annotation was returned
    if 'uniprot' in r:
        if not header_written:
            sys.stdout.write('{}\t{}\t{}\n'.format('Ensembl', 'Swiss-Prot', 'TrEMBL'))
            header_written = True
        q = r['query']
        # Either key may be missing; default to an empty string so that a
        # value left over from the previous record is never printed
        sp = r['uniprot'].get('Swiss-Prot', '')
        te = r['uniprot'].get('TrEMBL', '')
        sys.stdout.write('{}\t{}\t{}\n'.format(q, sp, te))
Given the text file records.txt, e.g.:
ENSG00000225235
ENSG00000165359
ENSG00000150676
ENSG00000099308
ENSG00000142676
ENSG00000180776
ENSG00000236953
ENSG00000108848
ENSG00000277370
ENSG00000103811
ENSG00000101473
You could run the script like so:
$ ./mapEnsemblGeneToUniprot.py < records.txt
Ensembl Swiss-Prot TrEMBL
ENSG00000165359 Q5JSJ4 A0A1W2PPI5
ENSG00000150676 Q8IWF9 H0YDV3
ENSG00000099308 O60307 V9GYV0
ENSG00000142676 P62913 ['Q5VVC8', 'Q5VVD0', 'A0A2R8Y447']
ENSG00000180776 Q5W0Z9 ['A0A0D9SEN4', 'B4DRN8']
ENSG00000108848 O95232 ['Q86Y74', 'J3KPP4', 'A8K3C5', 'C9JL41', 'H0YA81', 'U3KQT3', 'H7C5U7', 'D6RHH0', 'E7EN55', 'H0YAX1', 'H0YBV7', 'H0YAR4', 'H0YAY6', 'D6RDI2']
ENSG00000103811 P09668 ['E9PN60', 'A0A0B4J217', 'E9PKT6', 'A0A087X0D5', 'E9PN84']
ENSG00000101473 O14734 ['F6VBM3', 'E9PRD4', 'E9PIS4', 'H7C5A7', 'Q9BR14', 'E9PJN0', 'H0Y698', 'E9PMC4']
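Note that the TrEMBL column above mixes plain accessions with Python list output, because the field holds either a single string or a list. If you want one delimited string per cell instead, a small helper along these lines could be used. This is only a sketch: the name as_field and the ';' separator are my own choices, not part of mygene.

```python
def as_field(value, sep=';'):
    """Render a str, a list of str, or None as one delimited string."""
    if value is None:
        return ''
    if isinstance(value, (list, tuple)):
        return sep.join(value)
    return str(value)


# A record shaped like the 'uniprot' field behind the ENSG00000180776 row above
record = {'Swiss-Prot': 'Q5W0Z9', 'TrEMBL': ['A0A0D9SEN4', 'B4DRN8']}

row = '\t'.join([
    'ENSG00000180776',
    as_field(record.get('Swiss-Prot')),
    as_field(record.get('TrEMBL')),
])
print(row)
```

In the script, you would wrap sp and te in as_field() before writing each line.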
Hello NS,
you could try Biomart.
fin swimmer
Hi, I am unable to reply to the above response and hence am writing here. Thank you, fin swimmer, for your suggestion. However, why are the IDs being replicated there? Also, I tried using the DAVID gene ID conversion for the same purpose. On DAVID, there are several UniProt IDs for a single Ensembl ID, so how do we identify which one is the first hit?
Thank you.
Post examples of IDs that are exhibiting this problem so we can evaluate the issue. As an aside, if you are posting from China, using the Chrome browser apparently allows one to use the ADD REPLY button.
Hi all, Thank you for your support. Incidentally, I had also written to UniProt with this query. The response indicates that there is a connectivity problem with job management between mirror sites. I am hoping they rectify it.
Noted on the functionality of the Add Reply button. Thank you.
On DAVID, for example, ENSG00000179388 returns three UniProt accessions: E5RIM5, B4DH80 and Q06889. Is it possible to identify the first hit here?
Kind regards
Mapping between databases is not always 1:1, as they have different philosophies of annotation. In the case of ENSG00000179388, different transcripts of the gene map to different UniProt identifiers.
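If you still need a single accession per gene, one possible convention is to prefer the reviewed Swiss-Prot entry and fall back to TrEMBL only when no Swiss-Prot entry exists. This is an assumption on my part, not something DAVID or UniProt prescribes. The sketch below assumes records shaped like the 'uniprot' field returned by mygene earlier in this thread; pick_representative is a hypothetical helper name, and taking the first TrEMBL accession is arbitrary.

```python
def pick_representative(uniprot):
    """Return (accession, source), preferring Swiss-Prot over TrEMBL.

    Each value may be a single string or a list of strings.
    Returns (None, None) when neither key is present.
    """
    for source in ('Swiss-Prot', 'TrEMBL'):
        value = uniprot.get(source)
        if not value:
            continue
        if isinstance(value, str):
            return value, source
        # Several accessions: take the first one (an arbitrary choice)
        return value[0], source
    return None, None


# Record taken from the ENSG00000142676 row shown earlier in the thread
record = {'Swiss-Prot': 'P62913',
          'TrEMBL': ['Q5VVC8', 'Q5VVD0', 'A0A2R8Y447']}
accession, source = pick_representative(record)
print(accession, source)
```

Keep in mind that this discards the transcript-level information, so it is only appropriate when one protein per gene is really what your analysis needs.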
Hi Emily, Thanks for explaining. Kind regards