2.9 years ago
Space_Life
Hi all, I have a long list of gene names (>7000) that comes from more than 50 organisms. I am not concerned with where each gene comes from; however, I need to get the most accurate GO terms, or at least the UniProtKB entries, for these genes so that I can categorize them. UniProt's ID mapping requires the organism name, which makes this very time-consuming and produces overlapping data. My aim is to categorize these genes based on the GO terms and pathways they are associated with. I would appreciate any suggestions. Thank you.
What is the source of the gene names that you have: NCBI/Entrez, Ensembl, or something else? Knowing which ID type is used for your genes may help in providing some examples/suggestions.
That said, you can programmatically query both NCBI and Ensembl via their APIs:
https://rest.ensembl.org/
https://www.ncbi.nlm.nih.gov/home/develop/api/
Thank you for replying. The genes are from UniProt. I tried its REST API but it just gets stuck with no output.
I just tested this. If I remove 'taxon' from this code, the output is nothing. Thank you.
First, I think there is some confusion. Your gene names are not accession IDs, which is what I was asking about in my first question. What you have are generic gene symbols and the associated taxonomy, which should still be enough to get the info you want from UniProt.
I'm not quite sure why your above query did not return any results, as I am not familiar with Python's urllib functions. But I suspect that your URL is improperly formatted. For example, the following URL will produce the results that you want for gene pyrH from taxonomy 391774:
https://www.uniprot.org/uniprot/?query=gene:pyrH+and+taxonomy:391774&&columns=id,entry_name,reviewed&format=tab
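For reference, here is a minimal sketch of how Python code might assemble that URL with urllib, so the output can be compared character by character against the one above. The function name `single_gene_url` is hypothetical, and the endpoint and column names are simply copied from this thread; UniProt's current REST API (rest.uniprot.org) uses different paths and field names, so treat those as assumptions to adjust. The sketch only builds the string and sends nothing, and it uses a single `&` before `columns`.

```python
from urllib.parse import quote_plus

# Endpoint as written in this thread (assumption: adjust for the current API).
BASE_URL = "https://www.uniprot.org/uniprot/"


def single_gene_url(gene, taxonomy_id, columns=("id", "entry_name", "reviewed")):
    """Build a tab-format query URL for one gene symbol in one taxon."""
    query = f"gene:{gene} and taxonomy:{taxonomy_id}"
    # quote_plus turns spaces into '+'; ':' is kept readable via `safe`.
    encoded = quote_plus(query, safe=":")
    return f"{BASE_URL}?query={encoded}&columns={','.join(columns)}&format=tab"


print(single_gene_url("pyrH", 391774))
```

Printing the URL before fetching it makes it easy to paste into a browser and check that the query itself returns rows.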
You'll want to make sure that your code is producing a URL that is formatted as shown above. To query for multiple genes, either loop this for each gene or see the example below:
https://www.uniprot.org/uniprot/?query=gene:(pyrH+or+rpoC+or+tsf)+and+taxonomy:391774&&columns=id,entry_name,reviewed&format=tab
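With thousands of genes across dozens of organisms, one way to use the multi-gene form is to group the symbols by taxonomy ID and build one OR-query per batch. The following is a rough sketch under stated assumptions: `batched_urls` is a hypothetical helper, the batch size of 50 is an arbitrary guess (very long URLs may be rejected), and the endpoint and column names are copied from this thread rather than from the current UniProt API. It only builds URLs and makes no requests.

```python
from collections import defaultdict
from urllib.parse import quote_plus

# Endpoint as written in this thread (assumption: adjust for the current API).
BASE_URL = "https://www.uniprot.org/uniprot/"


def batched_urls(pairs, batch_size=50, columns=("id", "entry_name", "reviewed")):
    """Group (gene, taxonomy_id) pairs by taxon and build one URL per batch."""
    by_taxon = defaultdict(list)
    for gene, taxon in pairs:
        if gene not in by_taxon[taxon]:  # skip duplicate symbols within a taxon
            by_taxon[taxon].append(gene)

    urls = []
    for taxon, genes in by_taxon.items():
        for start in range(0, len(genes), batch_size):
            chunk = genes[start:start + batch_size]
            query = f"gene:({' or '.join(chunk)}) and taxonomy:{taxon}"
            # Spaces become '+'; ':' and parentheses stay readable.
            encoded = quote_plus(query, safe=":()")
            urls.append(
                f"{BASE_URL}?query={encoded}&columns={','.join(columns)}&format=tab"
            )
    return urls


# The second taxon here is only for illustration.
pairs = [("pyrH", 391774), ("rpoC", 391774), ("tsf", 391774), ("recA", 83333)]
for url in batched_urls(pairs):
    print(url)
```

Grouping by taxon keeps the number of requests close to the number of organisms instead of the number of genes.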