How can I get the features of a certain protein from Uniprot?
17 months ago
tidalArms • 0

Hey everyone. I am trying to design a Python function that uses information from Uniprot about the features of a given protein. The features I am interested in are regions, domains, and secondary structures.

I can access the API already and get the amino acid sequence of a protein of interest using simple code such as:

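    import json
    import urllib.request  # "import urllib" alone does not load the request submodule

    UNIPROT_API_URL = "https://rest.uniprot.org/uniprotkb"
    protein_name = "A0A075B716"  # accession of the protein of interest
    url = '{}/{}.json'.format(UNIPROT_API_URL, protein_name)
    uniprot_results = json.load(urllib.request.urlopen(url))
    # The entry JSON stores the amino acid sequence under sequence -> value
    print(uniprot_results['sequence']['value'])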

However, this approach gets me only part of what I need. Besides the amino acid sequence, I also need the protein's features (e.g. domains, regions, and secondary structures), as well as the start and end positions of those features within the sequence. My efforts to locate and retrieve this information from Uniprot have so far been unsuccessful. I know the information is present on Uniprot, as can be seen in this entry for A0A075B716 (https://www.uniprot.org/uniprotkb/P08708/entry#ptm_processing).

Furthermore, I want to distinguish between proteins by taxonomy (e.g. only getting A0A075B716 from humans). I know this is possible with a URL such as UNIPROT_API_URL = "https://rest.uniprot.org/uniprotkb/search?query=(reviewed:true)%20AND%20(organism_id:9606)", but I am still having difficulty figuring out how to set up the URL, the query, and the other relevant parameters.

It seems like this information can be accessed through https://www.ebi.ac.uk/proteins/api/doc/#/, but I'm not sure how to set up the API request, other than requestURL = "https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=100&accession=A0A075B716", which also gives me no information about the features I need. Also, given what I have read here (https://groups.google.com/g/ebi-proteins-api/c/4Puf0txfeI8), it seems like the EBI API is deprecated and now works directly from Uniprot, although how this is possible remains a mystery to me.

I am really hitting a wall in terms of how I should approach this problem. The code I am basing my work on used a .tsv file that collected all the relevant Uniprot annotations into a dataframe that looked like this: [image: uniprot df example]. That is basically the kind of dataframe I am also trying to generate, but per protein_id. If there is another way to access the Uniprot API that I am not using right now, it would be great to find out.

UPDATE

Ok, so it seems like I was looking at feature information from this protein (https://www.uniprot.org/uniprotkb/P08708/entry), which is an isoform of the protein of interest (https://www.uniprot.org/uniprotkb/A0A075B716/entry). While the former has features (structural information), the latter does not. So I can see why this is a problem. That said, I still am curious as to why my source has feature information for the protein A0A075B716, despite it not being ostensibly present on Uniprot.

uniprot python rest-api api-request • 2.0k views

You can obtain the proteome data for supported organisms from the UniProt FTP site. The .dat files in each proteome folder contain the data in UniProt flat-file format. You can parse out anything you need (features are located in the FT lines).

    FT   REGION          103..126
    FT                   /note="Disordered"
    FT                   /evidence="ECO:0000256|SAM:MobiDB-lite"
    FT   COMPBIAS        111..126
    FT                   /note="Basic residues"
    FT                   /evidence="ECO:0000256|SAM:MobiDB-lite"

UniProt flat file manual is available here: https://web.expasy.org/docs/userman.html
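
As a rough illustration of parsing the FT lines, here is a minimal Python sketch. It assumes the layout shown in the excerpt above (a header line with the feature key and location, followed by `/qualifier="value"` continuation lines). The function name `parse_ft_lines` and the output dictionary layout are made up for this example, and unusual locations (for example uncertain positions) are simply kept as strings.

```python
import re

# Header line of a feature: "FT", three spaces, the feature key, then the location.
FT_HEADER = re.compile(r"^FT   (\S+)\s+(\S+)$")


def _to_int(text):
    """Return an int for plain numeric positions, otherwise the original string."""
    return int(text) if text.isdigit() else text


def parse_ft_lines(lines):
    """Collect features (type, start, end, qualifiers) from flat-file FT lines."""
    features = []
    last_qualifier = None
    for raw in lines:
        line = raw.rstrip()
        header = FT_HEADER.match(line)
        if header:
            key, location = header.groups()
            start, _, end = location.partition("..")
            features.append({
                "type": key,
                "start": _to_int(start),
                "end": _to_int(end or start),  # single-position features
                "qualifiers": {},
            })
            last_qualifier = None
        elif line.startswith("FT") and features:
            text = line[2:].strip()
            if text.startswith("/") and "=" in text:
                name, value = text[1:].split("=", 1)
                features[-1]["qualifiers"][name] = value.strip('"')
                last_qualifier = name
            elif last_qualifier:
                # A qualifier value wrapped onto the next line.
                features[-1]["qualifiers"][last_qualifier] += " " + text.strip('"')
    return features


sample = """\
FT   REGION          103..126
FT                   /note="Disordered"
FT                   /evidence="ECO:0000256|SAM:MobiDB-lite"
FT   COMPBIAS        111..126
FT                   /note="Basic residues"
FT                   /evidence="ECO:0000256|SAM:MobiDB-lite"
"""
features = parse_ft_lines(sample.splitlines())
for feature in features:
    print(feature["type"], feature["start"], feature["end"], feature["qualifiers"].get("note"))
```

For a whole .dat file you would also need to track which entry (AC line) each feature belongs to; that part is left out here.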


Thanks GenoMax, but I am afraid that, while the Uniprot FTP site does indeed have the information I need, it is quite opaque as to how I can access the feature information there. Firstly, the site is organized into Eukaryota, Archaea, Bacteria, and Viruses. Within each of these, it is organized into sections labeled UP000000226, UP000000227, etc., which, from what I've researched, are different proteomes. But these proteome IDs are different from taxonomic IDs (https://www.uniprot.org/help/proteome_id), and it seems like there can be multiple proteomes per species. So I'm not sure how I can access the protein from the right species, or even how I can access the protein's data within these proteomes using the API.


What do you have in hand that you are trying to search with? Organism names/taxID? This README file contains a complete list of reference proteomes available so you could simply parse the UP* accessions you need and then get the relevant files.

UniProt support (tagging Elisabeth Gasteiger) stops by periodically and may have a different answer.


Right now I am just trying to search for proteins from humans (tax ID: 9606). I have had some luck modifying my API URL so that it gets the right protein by accession number and species (for example, looking for the protein A0A075B716 in humans would require: https://rest.uniprot.org/uniprotkb/search?&query=organism_id:9606&accession:A0A075B716). However, this gives me 25 entries, which is many more than I need. Perusing these entries shows that they have entirely different accession numbers than A0A075B716; they are completely different proteins from the one I requested. Interestingly, when I use the URL https://rest.uniprot.org/uniprotkb/search?&query=accession:A0A075B716&organism_id=9606, I get the right protein entry, although it has no feature information whatsoever.

I could use your recommended Uniprot FTP site, which has the human proteome of UP000005640, but all I want is the ability to make an API request that uses the protein's accession number and species of interest to retrieve the protein's sequence, some description information, and features. Downloading the entire proteome definitely seems like it would not be the most efficient approach, and as far as I can tell, I can't access the information in the proteome using the Uniprot API. Also, some of the proteins I have been trying to analyze (namely A0A075B716) are not even present in the UP000005640.dat file.


> for example, looking for the protein A0A075B716 in humans would require: https://rest.uniprot.org/uniprotkb/search?&query=organism_id:9606&accession:A0A075B716). However, this gives me 25 entries, which is many more than I need. Perusing these entries shows that they have entirely different accession numbers than A0A075B716; they are completely different proteins from the one I requested. Interestingly, when I use the URL https://rest.uniprot.org/uniprotkb/search?&query=accession:A0A075B716&organism_id=9606

Where did you get this query syntax from? It seems that the use of "&" is not correct here, and causes the second clause to be ignored. This explains why the first query would return all human entries (the first 25 of them), while the second one returns A0A075B716 and ignores the organism constraint.

BTW as accession numbers are unique, it is not necessary to include an additional organism constraint.
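
To illustrate the point about "&": the whole search expression has to travel inside the single `query` parameter, with clauses joined by `AND`, while "&" only separates URL parameters such as `format` and `size`. A minimal sketch, assuming the `search` endpoint and the field names (`accession`, `organism_id`) already used in this thread; the helper name `build_search_url` is made up for this example:

```python
from urllib.parse import urlencode

UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(clauses, fmt="json", size=500):
    """Join (field, value) pairs with AND and URL-encode them as one query parameter."""
    query = " AND ".join("({}:{})".format(field, value) for field, value in clauses)
    params = {"format": fmt, "query": query, "size": size}
    return UNIPROT_SEARCH_URL + "?" + urlencode(params)


url = build_search_url([("accession", "A0A075B716"), ("organism_id", "9606")])
print(url)
```

`urlencode` takes care of escaping the parentheses, colons, and spaces, so the clauses cannot be mistaken for separate URL parameters.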

In order to make sure that an API query returns what you expect, I would recommend that you start with an interactive query on the UniProt website, and once you are sure the results correspond to what you need, click on "Share", then on "Generate URL for API", select your format, click on "Generate URL for API" again and then submit. This will return the URL you can use in your program:

API URL using the streaming endpoint. This endpoint is resource-heavy but will return all requested results.

https://rest.uniprot.org/uniprotkb/stream?format=json&query=%28accession%3AA0A075B716%20AND%20taxonomy_id%3A9606%29

API URL using the search endpoint. This endpoint is lighter and returns chunks of 500 at a time and requires pagination.

https://rest.uniprot.org/uniprotkb/search?format=json&query=%28accession%3AA0A075B716%20AND%20taxonomy_id%3A9606%29&size=500
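
For the paginated search endpoint, a hedged sketch of how the pages could be walked in Python. It assumes that each response carries a `Link` header whose `rel="next"` target is the URL of the next page, and that the JSON body lists entries under a `results` key; please check both against the current UniProt API documentation. The function names are made up for this example.

```python
import json
import re
import urllib.request

# Extracts the target of a header such as: <https://...>; rel="next"
NEXT_LINK = re.compile(r'<([^>]+)>;\s*rel="next"')


def next_page_url(link_header):
    """Return the next-page URL from a Link header, or None when there is none."""
    if not link_header:
        return None
    match = NEXT_LINK.search(link_header)
    return match.group(1) if match else None


def iter_results(url):
    """Yield entries from every page of a search query (performs network requests)."""
    while url:
        with urllib.request.urlopen(url) as response:
            payload = json.load(response)
            link = response.headers.get("Link")
        yield from payload.get("results", [])
        url = next_page_url(link)
```

Usage would look like `for entry in iter_results(search_url): ...`, where `search_url` is a URL such as the search one above.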
17 months ago

> Ok, so it seems like I was looking at feature information from this protein (https://www.uniprot.org/uniprotkb/P08708/entry), which is an isoform of the protein of interest (https://www.uniprot.org/uniprotkb/A0A075B716/entry). While the former has features (structural information), the latter does not. So I can see why this is a problem. That said, I still am curious as to why my source has feature information for the protein A0A075B716, despite it not being ostensibly present on Uniprot.

I think you obtained some answers from my colleague at the UniProt helpdesk.

A0A075B716 is an unreviewed, automatically generated/annotated entry (it is in UniProtKB/TrEMBL instead of the reviewed section UniProtKB/Swiss-Prot), which explains why it does not have the same level of feature annotations (even no features for this particular entry) as the corresponding reviewed entry P08708.

If you could share the feature annotation you have in your source for A0A075B716, we can try to investigate. The entry history for A0A075B716 is available at https://www.uniprot.org/uniprotkb/A0A075B716/history - I tried to quickly browse through it but was unable to find any features in previous versions of the entry.

If you are working with human entries and are interested in feature annotations, it is probably a good idea to work with reviewed entries instead. Our biocurators are trying their best to be up-to-date with the annotation of characterized human proteins. If you do have TrEMBL entries, you can indeed use the isoform mapping and try to find the canonical sequence that corresponds to your isoform, as explained by my colleague.
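
As a sketch of how features might be pulled out of an entry's JSON (for example the reviewed entry P08708): this assumes each item in the entry's `features` list has a `type`, an optional `description`, and a `location` with `start`/`end` objects holding a `value`, and that the type labels look like "Domain", "Region", "Helix", "Beta strand", and "Turn". Those keys and labels are my reading of the JSON output and should be checked against a real response. The sample entry below is invented purely to show the shape; it is not real UniProt data.

```python
import json
import urllib.request

# Assumed feature type labels for regions, domains, and secondary structure.
WANTED_TYPES = {"Domain", "Region", "Helix", "Beta strand", "Turn"}


def fetch_entry(accession):
    """Download one UniProtKB entry as JSON (performs a network request)."""
    url = "https://rest.uniprot.org/uniprotkb/{}.json".format(accession)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def extract_features(entry, wanted_types=WANTED_TYPES):
    """Return one row per wanted feature, with integer start/end positions."""
    rows = []
    for feature in entry.get("features", []):
        if feature.get("type") not in wanted_types:
            continue
        location = feature.get("location", {})
        rows.append({
            "accession": entry.get("primaryAccession"),
            "type": feature["type"],
            "start": location.get("start", {}).get("value"),
            "end": location.get("end", {}).get("value"),
            "description": feature.get("description", ""),
        })
    return rows


# Hypothetical entry, made up for illustration only.
sample_entry = {
    "primaryAccession": "X00000",
    "features": [
        {"type": "Region", "description": "Disordered",
         "location": {"start": {"value": 10}, "end": {"value": 30}}},
        {"type": "Modified residue", "description": "Example",
         "location": {"start": {"value": 5}, "end": {"value": 5}}},
    ],
}
rows = extract_features(sample_entry)
print(rows)
```

With network access, `extract_features(fetch_entry("P08708"))` would be the real call; an unreviewed entry without features simply yields an empty list.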

> Also, some of the proteins I have been trying to analyze (namely A0A075B716) are not even present in the UP000005640.dat file.

The reason why A0A075B716 is not present in UP000005640.dat is that UP000005640.dat only contains the canonical entries. It is however present in the UP000005640_9606_additional.fasta.gz (and the .dat version):

    zcat UP000005640_9606_additional.fasta.gz | grep A0A075B716
    >tr|A0A075B716|A0A075B716_HUMAN Isoform of P08708, 40S ribosomal protein S17 OS=Homo sapiens OX=9606 GN=RPS17 PE=1 SV=1

See also: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/README for the organization of these files.
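
If you prefer to do the same lookup from Python, here is a rough equivalent of the zcat/grep command above. It assumes the file has already been downloaded locally; the function name `find_fasta_headers` is made up for this example. Unlike plain grep, it only looks at header lines and matches the accession between the "|" separators.

```python
import gzip


def find_fasta_headers(fasta_gz_path, accession):
    """Return the FASTA header lines in a gzipped file that carry the accession."""
    needle = "|{}|".format(accession)
    with gzip.open(fasta_gz_path, "rt") as handle:
        return [
            line.rstrip("\n")
            for line in handle
            if line.startswith(">") and needle in line
        ]


# Example call, assuming the file sits in the current directory:
# find_fasta_headers("UP000005640_9606_additional.fasta.gz", "A0A075B716")
```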

NB It would make sense to use integer values for the feature start and end positions instead of keeping the ".0" appended to them.
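
A small sketch of that conversion in plain Python, assuming the positions arrive as floats or strings such as "103.0" (which typically happens when a numeric column contains missing values). The helper name `to_int_position` is made up for this example.

```python
import math


def to_int_position(value):
    """Convert values like '103.0' or 103.0 to 103; return None for missing values."""
    if value is None or value == "":
        return None
    number = float(value)
    if math.isnan(number):
        return None
    if not number.is_integer():
        raise ValueError("not a whole-number position: {!r}".format(value))
    return int(number)


print(to_int_position("103.0"), to_int_position(126.0))
```

If the table lives in a pandas dataframe, casting the column to the nullable "Int64" dtype is another option worth trying, since it keeps missing values without falling back to floats.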


Here is what I have meanwhile obtained from one of my biocurator colleagues about the longer isoform A0A075B716 vs the shorter canonical sequence P08708 - which is what you were asking for in your message to the UniProt helpdesk:

The sequence found in entry A0A075B716 is an Ensembl prediction. The RPS17 sequence reported in UniProtKB/Swiss-Prot entry P08708 is 135 amino acids long and is well conserved across numerous species. It is encoded by 5 exons. In their transcript ENST00000558397.1, Ensembl predicts a 6th exon (located between exons 4 and 5 of the common 135-amino-acid isoform). However, Ensembl tags this isoform as undergoing "Non stop decay". Non stop decay is the mechanism that identifies and disposes of aberrant transcripts that lack in-frame stop codons. This isoform may not exist in vivo, so we will not annotate it in UniProtKB/Swiss-Prot unless additional evidence for its existence becomes available.
