Downloading dataset of PTM sites from UniProt
7.5 years ago
rshipman ▴ 30

Hello,

I am currently looking to put together a data set of post-translational modification (PTM) sites from UniProt. I would like to download this data from the website and store it in a text or CSV file. The information in question is enclosed in a red box in the image below.

[Image: data in question]

I am currently working in R with the UniProt.ws package and am having a hard time pinning down where in the package something along these lines can be done. Maybe another package or language is better suited for this job; I am not sure.

What would be the best option here? Is it possible to pull this data down with an R script? I do not want to copy and paste all of these sites for each protein in question. I basically only want the information under PTM / Processing from UniProt.

Any help would be great.

Edit (for user me or those interested):

Thank you, user me, this is what I was looking for; I just need some help with which data is pulled and how it is displayed. I have never used this software before, so it is new to me. Do you know of any tutorials that are directly related to using SPARQL with UniProt? It looks like quite a useful language.

So this looks good, but I am missing some information, mainly the glycosylation sites. I would like to pull the information shown in the image below: all the PTMs that were pulled plus the glyco sites. I am not sure why they did not get pulled with this script. Example: N-Linked (........), which I believe would fall into the "text" column.

[Image: ptm+glycosites]

The script you provided gives what I need, but I need a bit more. The next image shows what I am hoping for in the final data set. I would also like the protein entry and name, if possible. I tried playing with the code but could not see how it all works.

[Image: Wanted Dataset Layout]

Again, thank you so much for the help! Your write-up has been great, and I would appreciate any resources you can point me to. This tool is amazing! :)

R PTM uniprot • 3.7k views
7.5 years ago
me ▴ 760

A SPARQL query that gets most of the data

While the different REST services at UniProt are excellent, they get cumbersome when you look at our data in an annotation-centric way instead of an entry-specific way. I suggest that you use this style of SPARQL query instead, at http://sparql.uniprot.org.

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX faldo:<http://biohackathon.org/resource/faldo#>
SELECT
       (SUBSTR(STR(?protein), 33) AS ?primaryAccession)
       (SUBSTR(STR(?sequence), 34) AS ?sequenceAccession)
       ?name
       ?begin
       ?text
       (SUBSTR(STR(?evidence), 32) AS ?eco)
       ?source
WHERE
{
  ?protein a up:Protein ;
           up:organism taxon:9606 ;  # change the taxid for a non-human organism, or delete this line for all organisms
           up:annotation ?annotation ;
           rdfs:label ?name .  # this comes from the UniRef graph but is just what we need
  VALUES ?annotationType {
    up:Glycosylation_Annotation
    up:Modified_Residue_Annotation
    # add any type of annotation as documented at http://www.uniprot.org/core/
  }
  ?annotation a ?annotationType ;
              rdfs:comment ?text ;
              up:range/faldo:begin
                [ faldo:position ?begin ;
                  faldo:reference ?sequence ] .
  OPTIONAL {
    [] rdf:object ?annotation ;
       up:attribution ?attribution .
    ?attribution up:evidence ?evidence .
    OPTIONAL {
      ?attribution up:source ?source
    }
  }
}

I added the evidence code in case you are thinking of training some algorithm.

The query selects the glycosylation and modified residue annotations; if you need more, please edit your question and I will adapt this answer.

You can use the R SPARQL package to do most of the heavy lifting when it comes to parsing the output. The URIs in the output can be shortened to just accessions and ECO codes, either in the query or, of course, in your R code.
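For readers not working in R, the same client-side step can be sketched in Python. This is only a rough illustration under stated assumptions: the endpoint URL `https://sparql.uniprot.org/sparql`, the CSV `Accept` header, the helper names (`fetch_csv`, `local_name`, `parse_rows`) and the sample row with the placeholder accession `P00000` are all made up for this example and do not come from the thread.

```python
import csv
import io
import urllib.parse
import urllib.request

# Assumed endpoint URL; the thread only gives http://sparql.uniprot.org.
ENDPOINT = "https://sparql.uniprot.org/sparql"


def fetch_csv(query: str, endpoint: str = ENDPOINT) -> str:
    """Send a SPARQL query over HTTP GET and return the CSV response body."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(url, headers={"Accept": "text/csv"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def local_name(uri: str) -> str:
    """Keep only the part of a URI after the last '/' or '#'."""
    return uri.replace("#", "/").rstrip("/").rsplit("/", 1)[-1]


def parse_rows(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV results and shorten every value that looks like a URI."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            key: local_name(value) if value and value.startswith("http") else value
            for key, value in row.items()
        })
    return rows


# Hand-written sample in the shape a result might take (not real output).
sample = (
    "protein,begin,text,evidence\r\n"
    "http://purl.uniprot.org/uniprot/P00000,42,Phosphoserine,"
    "http://purl.obolibrary.org/obo/ECO_0000269\r\n"
)
print(parse_rows(sample))
```

Calling `parse_rows(fetch_csv(query))` would run a real query. Shortening by the last path segment avoids hard-coding the character offsets that the `SUBSTR` calls in the query rely on.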

Selecting different "types" of annotation

Each type of annotation is given its own class, selected via the predicate "a" (shorthand for rdf:type) in the SPARQL query.

?x a up:Modified_Residue_Annotation .

or

?y a up:Glycosylation_Annotation .

In the query above they are listed in the VALUES clause.

SPARQL 1.1. allows

Protein names: which one to select

Protein names as recorded in UniProt are tricky. There are different names grouped in interesting ways. What you are looking for are the submitted and recommended names, with a preference for the recommended name in the case of a Swiss-Prot entry. There are a number of name types, but the fullName one is most likely the one you want.

As there is at most one recommendedName, that one is easy to get into a query.

OPTIONAL {
    ?protein up:recommendedName/up:fullName ?name .  
 }

Then add ?name to the list of things you want to SELECT. However, any entry can have lots of submittedNames, so that is more complicated.

It can be done with a subquery.

  OPTIONAL {
     FILTER(!BOUND(?name))
     {
            SELECT ?protein
                 (GROUP_CONCAT(?fName; separator=', ') as ?name) 
            WHERE{
                 ?protein up:submittedName/up:fullName ?fName .
            } GROUP BY ?protein
      }
   }

This needs to come after the previous OPTIONAL. However, adding it craters the performance of the query, so the question is whether this information is worth it. The third option is to use ?protein rdfs:label ?name . This comes from the UniRef graph, which has it as a shortcut to be able to regenerate the UniRefXML.

Tutorials and further info

For SPARQL in general I recommend the book Learning SPARQL by Bob DuCharme, but you can also follow a tutorial I have given in collaboration with neXtProt, for which you can find the materials in this repository. There are also a bunch of videos on YouTube about why we provide a SPARQL endpoint for UniProt.

Further details will need to go into a different answer as I am at max answer length :(

All of this has been a great deal of help! I am very close to what I am looking for. Your videos are great, and thank you for the resources for tackling this project. I am still very new to this query language, but it is getting better each day. If it works for you, I have one tweak to add to this script but am unsure how to list things the way I need.

It involves the evidence section. The SPARQL URL is great, but is it possible to also display the type of evidence as text, such as Publication, By similarity, UniRule annotation, or Imported? If this can be done in addition to what I have, pasted below, that would be great. I believe that is one of the last pieces of the puzzle on this front.

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX faldo:<http://biohackathon.org/resource/faldo#>
SELECT 
       ?name
       (SUBSTR(STR(?protein), 33) AS ?primaryAccession)
       (SUBSTR(STR(?sequence), 34) AS ?sequenceAccession) 
       ?begin 
       ?text 
       ?annotation
       ?evidence
       (SUBSTR(STR(?evidence), 32) AS ?eco) #use for machine learning
WHERE
{
  ?protein a up:Protein ;
         up:organism taxon:9606 ;  #change the taxid if interested in non human or delete if interested in all
         up:annotation ?annotation .
  VALUES ?annotationType {
       up:Glycosylation_Annotation 
       up:Modified_Residue_Annotation 
       #add any type of annotation as documented at http://www.uniprot.org/core/
  }
  ?annotation a ?annotationType;
            rdfs:comment ?text ;
            up:range/faldo:begin
            [ faldo:position ?begin ;
                             faldo:reference ?sequence ] .
OPTIONAL {
    [] rdf:object ?annotation ; 
                  up:attribution/up:evidence ?evidence .
      ?protein up:recommendedName/up:fullName ?name . 
} }

This should be in two OPTIONAL blocks

OPTIONAL {
    [] rdf:object ?annotation ; 
                  up:attribution/up:evidence ?evidence .
 }

And this one

OPTIONAL {
  ?protein up:recommendedName/up:fullName ?name . 
}

If they are in one block, then both patterns must match for either to be returned, whereas in the data each of them is independently present or absent.
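The difference can be seen in a small Python sketch that mimics how OPTIONAL behaves like a left join. The helper `optional`, the identifiers `p1`/`a1`, and the data are hypothetical and made up for illustration; this is not how a SPARQL engine is actually implemented.

```python
def optional(solutions, extend):
    """Mimic SPARQL OPTIONAL: keep every solution, extended when `extend` matches.

    `extend` returns a dict of extra bindings, or None when the pattern
    inside the OPTIONAL block does not match for that solution.
    """
    result = []
    for solution in solutions:
        extra = extend(solution)
        result.append({**solution, **extra} if extra is not None else dict(solution))
    return result


# Made-up data: annotation a1 has evidence but protein p1 has no recommended
# name; protein p2 has a recommended name but annotation a2 has no evidence.
evidence = {"a1": "ECO_0000269"}
names = {"p2": "Example protein"}
solutions = [
    {"protein": "p1", "annotation": "a1"},
    {"protein": "p2", "annotation": "a2"},
]


def evidence_only(s):
    a = s["annotation"]
    return {"evidence": evidence[a]} if a in evidence else None


def name_only(s):
    p = s["protein"]
    return {"name": names[p]} if p in names else None


def both(s):
    # One OPTIONAL block holding both patterns: all of it must match.
    e, n = evidence_only(s), name_only(s)
    return {**e, **n} if e is not None and n is not None else None


one_block = optional(solutions, both)
two_blocks = optional(optional(solutions, evidence_only), name_only)
print(one_block)   # neither row gains a binding
print(two_blocks)  # each row keeps whichever binding exists
```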

I edited my question in response to your answer. Thank you for the help; I have a couple more questions in the edit.

7.4 years ago
me ▴ 760

Evidence

OPTIONAL {
    [] rdf:object ?annotation ; 
                  up:attribution ?attribution . 
        ?attribution up:evidence ?evidence .
        OPTIONAL {
            ?attribution up:source ?source
        }
    }
}

Not every evidence has a source, but when one does, the two are related via the up:source predicate.

Getting the labels for the ECO codes can be done via a federated query to the EBI RDF platform, which has the full ECO ontology in its OLS part.

SERVICE<https://www.ebi.ac.uk/rdf/services/sparql>{
   ?evidence rdfs:label ?evidenceLabel .
}

Unfortunately when combining it with the above query we run into a bug in the SPARQL engine that we use :(

However, I can hack around it for you by adding this prefix at the top of the query

PREFIX ECO:<http://purl.obolibrary.org/obo/ECO_0000>

and then putting this at the end of the query.

VALUES (?evidenceCode ?evidenceLabel)
{
    (ECO:269 "Inferred from experiment")
    (ECO:314 "Inferred from direct assay")
    (ECO:353 "Inferred from physical interaction")
    (ECO:315 "Inferred from mutant phenotype")
    (ECO:316 "Inferred from genetic interaction")
    (ECO:270 "Inferred from expression pattern")
    (ECO:250 "Inferred from sequence or structural similarity")
    (ECO:266 "Inferred from sequence orthology")
    (ECO:247 "Inferred from sequence alignment")
    (ECO:255 "Inferred from sequence model")
    (ECO:317 "Inferred from genomic context")
    (ECO:318 "Inferred from biological aspect of ancestor")
    (ECO:319 "Inferred from biological aspect of descendant")
    (ECO:320 "Inferred from key residues")
    (ECO:321 "Inferred from rapid divergence")
    (ECO:245 "Inferred from reviewed computational analysis")
    (ECO:304 "Traceable author statement")
    (ECO:303 "Non-traceable author statement")
    (ECO:305 "Inferred by curator")
    (ECO:307 "No biological data available")
    (ECO:501 "Inferred from electronic annotation")
    (ECO:312 "Manually imported")
    (ECO:313 "Automatically imported")
    (ECO:256 "Automatically inferred from sequence model")
    (ECO:244 "Combinatorial evidence used in manual assertion")
    (ECO:213 "Combinatorial evidence used in automatic assertion")
    (ECO:260 "Match to InterPro member signature evidence used in manual assertion")
    (ECO:259 "Match to InterPro member signature evidence used in automatic assertion")
    }
     FILTER(sameTerm(?evidenceCode, ?evidence))
  }

This basically builds a temp table inside the query and matches the labels as in use inside the UniProt.org website code base (that is where I got the list from ;)
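The same lookup can also be done on the client after downloading the results. Below is a minimal Python sketch, assuming each result row is a dict whose `eco` column holds the shortened code (as produced by the `?eco` expression in the query); the helper name `add_evidence_label` and the sample row with the placeholder accession `P00000` are hypothetical.

```python
# A few code/label pairs copied from the VALUES list above; extend as needed.
ECO_LABELS = {
    "ECO_0000269": "Inferred from experiment",
    "ECO_0000250": "Inferred from sequence or structural similarity",
    "ECO_0000255": "Inferred from sequence model",
    "ECO_0000312": "Manually imported",
    "ECO_0000313": "Automatically imported",
}


def add_evidence_label(row: dict) -> dict:
    """Return a copy of `row` with `evidenceLabel` looked up from its `eco` code."""
    labelled = dict(row)
    labelled["evidenceLabel"] = ECO_LABELS.get(row.get("eco", ""), "")
    return labelled


# Hypothetical result row, not real query output.
row = {"primaryAccession": "P00000", "begin": "42", "eco": "ECO_0000269"}
print(add_evidence_label(row)["evidenceLabel"])
```

Unlike the FILTER in the query, which drops rows whose evidence code is not in the list, this lookup keeps such rows and leaves the label empty.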

THANK YOU SO MUCH! This was a lot of help and a view into a world of tools I was unaware of. Pretty much self taught in this field so this site and you have been awesome! :) Everything should work out, I now have a data set to continue forward with this project. Really appreciate this.

Just upvote and accept the answers ;) that helps me and the site :) also have a look at my EBI friends work at http://www.ebi.ac.uk/rdf/
