Gene Ontology & Interpro
9.2 years ago
stackf01 ▴ 20

Hello guys. How do I download the complete data sets for protein entries that include GO information (Biological Process, Molecular Function, Cellular Component)? I want to download all of these data sets and load them into a MySQL database.

My second question: how do I download the complete data sets from InterPro (domains), including the fields for superfamily, family, and subfamily? Which file should I download?

Please help. Thanks & Regards

gene interpro • 2.4k views
9.2 years ago

You could download the XML version of UniProt from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz

and transform it with the following XSLT stylesheet:

For example, with only one entry:

$ rm -f tmp.sqlite3 && curl "http://www.uniprot.org/uniprot/O35516.xml" | xsltproc uniprot2sqlite.xsl - | sqlite3 tmp.sqlite3 && sqlite3 tmp.sqlite3 'select * from entry; select * from entry2go;'

1|O35516|NOTC2_MOUSE

1|GO:0009986
1|GO:0005929
1|GO:0005829
1|GO:0005576
1|GO:0005887
1|GO:0016020
1|GO:0005654
1|GO:0005634
1|GO:0005886
1|GO:0043235
1|GO:0005509
1|GO:0019899
1|GO:0051059
1|GO:0060413
1|GO:0046849
1|GO:0007050
1|GO:0001709
1|GO:0016049
1|GO:1990705
1|GO:0061073
1|GO:0042742
1|GO:0007368
1|GO:0030326
1|GO:0072104
1|GO:0072015
1|GO:0001947
1|GO:0072574
1|GO:0006959
1|GO:0001701
1|GO:0002437
1|GO:0072602
1|GO:0035622
1|GO:0070986
1|GO:0001889
1|GO:0072576
1|GO:0002011
1|GO:0035264
1|GO:0043011
1|GO:0008285
1|GO:0000122
1|GO:0007219
1|GO:0009887
1|GO:0060674
1|GO:0001890
1|GO:0043065
1|GO:0030513
1|GO:0008284
1|GO:0045672
1|GO:0046579
1|GO:0072014
1|GO:0003184
1|GO:0006357
1|GO:0006351
1|GO:0042060
9.2 years ago
me ▴ 760

You can download this information directly using the UniProt web service at www.uniprot.org.

Use the "Customize columns" button to select which columns you want to download.

e.g. http://www.uniprot.org/uniprot/?query=&columns=id,go(biological%20process),go(molecular%20function),go(cellular%20component)

Then select a tab- or comma-separated download (select compressed as well for best results).

You might want to write a script that uses offset and limit to page through the results, as this will generate largish files.
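A minimal sketch of such a paging script, in Python. It only builds the page URLs; the parameter names `limit` and `offset` come from the sentence above, while `format=tab`, the function name `page_urls`, and the `total` row count are assumptions you would need to check against the service.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://www.uniprot.org/uniprot/"  # URL from the example above
COLUMNS = (
    "id,go(biological process),go(molecular function),go(cellular component)"
)


def page_urls(total, page_size, query="", columns=COLUMNS):
    """Yield one download URL per page of `page_size` rows.

    `total` is the expected number of rows (hypothetical: you would take it
    from the result count shown on the website).
    """
    for offset in range(0, total, page_size):
        params = {
            "query": query,
            "columns": columns,
            "format": "tab",  # assumed name of the tab-separated format
            "limit": page_size,
            "offset": offset,
        }
        # quote_via=quote encodes spaces as %20, as in the example URL;
        # safe="()," leaves the column list readable.
        yield BASE_URL + "?" + urlencode(params, quote_via=quote, safe="(),")


urls = list(page_urls(total=2500, page_size=1000))
for url in urls:
    print(url)
```

Each URL can then be fetched in turn (for example with `urllib.request.urlopen`) and appended to one output file, skipping the header line on every page after the first.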

Unlike the answer using XML from FTP, this will give all current Gene Ontology annotations, not just those made by the UniProt consortium at the time of the UniProt release, i.e. it can contain a bit more information than the XML file.

For the InterPro part, see http://www.ebi.ac.uk/interpro/download.html, specifically the "Entry relationships tree" download.
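To get parent/child pairs out of that tree file for a database table, something like the following Python sketch may help. It assumes the file lists one entry per line as `ACCESSION::name::`, with two leading dashes per nesting level; please check this against the file you actually download. The function name `parse_tree` and the accessions in the sample are made up for illustration.

```python
def parse_tree(lines):
    """Yield (accession, parent_accession) pairs; parent is None for roots.

    Assumed layout: 'ACC::name::' per line, prefixed by '--' per depth level.
    """
    stack = []  # stack[d] holds the last accession seen at depth d
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        stripped = line.lstrip("-")
        depth = (len(line) - len(stripped)) // 2
        accession = stripped.split("::")[0]
        del stack[depth:]  # drop entries at this depth or deeper
        parent = stack[-1] if stack else None
        stack.append(accession)
        yield accession, parent


# Made-up sample in the assumed layout (not real InterPro relationships).
SAMPLE_TREE = """\
IPR_A::top-level family::
--IPR_B::family under A::
----IPR_C::subfamily under B::
--IPR_D::another family under A::
IPR_E::unrelated top-level entry::
"""

pairs = list(parse_tree(SAMPLE_TREE.splitlines()))
for accession, parent in pairs:
    print(accession, parent)
```

The resulting pairs map directly onto a two-column table (child, parent), from which the family and subfamily levels can be reconstructed by walking up to the root.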

For the UniProt data, how do I know which term is the parent node in the ontology?
