Gene Ontology & InterPro

10.1 years ago

stackf01 ▴ 20

Hello guys. How do I download the complete data sets for protein entries, including their GO annotations (Biological Process, Molecular Function, Cellular Component)? I want to download all of these data sets and load them into a MySQL database.

My second question: how do I download the complete data sets from InterPro (domains), with the fields for superfamily, family, and subfamily? Which file should I download there?

Please help. Thanks and regards.

gene interpro • 2.8k views

updated 3.2 years ago by Ram 45k • written 10.1 years ago by stackf01 ▴ 20

10.1 years ago

Pierre Lindenbaum 166k

You could download the XML version of UniProt (the Swiss-Prot file): ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz

and transform it with the following XSLT stylesheet, which emits SQL statements for SQLite:

	<?xml version='1.0' encoding="UTF-8" ?>
	<xsl:stylesheet
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:u="http://uniprot.org/uniprot"
	version='1.0'>

	<xsl:output method="text" />


	<!-- Root template: emit the schema once, then all inserts inside a single transaction. -->
	<xsl:template match="/">

	create table if not exists entry
	(
	id INTEGER PRIMARY KEY,
	accession varchar(50),
	name varchar(50)
	);
	create unique index if not exists entryacn on entry(accession);

	create table if not exists entry2go
	(
	entry_id INTEGER not null,
	term varchar(50) not null,
	FOREIGN KEY(entry_id) REFERENCES entry(id)
	);
	BEGIN TRANSACTION;
	<xsl:apply-templates select="u:uniprot/u:entry"/>
	COMMIT;
	</xsl:template>

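	<!-- One row per UniProt entry, using its first accession and first name, followed by its GO cross-references. -->
	<xsl:template match="u:entry">insert into entry(accession,name) values('<xsl:value-of select="u:accession[1]/text()"/>','<xsl:value-of select="u:name[1]/text()"/>');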
	<xsl:apply-templates select="u:dbReference[@id and @type='GO']"/>
	</xsl:template>

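	<!-- max(id) is the entry row inserted just before, so each GO term is linked to the current entry. -->
	<xsl:template match="u:dbReference">insert into entry2go(entry_id,term) select max(id), '<xsl:value-of select="@id"/>' from entry;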
	</xsl:template>

	</xsl:stylesheet>

With the stylesheet saved as uniprot2sqlite.xsl, it can be tried on a single entry, for example:

$ rm -f tmp.sqlite3 && curl "http://www.uniprot.org/uniprot/O35516.xml" | xsltproc uniprot2sqlite.xsl - | sqlite3 tmp.sqlite3 && sqlite3 tmp.sqlite3 'select * from entry; select * from entry2go;'

1|O35516|NOTC2_MOUSE

1|GO:0009986
1|GO:0005929
1|GO:0005829
1|GO:0005576
1|GO:0005887
1|GO:0016020
1|GO:0005654
1|GO:0005634
1|GO:0005886
1|GO:0043235
1|GO:0005509
1|GO:0019899
1|GO:0051059
1|GO:0060413
1|GO:0046849
1|GO:0007050
1|GO:0001709
1|GO:0016049
1|GO:1990705
1|GO:0061073
1|GO:0042742
1|GO:0007368
1|GO:0030326
1|GO:0072104
1|GO:0072015
1|GO:0001947
1|GO:0072574
1|GO:0006959
1|GO:0001701
1|GO:0002437
1|GO:0072602
1|GO:0035622
1|GO:0070986
1|GO:0001889
1|GO:0072576
1|GO:0002011
1|GO:0035264
1|GO:0043011
1|GO:0008285
1|GO:0000122
1|GO:0007219
1|GO:0009887
1|GO:0060674
1|GO:0001890
1|GO:0043065
1|GO:0030513
1|GO:0008284
1|GO:0045672
1|GO:0046579
1|GO:0072014
1|GO:0003184
1|GO:0006357
1|GO:0006351
1|GO:0042060
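
For the complete uniprot_sprot.xml.gz, a streaming parser may be easier on memory than xsltproc, which builds the whole document tree before transforming it. Below is a minimal Python sketch of that idea, reusing the entry and entry2go tables from the stylesheet. The function name `load_go_terms` is hypothetical, and the sketch assumes the same element layout the stylesheet relies on (accession, name and dbReference as direct children of entry).

```python
import sqlite3
import xml.etree.ElementTree as ET

NS = "{http://uniprot.org/uniprot}"  # same namespace the stylesheet binds to "u"

def load_go_terms(xml_source, conn):
    """Stream UniProt XML (path or binary file object) into entry / entry2go."""
    conn.executescript("""
        create table if not exists entry(
            id integer primary key, accession varchar(50), name varchar(50));
        create table if not exists entry2go(
            entry_id integer not null, term varchar(50) not null,
            foreign key(entry_id) references entry(id));
    """)
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != NS + "entry":
            continue
        # findtext returns the first direct child, like u:accession[1] in the XSLT
        accession = elem.findtext(NS + "accession")
        name = elem.findtext(NS + "name")
        cur = conn.execute(
            "insert into entry(accession, name) values (?, ?)", (accession, name))
        terms = [(cur.lastrowid, ref.get("id"))
                 for ref in elem.findall(NS + "dbReference")
                 if ref.get("type") == "GO" and ref.get("id")]
        conn.executemany(
            "insert into entry2go(entry_id, term) values (?, ?)", terms)
        elem.clear()  # drop the finished entry's children to keep memory low
    conn.commit()
```

Something like `load_go_terms(gzip.open("uniprot_sprot.xml.gz", "rb"), sqlite3.connect("uniprot.sqlite3"))` should run it on the downloaded file. For MySQL, the same inserts could probably go through a MySQL driver instead, with the placeholders adapted.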

updated 6.0 years ago by Ram 45k • written 10.1 years ago by Pierre Lindenbaum 166k

10.1 years ago

me ▴ 760

You can download this information directly using the UniProt web service at www.uniprot.org.

Use the customize columns button to select which columns you want to download.

e.g. http://www.uniprot.org/uniprot/?query=&columns=id,go(biological%20process),go(molecular%20function),go(cellular%20component)

Then select a tab- or comma-separated download (select compressed as well for best results).
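
To get from the tab-separated download to rows that can be inserted into a database, the GO columns need to be split into individual terms. The sketch below assumes each GO cell is a list of items like `term name [GO:0000000]`, which is an assumption about the download format and not something stated above. The function name `go_rows` is hypothetical.

```python
import csv
import io
import re

GO_ID = re.compile(r"\[(GO:\d{7})\]")  # assumed cell format: "term name [GO:0007219]"

def go_rows(tsv_text):
    """Yield (accession, column_name, go_id) for each GO id in a tab-separated download."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        accession = row[reader.fieldnames[0]]  # first column is taken as the entry id
        for column in reader.fieldnames[1:]:
            for go_id in GO_ID.findall(row[column] or ""):
                yield accession, column, go_id
```

Keeping the column name in each row preserves which of the three ontologies (biological process, molecular function, cellular component) a term came from.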

You might want to write a script that uses offset and limit to page through the results, as the full download will be a largish file.
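
A paging script could start from something like the sketch below, which only builds the page URLs. It assumes that `offset`, `limit` and `format=tab` are accepted as query parameters alongside `query` and `columns` from the example URL; only offset and limit are named in this answer, so check the parameter names against the service before relying on them. `page_urls` and its `total` argument (the number of results to fetch) are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.uniprot.org/uniprot/"
COLUMNS = "id,go(biological process),go(molecular function),go(cellular component)"

def page_urls(total, page_size, query=""):
    """Yield one download URL per page of page_size results."""
    for offset in range(0, total, page_size):
        params = {
            "query": query,
            "columns": COLUMNS,
            "format": "tab",     # assumed parameter for tab-separated output
            "limit": page_size,  # assumed parameter: rows per page
            "offset": offset,    # assumed parameter: index of the first row
        }
        yield BASE_URL + "?" + urlencode(params)
```

Each URL can then be fetched in turn (for example with curl or urllib.request) and the pages concatenated, skipping the header line on every page after the first.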

Unlike the answer using XML from FTP, this gives all current Gene Ontology annotations, not just those made by the UniProt consortium at the time of the UniProt release, so it can contain a bit more information than the XML file.

For the InterPro part, see http://www.ebi.ac.uk/interpro/download.html (specifically the "Entry relationships tree" download).
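
To load the relationships tree into a database, each entry needs to be paired with its parent. The sketch below assumes the file has one entry per line in the form `accession::name::`, with two leading dashes per level of nesting. That layout is an assumption to verify against the downloaded file, and `parse_tree` is a hypothetical name.

```python
def parse_tree(lines):
    """Return (accession, name, parent_accession_or_None) for each line of the tree."""
    stack = []  # stack[d] holds the most recent accession seen at depth d
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        stripped = line.lstrip("-")
        depth = (len(line) - len(stripped)) // 2  # assumed: two dashes per level
        accession, name = stripped.split("::")[:2]
        del stack[depth:]  # forget siblings and deeper levels
        parent = stack[-1] if stack else None
        stack.append(accession)
        rows.append((accession, name, parent))
    return rows
```

The resulting rows fit a single self-referencing table (accession, name, parent), from which family and sub-family levels can be recovered by following the parent column.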

updated 6.0 years ago by Ram 45k • written 10.1 years ago by me ▴ 760