Extracting Sub-cellular location from Uniprot into tabular format
6.4 years ago
Michael 55k

Hi, here's a question that seems trickier to solve than it first looks. I am trying to convert SwissProt accessions into a tabular format for import into SQL, containing the "best bet" sub-cellular localization of each protein (one row per (accession, location) pair):

Accession Location Evidence
Q9YH95    Nucleus  Manual

This is just how it appears in the picture on the HTML page: http://www.uniprot.org/uniprot/Q9YH95 Parsing the XML format would be easy. http://www.uniprot.org/uniprot/Q9YH95.xml contains:

<comment type="subcellular location">
  <subcellularLocation>
     <location evidence="1 3">Nucleus</location>
  </subcellularLocation>
</comment>

Edit: Should be nicely solved using this XSLT by Pierre: How to map sub-cellular localisation to enteries in uniprot database fasta file.
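
As an alternative to XSLT, the fragment above can also be read with Python's standard library. This is only a minimal sketch: the sample document is a trimmed stand-in for the real entry, the function name `subcellular_rows` is made up for illustration, and the namespace URI is an assumption that should be checked against the root element of the file you actually download.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: namespace declared on the <uniprot> root element.
NS = {"up": "http://uniprot.org/uniprot"}

# Trimmed-down stand-in for http://www.uniprot.org/uniprot/Q9YH95.xml
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>Q9YH95</accession>
    <comment type="subcellular location">
      <subcellularLocation>
        <location evidence="1 3">Nucleus</location>
      </subcellularLocation>
    </comment>
  </entry>
</uniprot>"""


def subcellular_rows(xml_text):
    """Yield (accession, location, evidence_keys) for each annotated location."""
    root = ET.fromstring(xml_text)
    for entry in root.findall("up:entry", NS):
        # The first <accession> is the primary accession.
        accession = entry.findtext("up:accession", namespaces=NS)
        for comment in entry.findall("up:comment", NS):
            if comment.get("type") != "subcellular location":
                continue
            for loc in comment.findall("up:subcellularLocation/up:location", NS):
                yield accession, loc.text, loc.get("evidence", "")


rows = list(subcellular_rows(SAMPLE))
for row in rows:
    print("\t".join(row))
```

The `evidence` value is kept as the raw key string ("1 3"); turning it into a label such as "Manual" needs an extra lookup step.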

That is not the case for all entries though: e.g. http://www.uniprot.org/uniprot/Q96AT9 and http://www.uniprot.org/uniprot/Q96AT9.xml

<dbReference type="GO" id="GO:0005829">
   <property type="term" value="C:cytosol"/>
   <property type="evidence" value="ECO:0000318"/>
   <property type="project" value="GO_Central"/>
</dbReference>
<dbReference type="GO" id="GO:0070062">
   <property type="term" value="C:extracellular exosome"/>
   <property type="evidence" value="ECO:0007005"/>
   <property type="project" value="UniProtKB"/>
</dbReference>

Does that mean the way to get the full information is:

  1. Parse the <subcellularLocation> element for those entries that have it.
  2. For the remaining entries, parse the GO terms with a GO parser and select those that belong to the "Cellular Component" ontology?
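
The two steps above could be sketched in Python roughly as follows. All function names are hypothetical, the namespace URI is an assumption, and the sample entry is a cut-down stand-in (its second GO reference is included only to show a non-component term being skipped). The sketch relies on the "C:" prefix of the `term` property seen in the XML above instead of a full GO parser.

```python
import xml.etree.ElementTree as ET

NS = {"up": "http://uniprot.org/uniprot"}  # ASSUMPTION: UniProt XML namespace

# Cut-down stand-in for an entry that has GO terms but no location comment.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>Q96AT9</accession>
    <dbReference type="GO" id="GO:0005829">
      <property type="term" value="C:cytosol"/>
      <property type="evidence" value="ECO:0000318"/>
      <property type="project" value="GO_Central"/>
    </dbReference>
    <dbReference type="GO" id="GO:0008152">
      <property type="term" value="P:metabolic process"/>
    </dbReference>
  </entry>
</uniprot>"""


def curated_locations(entry):
    """Step 1: locations from the 'subcellular location' comment."""
    found = []
    for comment in entry.findall("up:comment", NS):
        if comment.get("type") != "subcellular location":
            continue
        for loc in comment.findall("up:subcellularLocation/up:location", NS):
            found.append((loc.text, "comment"))
    return found


def go_component_locations(entry):
    """Step 2: GO cross-references whose term starts with 'C:'."""
    found = []
    for ref in entry.findall("up:dbReference", NS):
        if ref.get("type") != "GO":
            continue
        props = {p.get("type"): p.get("value") for p in ref.findall("up:property", NS)}
        term = props.get("term", "")
        if term.startswith("C:"):
            found.append((term[2:], "GO " + props.get("evidence", "")))
    return found


def best_locations(entry):
    """Use the curated comment when present, otherwise fall back to GO."""
    return curated_locations(entry) or go_component_locations(entry)


entry = ET.fromstring(SAMPLE).find("up:entry", NS)
result = best_locations(entry)
print(result)
```

If both sources are wanted side by side, `best_locations` could concatenate the two lists instead of falling back.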

It occurred to me that it would be best to simply reproduce the code that draws the compartment image. Does somebody have access to that?

Related but not the same: what is the Query to find proteins which Subcellular location have Manually-assigned evidence in uniprot ?

SwissProt UniProt Parsing • 3.8k views

Those of us at SIB Swiss-Prot who work on UniProt are offsite; please wait until Thursday ;)

6.4 years ago

It looks like you have meanwhile figured out much of the answer. Here is a summary from the UniProt point of view (all SIB employees were out of town for an institutional event for a couple of days, sorry):

The "Subcellular location" section (https://www.uniprot.org/help/subcellular_location_section) in a UniProtKB entry presents

1) annotations that are directly provided by Swiss-Prot biocurators, in the form of a controlled vocabulary (https://www.uniprot.org/locations) complemented by free-text notes (in UniProtKB/TrEMBL, such information can also be present, added by the automatic annotation pipeline, https://www.uniprot.org/help/automatic_annotation). See also https://www.uniprot.org/help/subcellular_location

2) GO terms from the Cellular Component ontology (https://www.uniprot.org/help/gene_ontology)

To be complete, you would indeed have to get data from both sources, as they may be complementary. To filter the UniProtKB annotations by manual evidence, you will need to use our Evidence codes (documented here https://www.uniprot.org/help/evidences, searchable via the advanced search and subsequent re-use of the RESTful URLs), and to filter the GO annotations by evidence, you can use https://www.uniprot.org/help/gene_ontology, also combined with the advanced search and the RESTful URLs it creates.
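
When working from downloaded XML instead of the advanced search, the evidence filter might look something like the sketch below. It assumes that the `evidence` attribute on a `<location>` holds keys pointing to `<evidence key="..." type="ECO:...">` elements in the same entry, and that the listed ECO codes are the ones used for automatic assertions; both should be verified against https://www.uniprot.org/help/evidences. The ECO types in the sample are invented for illustration and are not the real evidence of Q9YH95, and `evidence_label` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {"up": "http://uniprot.org/uniprot"}  # ASSUMPTION: UniProt XML namespace

# ASSUMPTION: ECO codes that mark automatic assertions; check the help page.
AUTOMATIC_ECO = {"ECO:0000256", "ECO:0000259", "ECO:0000313", "ECO:0007829"}

# Illustrative entry; the evidence types are made up for this example.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>Q9YH95</accession>
    <comment type="subcellular location">
      <subcellularLocation>
        <location evidence="1 3">Nucleus</location>
      </subcellularLocation>
    </comment>
    <evidence key="1" type="ECO:0000250"/>
    <evidence key="3" type="ECO:0000255"/>
  </entry>
</uniprot>"""


def evidence_label(entry, location):
    """Return 'Manual', 'Automatic' or 'Unknown' for one <location> element."""
    key_to_eco = {ev.get("key"): ev.get("type") for ev in entry.findall("up:evidence", NS)}
    keys = location.get("evidence", "").split()
    codes = [key_to_eco[k] for k in keys if k in key_to_eco]
    if not codes:
        return "Unknown"
    # Treat the annotation as manual as soon as one code is not automatic.
    return "Automatic" if all(c in AUTOMATIC_ECO for c in codes) else "Manual"


entry = ET.fromstring(SAMPLE).find("up:entry", NS)
location = entry.find("up:comment/up:subcellularLocation/up:location", NS)
label = evidence_label(entry, location)
print(entry.findtext("up:accession", namespaces=NS), location.text, label, sep="\t")
```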

Please don't hesitate to let us know if you have any additional questions or remarks.

6.4 years ago
Michael 55k

So I got a solution using SQL. First, it looks like assocdb generated by AmiGO is as close as it gets to what I want. This database associates "termdb (above); all manual gene product annotations; electronic annotations (IEA) from all databases other than UniProtKB".

  1. Download the weekly build as SQL tables from here: http://archive.geneontology.org/latest-lite/go_weekly-assocdb-tables.tar.gz You could also download the complete dump and import it into MySQL, but I wanted to import only the required data and use sqlite instead.

  2. Extract the archive into a local directory.

  3. cd to the local dir and open a new sqlite database:

    sqlite3 celloc.db

At the sqlite prompt, run the following code:

-- create schema for the required tables
-- table definitions are the minimal sqlite compatible definitions derived from the MySQL definitions
DROP TABLE IF EXISTS `association`;
CREATE TABLE `association` (
  `id` int(11) NOT NULL,
  `term_id` int(11) NOT NULL,
  `gene_product_id` int(11) NOT NULL,
  `is_not` int(11) DEFAULT NULL,
  `role_group` int(11) DEFAULT NULL,
  `assocdate` int(11) DEFAULT NULL,
  `source_db_id` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

DROP TABLE IF EXISTS `term`;
CREATE TABLE `term` (
  `id` int(11) NOT NULL,
  `name` varchar(255) NOT NULL DEFAULT '',
  `term_type` varchar(55) NOT NULL,
  `acc` varchar(255) NOT NULL,
  `is_obsolete` int(11) NOT NULL DEFAULT '0',
  `is_root` int(11) NOT NULL DEFAULT '0',
  `is_relation` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
);

DROP TABLE IF EXISTS `gene_product`;
CREATE TABLE `gene_product` (
  `id` int(11) NOT NULL,
  `symbol` varchar(128) NOT NULL,
  `dbxref_id` int(11) NOT NULL,
  `species_id` int(11) DEFAULT NULL,
  `type_id` int(11) DEFAULT NULL,
  `full_name` text,
  PRIMARY KEY (`id`)
);

DROP TABLE IF EXISTS `dbxref`;
CREATE TABLE `dbxref` (
  `id` int(11) NOT NULL,
  `xref_dbname` varchar(55) NOT NULL,
  `xref_key` varchar(255) NOT NULL,
  `xref_keytype` varchar(32) DEFAULT NULL,
  `xref_desc` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
);
-- the table flat files are tab separated (this also applies to query output)
.separator "\t"

-- import the data from the table flat files
.import term.txt term
.import gene_product.txt gene_product
.import association.txt association
.import dbxref.txt dbxref

-- not required but speeds up further queries
DELETE FROM dbxref WHERE xref_dbname != 'UniProtKB';
DELETE FROM term WHERE term_type != 'cellular_component';
VACUUM;



-- generate a materialized view
DROP TABLE IF EXISTS uniprot_cellular_localization;
CREATE TABLE uniprot_cellular_localization AS 
 SELECT DISTINCT dbxref.xref_key AS accession, gene_product.symbol, term.name, term.acc
 FROM gene_product
 INNER JOIN  association ON gene_product.id = association.gene_product_id
 INNER JOIN term  ON term.id = association.term_id
 INNER JOIN dbxref ON gene_product.dbxref_id = dbxref.id
 WHERE term.term_type = 'cellular_component';

.headers on

SELECT * FROM uniprot_cellular_localization WHERE accession IN ( 'Q96AT9', 'Q9YH95') ;

-- output:

accession   symbol  name    acc
Q96AT9  RPE cytosol GO:0005829
Q96AT9  RPE extracellular exosome   GO:0070062
Q9YH95  pax5    nucleus GO:0005634
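
To try the join without downloading the full dump, something like the following Python sketch may help. It builds toy versions of the four tables in an in-memory SQLite database (only the columns the query touches; the rows are hand-written to mirror part of the output above, plus one invented non-component association that should be filtered out) and writes the result as tab-separated text.

```python
import csv
import io
import sqlite3

con = sqlite3.connect(":memory:")
# Toy tables: reduced column sets and hand-written rows, not the real dump.
con.executescript("""
CREATE TABLE term (id INTEGER PRIMARY KEY, name TEXT, term_type TEXT, acc TEXT);
CREATE TABLE dbxref (id INTEGER PRIMARY KEY, xref_dbname TEXT, xref_key TEXT);
CREATE TABLE gene_product (id INTEGER PRIMARY KEY, symbol TEXT, dbxref_id INTEGER);
CREATE TABLE association (id INTEGER PRIMARY KEY, term_id INTEGER, gene_product_id INTEGER);
INSERT INTO term VALUES (1, 'cytosol', 'cellular_component', 'GO:0005829');
INSERT INTO term VALUES (2, 'nucleus', 'cellular_component', 'GO:0005634');
INSERT INTO term VALUES (3, 'metabolic process', 'biological_process', 'GO:0008152');
INSERT INTO dbxref VALUES (1, 'UniProtKB', 'Q96AT9');
INSERT INTO dbxref VALUES (2, 'UniProtKB', 'Q9YH95');
INSERT INTO gene_product VALUES (1, 'RPE', 1);
INSERT INTO gene_product VALUES (2, 'pax5', 2);
INSERT INTO association VALUES (1, 1, 1);
INSERT INTO association VALUES (2, 2, 2);
INSERT INTO association VALUES (3, 3, 1);
""")

# Same join as in the answer, with an ORDER BY added for stable output.
QUERY = """
SELECT DISTINCT dbxref.xref_key AS accession, gene_product.symbol, term.name, term.acc
FROM gene_product
INNER JOIN association ON gene_product.id = association.gene_product_id
INNER JOIN term ON term.id = association.term_id
INNER JOIN dbxref ON gene_product.dbxref_id = dbxref.id
WHERE term.term_type = 'cellular_component'
ORDER BY accession, term.acc
"""
rows = con.execute(QUERY).fetchall()

# Write the rows as a tab-separated table with a header line.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["accession", "symbol", "name", "acc"])
writer.writerows(rows)
print(buffer.getvalue(), end="")
```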

Could we not parse the output from third-party services like TogoWS (the JSON is too large to paste here)? Parse the GO entries and, among them, extract those in the C (cellular component) aspect.

http://togows.org/entry/ebi-uniprot/Q96AT9/dr.json


The associations are unfortunately incomplete too. An example: https://www.uniprot.org/uniprot/Q7Q6R1 has only automatic IEA GO annotations, which AmiGO omits, so nothing is found; a manual annotation nevertheless exists in the UniProt entry for this protein. We will likely need the XSLT from "How to map sub-cellular localisation to enteries in uniprot database fasta file" in addition. However, applying xsltproc to the 6 GB Swiss-Prot XML file hits the wall:

 zcat uniprot_sprot.xml.gz |  xsltproc transform.xsl -
 killed

Running the same command on the server yields a file with 496192 lines, after the process grew to a memory footprint of 80 GB.

grep -e "Q7Q6R1" sprot_cl.txt
Q7Q6R1  Cell membrane
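
One way around the memory blow-up might be a streaming parse that keeps only one entry in memory at a time. The sketch below uses `xml.etree.ElementTree.iterparse` from the Python standard library instead of xsltproc; it is not a port of the XSLT, the function name `stream_locations` is hypothetical, the namespace URI is an assumption, and the two-entry sample (compressed in memory) merely stands in for uniprot_sprot.xml.gz.

```python
import gzip
import io
import xml.etree.ElementTree as ET

UP = "{http://uniprot.org/uniprot}"  # ASSUMPTION: UniProt XML namespace


def stream_locations(xml_file):
    """Yield (accession, location) pairs, discarding each entry once processed."""
    context = ET.iterparse(xml_file, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != UP + "entry":
            continue
        accession = elem.findtext(UP + "accession")
        for comment in elem.findall(UP + "comment"):
            if comment.get("type") != "subcellular location":
                continue
            for loc in comment.iter(UP + "location"):
                yield accession, loc.text
        root.clear()  # drop finished entries so memory use stays flat


# Tiny stand-in for the compressed Swiss-Prot dump.
SAMPLE = b"""<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>Q7Q6R1</accession>
    <comment type="subcellular location">
      <subcellularLocation><location>Cell membrane</location></subcellularLocation>
    </comment>
  </entry>
  <entry>
    <accession>Q9YH95</accession>
    <comment type="subcellular location">
      <subcellularLocation><location>Nucleus</location></subcellularLocation>
    </comment>
  </entry>
</uniprot>"""

compressed = gzip.compress(SAMPLE)
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as handle:
    pairs = list(stream_locations(handle))

for accession, location in pairs:
    print(accession, location, sep="\t")
```

For the real file, the handle would come from `gzip.open("uniprot_sprot.xml.gz", "rb")`, and the pairs would be written out as they arrive instead of being collected in a list.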