How To Retrieve The Crossreferences To Other Databases From Pubchem Compounds
13.1 years ago
Pablacious ▴ 630

I have a list of nearly 10,000 PubChem compound identifiers, and I want to retrieve the cross-references that PubChem holds for those compounds to other databases (such as ChEBI, ChEMBL, ChemSpider, LipidMaps, EINECS, etc.). For instance:

http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=9891771

This compound has cross-references to ChEMBL, ChEBI and LipidMaps (they can be seen in the "Depositor Supplied Synonyms" section within "Identification and Related Records" and in "Substance Categorization Classification" within "Classification").

I have tried the ASN.1 download, the SDF download (which doesn't include these fields in the mol annotation), the web service and the download facility, all without much success. Maybe I'm doing something wrong with the web service.

If anyone knows how to do this or has already achieved it, I would really appreciate some help.

chemoinformatics ncbi webservice • 8.3k views


13.1 years ago

The trick is to look at PubChem Substance. For your example compound, this retrieves 8 source substances. For each substance, you can see the data source together with the associated external ID. The same data is contained in the PubChem Substance download files, along with the PubChem compound ID.

This only works if the databases you care about actively deposit their compounds in PubChem. E.g. AFAIK it won't work for CAS.

13.1 years ago

You can also use Bio2RDF for discovering links (which you can easily automate), by following the http://bio2rdf.org/bio2rdf_resource:linkedToFrom, http://bio2rdf.org/bio2rdf_resource:xRef, and http://www.w3.org/2002/07/owl#sameAs links recursively.

For example, follow the :linkedToFrom for:

http://bio2rdf.org/page/pubchem:7847069
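
Following links recursively needs a guard against cycles, since sameAs-style links often point back at each other. Below is a minimal sketch of such a traversal. It does not call Bio2RDF at all: the `linksOf` function is a hypothetical placeholder for whatever lookup you implement to return the linkedToFrom, xRef and sameAs targets of a URI, and `LinkCrawler` and `maxDepth` are illustrative names.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;

class LinkCrawler {

    /**
     * Breadth-first traversal from a start URI.
     * linksOf is a hypothetical lookup returning the URIs linked from a given URI;
     * maxDepth limits how many link hops are followed.
     */
    static Set<String> reachable(String start,
                                 Function<String, Collection<String>> linksOf,
                                 int maxDepth) {
        Set<String> seen = new LinkedHashSet<>();
        seen.add(start);
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(start);
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String uri : frontier) {
                for (String linked : linksOf.apply(uri)) {
                    // Set.add returns false for URIs already visited, which breaks cycles.
                    if (seen.add(linked)) {
                        next.add(linked);
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }
}
```

The depth limit matters in practice because linked-data graphs can fan out quickly; a small value such as 2 or 3 is a reasonable starting point.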

13.1 years ago
Pablacious ▴ 630

For future reference, this is the detailed procedure that I followed, using NCBI's E-utilities (EUtils) web service.

The first step was to submit a POST request to ELink, as in this example (Java, using Jersey as the HTTP client):

(baseURL is always http://eutils.ncbi.nlm.nih.gov/entrez/eutils/)

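// Jersey 1.x client; submitPost is my own helper (not shown) that POSTs the form parameters
WebResource webRes = client.resource(baseURL + "elink.fcgi");
MultivaluedMap<String, String> queryParams = new MultivaluedMapImpl();
queryParams.add("dbfrom", "pccompound");  // source database: PubChem Compound
queryParams.add("db", "pcsubstance");     // target database: PubChem Substance
queryParams.add("linkname", "pccompound_pcsubstance_same");  // substances with the same structure
for (String id : dbFromIds) { // dbFromIds is a list of PubChem CIDs
    queryParams.add("id", id); // one "id" parameter per CID
}
ClientResponse resp = submitPost(webRes, queryParams);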

You get an XML response like the one returned by this equivalent GET request:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pccompound&id=2906&id=100&db=pcsubstance&linkname=pccompound_pcsubstance_same

With the POST version you can submit up to 5000 compound IDs at once. From the resulting XML you need to extract the compound CID to substance SID associations (when you submit several IDs as separate id parameters, as in the example, the compound-substance associations are preserved). You could set the linkname parameter to one of the other available link types, but I wanted substances with the same structure.
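
Extracting the CID to SID associations from the ELink XML could look roughly like the sketch below. It assumes the usual ELink layout, where each submitted ID gets its own `LinkSet` element containing an `IdList` with the source ID and one `Link`/`Id` element per linked record. The class name `ElinkCidToSid` is hypothetical and is not part of the code I actually used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ElinkCidToSid {

    /** Maps each CID found in an ELink response to the list of SIDs linked to it. */
    static Map<String, List<String>> parse(String xml) {
        Map<String, List<String>> cidToSids = new LinkedHashMap<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Real responses carry a DOCTYPE; do not fetch the external DTD while parsing.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList linkSets = doc.getElementsByTagName("LinkSet");
            for (int i = 0; i < linkSets.getLength(); i++) {
                Element linkSet = (Element) linkSets.item(i);
                Element idList = (Element) linkSet.getElementsByTagName("IdList").item(0);
                if (idList == null) {
                    continue;
                }
                // Assumption: one source CID per LinkSet (one "id" parameter per CID).
                String cid = idList.getElementsByTagName("Id").item(0).getTextContent().trim();
                List<String> sids = cidToSids.computeIfAbsent(cid, k -> new ArrayList<>());
                NodeList links = linkSet.getElementsByTagName("Link");
                for (int j = 0; j < links.getLength(); j++) {
                    Element link = (Element) links.item(j);
                    sids.add(link.getElementsByTagName("Id").item(0).getTextContent().trim());
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse ELink XML", e);
        }
        return cidToSids;
    }
}
```

A CID with no linked substances ends up with an empty list, so it is easy to spot compounds for which nothing was found.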

Then, for each group of up to 5000 substance IDs (SIDs, held in the pubchemSubstanceIds list), you make a submission to EPost:

WebResource epostWebRes = client.resource(baseURL + "epost.fcgi");
MultivaluedMap<String, String> queryParamsEPost = new MultivaluedMapImpl();
queryParamsEPost.add("db", "pcsubstance");
// EPost takes the SIDs as a single comma-separated list (StringUtils from Apache Commons Lang)
queryParamsEPost.add("id", StringUtils.join(pubchemSubstanceIds, ","));
ClientResponse respEpost = submitPost(epostWebRes, queryParamsEPost);
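
Splitting the full SID list into groups of at most 5000 before each EPost call can be done with a small helper. This is only a sketch using the JDK; `IdBatcher` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class IdBatcher {

    /** Splits ids into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            // Copy the view so each batch is independent of the original list.
            batches.add(new ArrayList<>(ids.subList(start, end)));
        }
        return batches;
    }

    /** Joins one batch into the comma-separated form used for the "id" parameter. */
    static String toIdParam(List<String> batch) {
        return String.join(",", batch);
    }
}
```

Each batch returned by `partition(allSids, 5000)` would then take the place of pubchemSubstanceIds in one EPost submission.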

From the response, you obtain two values, a WebEnv and a query_key, which you can use with ESummary:

WebResource webRes = client.resource(baseURL + "esummary.fcgi");
MultivaluedMap<String, String> queryParams = new MultivaluedMapImpl();
queryParams.add("db", "pcsubstance");
// epostRes holds the query_key and WebEnv parsed from the EPost response (respEpost)
queryParams.add("query_key", epostRes.getQueryKey());
queryParams.add("WebEnv", epostRes.getWebEnv());
ClientResponse resp = submitPost(webRes, queryParams);
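
The epostRes object used above is not shown in this post. A minimal sketch of such a holder follows, assuming the EPost response is an XML document with one `QueryKey` element and one `WebEnv` element. The class name `EPostResult` and the `fromXml` factory are hypothetical; only the two getter names are taken from the snippet above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class EPostResult {
    private final String queryKey;
    private final String webEnv;

    private EPostResult(String queryKey, String webEnv) {
        this.queryKey = queryKey;
        this.webEnv = webEnv;
    }

    String getQueryKey() { return queryKey; }

    String getWebEnv() { return webEnv; }

    /** Parses the body of an EPost response into its query key and WebEnv values. */
    static EPostResult fromXml(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Real responses carry a DOCTYPE; do not fetch the external DTD while parsing.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return new EPostResult(text(doc, "QueryKey"), text(doc, "WebEnv"));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse EPost XML", e);
        }
    }

    /** Returns the trimmed text of the first element with the given tag name. */
    private static String text(Document doc, String tag) {
        Node node = doc.getElementsByTagName(tag).item(0);
        if (node == null) {
            throw new IllegalArgumentException("Missing <" + tag + "> element");
        }
        return node.getTextContent().trim();
    }
}
```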

This last response is again XML, from which you can parse the names, synonyms, the source identifier (the identifier in the external database) and the source name (the database name) for each submitted PubChem substance ID. The source identifier together with the source name gives you a cross-reference. In the synonyms you can also find identifiers for other databases that don't deposit directly to PubChem (like HSDB or EINECS, as Michael Kuhn pointed out).
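
Pulling the source name and source identifier out of the ESummary XML could be sketched as follows. The layout is an assumption: one `DocSum` element per substance, holding an `Id` element and `Item` elements distinguished by their `Name` attribute, where list-valued items contain nested `Item` elements. The item names are passed in as parameters because I am not certain of the exact names (something like "SourceNameList" and "SourceID"); check them against a real response. `SubstanceXrefParser` and `Xref` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class SubstanceXrefParser {

    /** One cross-reference: PubChem SID, external database name, external identifier. */
    static final class Xref {
        final String sid;
        final String sourceName;
        final String sourceId;

        Xref(String sid, String sourceName, String sourceId) {
            this.sid = sid;
            this.sourceName = sourceName;
            this.sourceId = sourceId;
        }
    }

    /** sourceNameItem and sourceIdItem are the assumed Name attributes of the two items. */
    static List<Xref> parse(String xml, String sourceNameItem, String sourceIdItem) {
        List<Xref> xrefs = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Real responses carry a DOCTYPE; do not fetch the external DTD while parsing.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList docSums = doc.getElementsByTagName("DocSum");
            for (int i = 0; i < docSums.getLength(); i++) {
                Element docSum = (Element) docSums.item(i);
                String sid = docSum.getElementsByTagName("Id").item(0).getTextContent().trim();
                String name = itemValue(docSum, sourceNameItem);
                String id = itemValue(docSum, sourceIdItem);
                // Skip substances that lack either half of the cross-reference.
                if (name != null && id != null) {
                    xrefs.add(new Xref(sid, name, id));
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse ESummary XML", e);
        }
        return xrefs;
    }

    /** Text of the named Item; for list items, the text of the first nested Item. */
    private static String itemValue(Element docSum, String name) {
        NodeList items = docSum.getElementsByTagName("Item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            if (name.equals(item.getAttribute("Name"))) {
                NodeList nested = item.getElementsByTagName("Item");
                Element target = nested.getLength() > 0 ? (Element) nested.item(0) : item;
                return target.getTextContent().trim();
            }
        }
        return null;
    }
}
```

Combined with the CID to SID associations from the ELink step, each `Xref` can be attached back to the compound it came from.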

Keep in mind that, according to the EUtils rules, you shouldn't make requests at intervals of less than 3 seconds. Even with this restriction, processing 14,000 PubChem CIDs took approximately an hour (and that included writing a Lucene index with the results).
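
The spacing between requests can be enforced with a small helper like the sketch below. The interval is a constructor argument (3000 ms for the rule mentioned above) so it can be adjusted if NCBI's usage guidelines differ from what is described here. `RequestThrottle` is a hypothetical name, and the class is not thread-safe.

```java
class RequestThrottle {
    private final long minIntervalMillis;
    private boolean hasPrevious = false;
    private long lastRequestMillis;

    RequestThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Milliseconds still to wait at time nowMillis before the next request is allowed. */
    long millisToWait(long nowMillis) {
        if (!hasPrevious) {
            return 0; // the first request never waits
        }
        long elapsed = nowMillis - lastRequestMillis;
        return Math.max(0, minIntervalMillis - elapsed);
    }

    /** Records that a request was sent at time nowMillis. */
    void markRequest(long nowMillis) {
        hasPrevious = true;
        lastRequestMillis = nowMillis;
    }

    /** Blocks until the minimum interval has passed, then records the request time. */
    void acquire() {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            try {
                Thread.sleep(wait);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
        markRequest(System.currentTimeMillis());
    }
}
```

Calling `acquire()` on a shared `new RequestThrottle(3000)` right before each ELink, EPost and ESummary submission keeps the calls spaced out without scattering sleep statements through the code.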

13.1 years ago

Easily done with the CACTVS toolkit (www.xemistry.com/academic has a free version for academic use). Script snippet:

foreach cid $cidlist {
    set eh [ens create $cid]
    if {![catch {ens get $eh E_CHEBI_ID} id]} { puts "ChEBI: $id" }
    # Same for other identifiers of interest. The only one from your list
    # currently not supported is LipidMaps (I'll see that I can add it).
    # ChemSpider and EINECS IDs require toolkit version 3.395 because their
    # query interface has morphed once more.
    ens delete $eh
}

The sample code performs a fresh lookup at the reference databases, so it does not require registration of the structures at PubChem. 10K compounds will take a while (but it can be scripted multi-threaded if it is urgent and you are willing to code a little more).

If you want to analyse what is in the PubChem substance records, here is another approach:

foreach cid $cidlist {
   set eh [ens create $cid]
   foreach sid [ens get $eh E_SIDSET] {
      set eh2 [ens create SID$sid]
      echo [ens get $eh2 E_NCBI_SUBSTANCE_SOURCE(db)]
      ens delete $eh2
   }
   ens delete $eh
}

This version only contacts PubChem for cid and sid resolution.

13.1 years ago
Anon ▴ 10

Use the Identifier Exchange Service...

2.8 years ago

Although this question was asked years ago, this Java project may be helpful for cross-mapping drug IDs between different databases (DrugBank, ChEMBL, PubChem, UMLS, TTD, KEGG, ZINC): https://github.com/iit-Demokritos/drug_id_mapping

You can use the resulting TSV file directly: https://github.com/iit-Demokritos/drug_id_mapping/blob/main/drug-mappings.tsv as long as you use it only for academic/research purposes and cite the paper referenced in the GitHub repository.
