For future reference, this is the detailed procedure that I followed. I used the EUtils web services from NCBI.
The first step was to submit a POST request using ELink, like in this example (Java, using Jersey as HTTP client):
(baseURL is always: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/)
WebResource webRes = client.resource(baseURL + "elink.fcgi");
MultivaluedMap<String, String> queryParams = new MultivaluedMapImpl();
queryParams.add("dbfrom", "pccompound");
queryParams.add("db", "pcsubstance");
queryParams.add("linkname", "pccompound_pcsubstance_same");
for (String id : dbFromIds) { // dbFromIds is a list of PubChem CIDs
    queryParams.add("id", id); // one "id" parameter per CID
}
// submitPost is a helper (not shown) that sends the parameters as a POST
ClientResponse resp = submitPost(webRes, queryParams);
You get an XML response like the one returned by this equivalent GET request:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pccompound&id=2906&id=100&db=pcsubstance&linkname=pccompound_pcsubstance_same
With the POST version you can submit up to 5000 compound IDs at once. From the XML response you need to extract the compound CID to substance SID associations. Because each ID is sent as a separate id parameter, you don't lose these associations when you submit several IDs at once (as shown in the example). You could change the linkname parameter to one of the other available flavours, but I wanted the same structures.
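A minimal sketch of extracting the CID to SID associations from the ELink XML, using only the JDK's DOM parser. The class name ELinkParser is hypothetical, and the element names (LinkSet, IdList, Id, Link) are what I would expect in the response, so check them against what you actually get back:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: maps each submitted CID to the SIDs linked to it. */
final class ELinkParser {

    static Map<String, List<String>> cidToSids(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Real responses carry a DOCTYPE; do not fetch the external DTD.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            Map<String, List<String>> result = new LinkedHashMap<>();
            NodeList linkSets = doc.getElementsByTagName("LinkSet");
            for (int i = 0; i < linkSets.getLength(); i++) {
                Element linkSet = (Element) linkSets.item(i);
                // Assumed layout: the queried CID is under IdList,
                // the linked SIDs are under LinkSetDb/Link.
                Element idList = (Element) linkSet.getElementsByTagName("IdList").item(0);
                if (idList == null) continue;
                Element cidEl = (Element) idList.getElementsByTagName("Id").item(0);
                if (cidEl == null) continue;
                String cid = cidEl.getTextContent().trim();
                List<String> sids = result.computeIfAbsent(cid, k -> new ArrayList<>());
                NodeList links = linkSet.getElementsByTagName("Link");
                for (int j = 0; j < links.getLength(); j++) {
                    Element link = (Element) links.item(j);
                    Element sidEl = (Element) link.getElementsByTagName("Id").item(0);
                    if (sidEl != null) sids.add(sidEl.getTextContent().trim());
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse ELink response", e);
        }
    }
}
```

Calling ELinkParser.cidToSids(responseBody) gives one map entry per LinkSet, so a CID without links ends up with an empty list rather than disappearing.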
Then, for groups of 5000 substance ids (SIDs, in the pubchemSubstanceIds list), you make a submission to the EPost application:
WebResource epostWebRes = client.resource(baseURL + "epost.fcgi");
MultivaluedMap<String, String> queryParamsEPost = new MultivaluedMapImpl();
queryParamsEPost.add("db", "pcsubstance");
// here all SIDs of the group go into a single comma-separated "id" parameter
queryParamsEPost.add("id", StringUtils.join(pubchemSubstanceIds, ","));
ClientResponse respEpost = submitPost(epostWebRes, queryParamsEPost);
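The "groups of 5000" part of the EPost step can be sketched as a small partitioning helper. IdBatcher and partition are hypothetical names, not something from my original code:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a list of IDs into consecutive batches. */
final class IdBatcher {

    static <T> List<List<T>> partition(List<T> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(ids.subList(start, end)));
        }
        return batches;
    }
}
```

Each element of IdBatcher.partition(allSids, 5000) would then play the role of pubchemSubstanceIds in one EPost submission.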
From the EPost response, you obtain two values, a WebEnv and a query_key, which you can use with ESummary:
WebResource webRes = client.resource(baseURL + "esummary.fcgi");
MultivaluedMap<String, String> queryParams = new MultivaluedMapImpl();
queryParams.add("db", "pcsubstance");
// epostRes holds the query_key and WebEnv parsed from the EPost response
queryParams.add("query_key", epostRes.getQueryKey());
queryParams.add("WebEnv", epostRes.getWebEnv());
ClientResponse resp = submitPost(webRes, queryParams);
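The epostRes object used in the ESummary call is not shown in the snippets. A minimal sketch of how it could be built from the EPost XML follows; the class name EPostResult is hypothetical, and I am assuming the response contains QueryKey and WebEnv elements:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical holder for the two values returned by EPost. */
final class EPostResult {
    private final String queryKey;
    private final String webEnv;

    private EPostResult(String queryKey, String webEnv) {
        this.queryKey = queryKey;
        this.webEnv = webEnv;
    }

    String getQueryKey() { return queryKey; }
    String getWebEnv() { return webEnv; }

    static EPostResult parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Real responses carry a DOCTYPE; do not fetch the external DTD.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return new EPostResult(text(doc, "QueryKey"), text(doc, "WebEnv"));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse EPost response", e);
        }
    }

    // Returns the trimmed text of the first element with the given tag name.
    private static String text(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        if (nodes.getLength() == 0) {
            throw new IllegalArgumentException("Missing <" + tag + "> element");
        }
        return nodes.item(0).getTextContent().trim();
    }
}
```

With this, EPostResult.parse(body) on the EPost response body would give the object whose getQueryKey() and getWebEnv() are passed to esummary.fcgi.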
The ESummary response is again XML, from which you can parse the names, synonyms, source identifier (the identifier in the external database) and source name (the database name) for each submitted PubChem substance ID. The source identifier and source name together give you a cross-reference. In the synonyms you can also find identifiers for other databases that don't deposit directly to PubChem (like HSDB or EINECS, as Michael Kuhn pointed out).
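A rough sketch of pulling those fields out of the ESummary XML. I am assuming the DocSum layout (an Id element followed by Item elements with a Name attribute, with list-valued items nesting further Item elements). ESummaryParser is a hypothetical name, and the exact Item names for pcsubstance should be checked against a real response:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: SID -> (Item name -> values) from an ESummary response. */
final class ESummaryParser {

    static Map<String, Map<String, List<String>>> parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Real responses carry a DOCTYPE; do not fetch the external DTD.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            Map<String, Map<String, List<String>>> result = new LinkedHashMap<>();
            NodeList docSums = doc.getElementsByTagName("DocSum");
            for (int i = 0; i < docSums.getLength(); i++) {
                String id = null;
                Map<String, List<String>> fields = new LinkedHashMap<>();
                // Only look at direct children so nested Items are not double counted.
                for (Node n = docSums.item(i).getFirstChild(); n != null; n = n.getNextSibling()) {
                    if (!(n instanceof Element)) continue;
                    Element el = (Element) n;
                    if ("Id".equals(el.getTagName())) {
                        id = el.getTextContent().trim();
                    } else if ("Item".equals(el.getTagName())) {
                        fields.put(el.getAttribute("Name"), values(el));
                    }
                }
                if (id != null) result.put(id, fields);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse ESummary response", e);
        }
    }

    // A list-valued Item yields its child Items; a plain Item yields its own text.
    private static List<String> values(Element item) {
        List<String> out = new ArrayList<>();
        for (Node n = item.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && "Item".equals(((Element) n).getTagName())) {
                out.add(n.getTextContent().trim());
            }
        }
        if (out.isEmpty()) {
            String text = item.getTextContent().trim();
            if (!text.isEmpty()) out.add(text);
        }
        return out;
    }
}
```

The cross-reference for one SID is then whatever the source-name and source-identifier items contain, and the synonym item can be scanned for identifiers of the other databases.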
Keep in mind that, according to the EUtils rules, you shouldn't make requests at intervals of less than 3 seconds. Even with this restriction, processing 14,000 PubChem CIDs took approximately an hour (and that included writing a Lucene index with the results).
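One simple way to respect the interval between requests is a small throttle that is called before every submission. This is only a sketch: RequestThrottle and awaitTurn are hypothetical names, and the 3-second figure is the one quoted above, so check it against NCBI's current usage policy:

```java
/** Hypothetical helper: enforces a minimum delay between consecutive requests. */
final class RequestThrottle {
    private final long minIntervalNanos;
    private long lastRequestNanos;
    private boolean firstCall = true;

    RequestThrottle(long minIntervalMillis) {
        this.minIntervalNanos = minIntervalMillis * 1_000_000L;
    }

    /** Blocks until the minimum interval since the previous call has elapsed. */
    synchronized void awaitTurn() {
        if (!firstCall) {
            long remaining;
            // Re-check after every sleep so we never return early.
            while ((remaining = minIntervalNanos
                    - (System.nanoTime() - lastRequestNanos)) > 0) {
                try {
                    Thread.sleep(remaining / 1_000_000L + 1);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while throttling", e);
                }
            }
        }
        firstCall = false;
        lastRequestNanos = System.nanoTime();
    }
}
```

In the procedure above you would create one new RequestThrottle(3000) and call awaitTurn() right before each submitPost call, whether it goes to ELink, EPost or ESummary.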
Easily done with the CACTVS toolkit (www.xemistry.com/academic has a free version for academic use).
Script snippet:
foreach cid $cidlist {
    set eh [ens create $cid]
    if {![catch {ens get $eh E_CHEBI_ID} id]} {
        puts "ChEBI: $id"
    }
    # same for other identifiers of interest; the only one from your list
    # currently not supported is LipidMaps (I'll add it). ChemSpider and
    # EINECS require version 3.395 because their query interface has
    # morphed once more
    ens delete $eh
}
The code performs a fresh lookup on the reference databases, so it does not require registration of the structures at PubChem.