Retrieving PubChem IDs
12.7 years ago
Nitin ▴ 170

Hi all,

I have a list of compound names for which I want to retrieve PubChem CIDs. To achieve this I wrote the following Biopython script, but it doesn't seem to be working:

from Bio import Entrez

Entrez.email = "sainitin7@gmail.com"

infile = open("data", "r")
out_put = open("ids_data.csv", "w")

for line in infile.readlines():
    single_id = line
    # Post list of ids to database
    handle = Entrez.epost("pccompound", names=single_id)
    record = Entrez.read(handle)
    # History
    webEnv = record["WebEnv"]
    queryKey = record["QueryKey"]
    # Retrieving information
    data = Entrez.esummary(db="pccompound", webenv=webEnv, query_key=queryKey)
    res = Entrez.read(data)
    for compound in res:
        Name = compound["SynonymList"]
        Cid = compound["Id"]
        print("%s:%s" % (Name, Cid))
        out_put.write("%s:%s\n" % (Name, Cid))

out_put.close()

Ideally, I want output like the following:

Biruvidine : 446727

Can anybody help?

Thanks in advance

Nit

biopython entrez ncbi • 5.8k views

Can you fix the formatting? The example is very hard to read, and you didn't show the current output. It sounds like, given a PubChem identifier such as SID 74891762, you want to get back 'Brivudine: CID446727'. Is that right?




And what exactly do you mean when you say it is not working? What is the error message, if any?


Thanks for fixing the formatting. Could you also include an example of the text in ids_data.csv so we have both sample input AND the desired output?

12.7 years ago

a) You are not using the right tools. Here is a simpler and more robust solution using the Cactvs Chemoinformatics toolkit (see www.xemistry.com/academic for the free academic version):

foreach name [split [string trim [read_file data]] "\n"] {
        if {[catch {ens create $name} eh]} {
                puts "$name : not resolved"
        } elseif {[catch {ens get $eh E_CID} cid]} {
                puts "$name: no CID"
                ens delete $eh
        } else {
                puts "$name: $cid"
                ens delete $eh
        }
}

b) Even this script does not work with "Biruvidine", because the proper name of that compound is "Brivudine". The correct name resolves easily.

Interactive lookup of the name set for a CID:

cactvs>ens create CID446727
ens0
cactvs>ens get ens0 E_NAMESET
{5-[(E)-2-bromoethenyl]-1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)oxolan-2-yl]pyrimidine-2,4-dione} {5-[(E)-2-bromovinyl]-1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)tetrahydrofuran-2-yl]pyrimidine-2,4-dione} {5-[(E)-2-bromovinyl]-1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)-2-tetrahydrofuranyl]pyrimidine-2,4-dione} {5-[(E)-2-bromovinyl]-1-[(2R,4S,5R)-4-hydroxy-5-methylol-tetrahydrofuran-2-yl]pyrimidine-2,4-quinone} 69304-47-8 (E)-5-(2-Bromovinyl)-2'-deoxyuridine (E)-5-(2-Bromovinyl)-deoxyuridine BVDU {Brivudina [INN-Spanish]} Brivudine {Brivudine [INN]} {Brivudinum [INN-Latin]} {CCRIS 2831} Helpin {NSC 633770} {Uridine, 5-(2-bromoethenyl)-2'-deoxy-, (E)-} {Uridine, 5-(2-bromovinyl)-2'-deoxy-, (E)-} trans-5-(2-Bromovinyl)-2'-deoxyuridine Lopac0_000175 (E)-5-(2-Bromovinyl)-dUrd AIDS-070967 AIDS070967 BV-dUrd BrVdUrd Brivudin UA-618 EU-0100175 A-176 5-BROMOVINYLDEOXYURIDINE BVD Bromovinyldeoxyuridine RP-101 Zostex

What do you mean 'You are not using the right tools'? He's using the NCBI Entrez API to query the NCBI PubChem database which seems like a sensible idea.


Yes, of course it is. But he is burdening himself with all the details, and that can be avoided.

Of course, the Cactvs solution uses the same API behind the scenes for the structure-to-CID part. The name resolution relies primarily on the more extensive NCI resolver and uses PubChem/Entrez only as a fallback, so this solution will also work with compound names that are not in PubChem. The toolkit code implements error checking, implicit retries, timeout handling, and so on. Entrez is not exactly the most robust interface in practical operation.
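
For anyone staying with plain Entrez calls, the retry handling mentioned above can be approximated by hand. The following is only a rough sketch: the helper name `call_with_retries` and its parameters are made up for illustration and are not part of Biopython or Cactvs, and it assumes that transient network failures surface as `OSError` subclasses.

```python
import time


def call_with_retries(func, attempts=3, delay=1.0, backoff=2.0,
                      exceptions=(OSError,), sleep=time.sleep):
    """Call func() and retry on the given exceptions, waiting longer each time.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    wait = delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            sleep(wait)       # pause before the next try
            wait *= backoff   # wait longer after each failure
```

It could wrap any single request, for example `call_with_retries(lambda: Entrez.read(Entrez.esummary(db="pccompound", id="446727")))`. Timeouts are not covered here; one option is to set a global limit with `socket.setdefaulttimeout` before making requests.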

12.7 years ago
Peter 6.0k

I would have expected to do this with EFetch, but NCBI doesn't seem to support this database with EFetch: http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html

Here's how I would do it for one ID, a PubChem identifier like CID 446727 - you can of course generalise this to read the IDs from a file etc and use epost and the history as you were above.

from Bio import Entrez
Entrez.email = "sainitin7@gmail.com"
record = Entrez.read(Entrez.esummary(db="pccompound", id="446727", retmode="xml"))
for entry in record:
    print(entry['SynonymList'])

I presume that the first synonym is the one you want. In this case the list you get back is:

['Brivudine', 'BVDU', 'Helpin', 'Brivudin', "(E)-5-(2-Bromovinyl)-2'-deoxyuridine", 'Bromovinyldeoxyuridine', 'Brivudinum [INN-Latin]', 'Brivudina [INN-Spanish]', 'CCRIS 2831', '69304-47-8', 'Brivudine (INN)', 'Brivudine [INN]', "Uridine, 5-(2-bromoethenyl)-2'-deoxy-, (E)-", '(E)-5-(2-Bromovinyl)-deoxyuridine', 'NSC 633770', "trans-5-(2-Bromovinyl)-2'-deoxyuridine", "Uridine, 5-(2-bromovinyl)-2'-deoxy-, (E)-", 'Zostex', 'BVD', '5-BVDU', 'E-5-(2-bromovinyl)-dUrd', 'Z-5-(2-bromovinyl)-dUrd', 'Brivudinum', 'Brivudina', 'BrVdUrd', 'NSC633770', 'BV-dUrd', "5-(2-bromovinyl)-2'-deoxyuridine", 'Bromvinyldesoxyuridin', "5-(2-bromoethenyl)-2'-deoxyuridine", 'Zostex (TN)', "(Z)-5-(2-bromovinyl)-2'-deoxyuridine", 'Lopac0_000175', 'C11H13BrN2O5', 'CHEMBL31634', '5-BROMOVINYLDEOXYURIDINE', 'AC1L9K12', '(E)-5-(2-Bromovinyl)-dUrd', 'UNII-2M3055079H', 'RP-101', 'UA-618', 'CCG-204270', 'NCGC00093656-01', 'NCGC00093656-02', 'NCGC00093656-03', 'LS-160809', 'A-176', 'EU-0100175', 'B 9647', 'D07249', '5-[(E)-2-bromoethenyl]-1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)oxolan-2-yl]pyrimidine-2,4-dione']
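
Generalising this to a list of CIDs with epost and the history might look roughly like the sketch below. It assumes Biopython is installed and that each summary entry carries 'Id' and 'SynonymList' keys as in the output above; the function names `summarise_cids` and `to_rows` and the placeholder email address are illustrative only.

```python
def summarise_cids(cids, email="you@example.org"):
    """Post a list of CID strings once, then fetch their summaries via the history.

    Needs network access; the email address is a placeholder.
    """
    from Bio import Entrez  # imported lazily so to_rows() works without Biopython

    Entrez.email = email
    # Upload the whole ID list in one request; the server stores it in the history.
    post = Entrez.read(Entrez.epost("pccompound", id=",".join(cids)))
    handle = Entrez.esummary(db="pccompound",
                             webenv=post["WebEnv"],
                             query_key=post["QueryKey"])
    return Entrez.read(handle)


def to_rows(summaries):
    """Turn summary entries into 'first synonym : CID' lines."""
    rows = []
    for entry in summaries:
        synonyms = entry.get("SynonymList", [])
        name = synonyms[0] if synonyms else "(no synonym)"
        rows.append("%s : %s" % (name, entry["Id"]))
    return rows
```

Taking the first synonym as the display name is the same presumption made above, and it may not hold for every compound.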

Just download the complete CID-synonym file from the Compound "Extras" folder and grep for the name and all CIDs associated. Or use eUtils which is supported.
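
A rough Python equivalent of the grep approach is sketched here. It assumes the downloaded file is gzip-compressed and has one tab-separated "CID, synonym" pair per line, and that matching names case-insensitively is acceptable; the function names are illustrative, and the exact file name and layout should be checked against the README in the Extras folder.

```python
import gzip


def find_cids(lines, names):
    """Scan 'CID<TAB>synonym' lines and collect the CIDs for each wanted name."""
    # Map lower-cased names back to the spelling the caller used.
    wanted = {n.strip().lower(): n.strip() for n in names if n.strip()}
    hits = {original: [] for original in wanted.values()}
    for line in lines:
        cid, sep, synonym = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # skip lines that do not have the assumed two-column layout
        original = wanted.get(synonym.strip().lower())
        if original is not None and cid not in hits[original]:
            hits[original].append(cid)
    return hits


def find_cids_in_file(path, names):
    """Stream the compressed synonym file line by line instead of loading it whole."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return find_cids(handle, names)
```

Names that never appear in the file, such as a misspelled one, come back with an empty list, which makes them easy to spot.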


I think this post answers the reverse of the question: going from a CID to the name set. The problem was getting the CID from a name.


You're probably right. It would have helped if the question had included an example input as well as the hoped-for output.


The basic idea is correct, though, i.e. use esummary.


Starting with a name, use esearch?
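
A minimal sketch of that esearch route, assuming Biopython is installed and that a plain term search in the pccompound database returns the wanted compound among its hits. The function names `name_to_cids` and `format_line` and the placeholder email address are illustrative only; an unrestricted term search can match more than one compound, so the CIDs are kept as a list.

```python
def name_to_cids(name, email="you@example.org", retmax=5):
    """Search PubChem Compound by name with ESearch and return a list of CID strings.

    Needs network access; the email address is a placeholder.
    """
    from Bio import Entrez  # imported lazily so format_line() works without Biopython

    Entrez.email = email
    handle = Entrez.esearch(db="pccompound", term=name, retmax=retmax)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return list(record["IdList"])


def format_line(name, cids):
    """Format one output line in the 'Name : CID' style asked for in the question."""
    if not cids:
        return "%s : not found" % name
    return "%s : %s" % (name, ", ".join(cids))
```

Looping over the names in the input file and printing `format_line(name, name_to_cids(name))` for each would give lines in the requested format. A misspelled name such as "Biruvidine" would most likely come back as "not found".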
