Given Several Compound Reference Numbers, How To Get The Molecular Files
13.5 years ago
Flow ★ 1.6k

I have a list of chemical compounds from several vendors, such as Sigma-Aldrich, each identified by the vendor's reference number. For example, Sigma-Aldrich N5023 is nordihydroguaiaretic acid. The list is long, about 10,000 compounds, most of them from Sigma-Aldrich. Starting from these reference numbers, what would be the best way to get all the structure files in a standard format such as SMILES, PDB, or MOL2, ready for docking calculations?

structure database • 4.8k views
13.5 years ago
Rich Apodaca ▴ 170

Your primary key is issued and maintained by Sigma-Aldrich, which means the first source for the information you seek should be them. You can apparently request sections of the catalog in SD form here and here. The SD files might (should) contain molfiles, from which you can convert to the format of your choosing. A personal call to someone at Aldrich explaining your use case might also be helpful.

If this doesn't work out, you'll have more work to do. PubChem contains about 35,000 Sigma-Aldrich records as of today (find them by searching for Sigma-Aldrich as the supplier with no other parameters). So your records may or may not be there.

If you still have gaps, you might want to consider using another primary key.

Do you have (or can you get) IUPAC or trivial names? If so, gChem could help you search via Google Spreadsheets and pull molfiles via the (recently-added) getSDF function.

gChem is based on NCI's Chemical Identifier Resolver, which you can experiment with directly.
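For experimenting with the resolver outside a spreadsheet, a minimal Python sketch follows. It assumes the resolver's public URL pattern `https://cactus.nci.nih.gov/chemical/structure/<identifier>/<representation>`; the function names `resolver_url` and `resolve` are made up for this illustration, and whether a given name resolves depends on the service.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of NCI's Chemical Identifier Resolver.
RESOLVER_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier: str, representation: str = "smiles") -> str:
    """Build the resolver URL for one identifier (hypothetical helper)."""
    # Percent-encode the identifier so spaces and brackets survive in the path.
    return f"{RESOLVER_BASE}/{quote(identifier, safe='')}/{representation}"


def resolve(identifier: str, representation: str = "smiles",
            timeout: float = 30.0) -> str:
    """Fetch the requested representation as text (needs network access)."""
    url = resolver_url(identifier, representation)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()
```

Calling `resolve("Nordihydroguaiaretic acid", "sdf")` should return an SD record if the name is known to the service; an unknown identifier will most likely raise an HTTP error, so a batch run should catch it.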


very very good advice, will try it

13.5 years ago

The most convenient and powerful way to access both the name resolver mentioned above and PubChem is the Cactvs toolkit (www.xemistry.com/academic, free for academic users).

Unfortunately, your example does not work, since N5023 is not a registered name. But using this tool you can get your structures from recognized names in any of the desired formats, and process the full list with a small script in a batch.

Example code (interactive, you would write a loop over your reference file):

cactvs>ens create N5023
Error: ens create failed: Failed to decode structure data specification
cactvs>ens create "Nordihydroguaiaretic acid"
ens1
cactvs>ens get ens1 E_NAMESET
{4-[(2S,3R)-4-(3,4-dihydroxyphenyl)-2,3-dimethylbutyl]benzene-1,2-diol} {4-[(2S,3R)-4-(3,4-dihydroxyphenyl)-2,3-dimethyl-butyl]benzene-1,2-diol} {4-[(2S,3R)-4-(3,4-dihydroxyphenyl)-2,3-dimethyl-butyl]pyrocatechol} 500-38-9 27686-84-6 334707-72-1 Lopac0_000877 (R*,S*)-4,4'-(2,3-Dimethylbutane-1,4-diyl)bispyrocatechol {1,2-Benzenediol, 4,4'-(2,3-dimethyl-1,4-butanediyl)bis-, (R*,S*)-} Actinex {CHX 100} {EINECS 248-606-6} Masoprocol {Masoprocol [USAN:INN]} {Masoprocolum [INN-Latin]} {Nordihydroguaiaretic acid (meso-form)} meso-4,4'-(2,3-Dimethyltetramethylene)dipyrocatechol meso-NDGA {27686-84-6 (MESO)} AIDS-025463 ZINC00012342 C10719 {Nordihydroguaiaretic acid} {Actinex (TN)} D04862 {Masoprocol (USAN)} EU-0100877 {Nordihyolroguaiaretic acid} AIDS025463 {meso-Nordihydroguaiaretic acid} CHX-100 Lopac-N-5023 NCGC00015741-01 ZINC00056473 NCGC00015741-02 TNP00263
cactvs>ens get ens1 E_CID
71398
cactvs>ens get ens1 E_PUBCHEM_URL
http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=71398
cactvs>ens get ens1 E_CHEMSPIDER_URL
Error: property computation for "E_CHEMSPIDER_URL" failed: property computation for "E_CHEMSPIDER_ID" failed: no chemspider
cactvs>ens get ens1 E_WIKIPEDIA_URL
http://en.wikipedia.org/wiki/Masoprocol
cactvs>molfile write sample1.sdf ens1
ens1
cactvs>ens get ens1 E_GIF
/tmp/cactvs000JXbXdY.gif
cactvs>
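The batch loop over the reference file is not shown in the session above. As a rough sketch in Python rather than Cactvs' own scripting, it might look like the following. The tab-separated file layout (reference number, then name) and the helper names `read_reference_rows` and `resolve_batch` are assumptions for illustration; `lookup` stands for whatever resolver you end up using and is expected to return a structure string or `None`.

```python
import csv
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def read_reference_rows(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse tab-separated lines of 'reference number<TAB>name'."""
    rows = []
    for record in csv.reader(lines, delimiter="\t"):
        # Skip blank or incomplete lines instead of failing the whole run.
        if len(record) >= 2 and record[0].strip():
            rows.append((record[0].strip(), record[1].strip()))
    return rows


def resolve_batch(
    rows: Iterable[Tuple[str, str]],
    lookup: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Try the reference number first, then the name; keep hits and misses."""
    found: Dict[str, str] = {}
    missing: List[str] = []
    for reference, name in rows:
        structure = lookup(reference) or (lookup(name) if name else None)
        if structure:
            found[reference] = structure
        else:
            missing.append(reference)
    return found, missing
```

Keeping the list of misses separate makes it easy to retry them later with another source or another primary key.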
13.5 years ago

I see two options: the PubChem API and the ChemSpider API. PubChem has more content but lower quality (it is the union of all the content it can get), whereas ChemSpider has higher quality but misses, for example, N5023.

Rich Apodaca has posted some nice examples on querying the PubChem API. PubChem will give you the SMILES, InChI, and 2D MOL in an SDF format.
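As one possible illustration, here is a small Python sketch against PubChem's PUG REST interface. It is not taken from the examples mentioned above. The CID 71398 comes from the Cactvs session earlier in the thread; the helper names are invented here. SMILES can be requested through the same property route, with the exact property name taken from the PUG REST documentation.

```python
from typing import Sequence
from urllib.request import urlopen

# Base URL of PubChem's PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_sdf_url(cid: int, record_type: str = "2d") -> str:
    """URL for the SD record of one compound ('2d' or '3d' coordinates)."""
    return f"{PUG_REST}/compound/cid/{cid}/SDF?record_type={record_type}"


def compound_property_url(
    cid: int, properties: Sequence[str] = ("InChI", "InChIKey")
) -> str:
    """URL for a CSV table of computed properties of one compound."""
    return f"{PUG_REST}/compound/cid/{cid}/property/{','.join(properties)}/CSV"


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download a URL and return its body as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_text(compound_sdf_url(71398))` would download the 2D record for the compound from the question. A 3D conformer, where PubChem has one, may be a better starting point for docking.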


Great! But is there any way to get the CAS number from the Sigma identifier?


I think you can just plug in your identifiers instead of the CAS numbers. AFAIK, PubChem doesn't treat CAS numbers any differently from other kinds of IDs.
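A hedged sketch of what "plugging in" an identifier could look like with PUG REST name lookup in Python. Whether PubChem actually lists a given catalog number as a synonym is not guaranteed, so the code treats a 404 response as "no match". The helper names are hypothetical.

```python
from typing import List
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_to_cids_url(identifier: str) -> str:
    """URL that asks PubChem for the CIDs matching a name or synonym."""
    return f"{PUG_REST}/compound/name/{quote(identifier, safe='')}/cids/TXT"


def parse_cid_list(text: str) -> List[int]:
    """Turn the plain-text response (one CID per line) into integers."""
    return [int(token) for token in text.split() if token.isdigit()]


def lookup_cids(identifier: str, timeout: float = 30.0) -> List[int]:
    """Return matching CIDs, or an empty list if PubChem reports no hit."""
    try:
        with urlopen(name_to_cids_url(identifier), timeout=timeout) as response:
            return parse_cid_list(response.read().decode("utf-8"))
    except HTTPError as error:
        if error.code == 404:  # identifier not known as a name or synonym
            return []
        raise
```

An identifier can map to more than one CID, so it is worth inspecting multi-hit results by hand before feeding them into docking.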

13.1 years ago
Anon ▴ 10

Not sure why people think ChemSpider has higher quality... I guess it might after someone notices an error and then fixes it... but is that really happening to many records?
