Is there a file mapping cmap names in the Connectivity Map to compound structures (e.g. PubChem CID or DrugBank ID)?
10.6 years ago
Zhilong Jia ★ 2.2k

Is there a file that describes the relationship between cmap names (Connectivity Map) and CIDs (PubChem) or DBxxx IDs (DrugBank)?

I have a file listing cmap names (drug names), and I want to get the structure files (such as SMILES or SDF). However, some cmap names cannot be found in PubChem or DrugBank.

Thank you.

drugbank cmap pubchem • 6.4k views

It would make everyone's life easier if Connectivity Map became a PubChem submitting source (anyone from that crew listening?). Then the mappings would be taken care of and advanced analysis would become possible inside PubChem (e.g. exactly which compounds are in cmap and/or/not DrugBank).


Anyone from the cmap team actually following this post?


I emailed cmap-help, but have had no reply so far.

10.6 years ago
wdiwdi ▴ 380

This is a simple scripting task for the Cactvs Cheminformatics Toolkit (free academic downloads available at https://www.xemistry.com/academic/).

The scripts below read the original CMAP Excel file (you need to store it as xlsx; there is no table reader for the old xls format) and write an SDF file with the structure in the CTAB section and both the PubChem CID and the DrugBank ID as data fields, if they can be determined (there are failures; your observation is correct). Since DrugBank has just completely revamped its interface and turned everything upside down, you also need the latest DrugBank ID retriever property definition, which is not yet included in the current academic packages. You can get it directly from me.

Scripted in Tcl:

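# Read the CMAP instance table; the first row holds the column names
set th [table read cmap_instances_02.xlsx colnames 1]
# Open the output SDF; the CID and DrugBank ID fields are computed on write
set fh [molfile open cmap_tcl.sdf w writelist "E_CID E_DRUGBANK_ID" writeflags compute]
puts "Process [table get $th nrows] table rows"
table loop $th row {
    # Column index 2 is taken as the compound name
    set name [lindex $row 2]
    # Try to resolve the name to a structure; report names that fail
    if {[catch {ens create name:$name} eh]} {
        puts stderr "Cannot resolve name $name"
    } else {
        molfile write $fh $eh
        ens delete $eh
    }
}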

Or scripted in Python (sponsored by Vertex Inc.):

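# Read the CMAP instance table; the first row holds the column names
th=Table.Read('cmap_instances_02.xlsx',{'colnames':True})
# Open the output SDF; the CID and DrugBank ID fields are computed on write
fh=Molfile('cmap_py.sdf','w',{'writelist':'E_CID E_DRUGBANK_ID','writeflags':'compute'})
print('Process',th.nrows,' table rows')
# The loop body is passed as a string; row[2] is taken as the compound name
th.loop(variable='row',function="""
try:
    eh=Ens('name:'+row[2])
    fh.write(eh)
    eh.delete()
except Exception:
    print('Cannot resolve name',row[2])
""")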
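
To see which records actually received identifiers, the data fields of the resulting SDF can be read back with plain Python. The sketch below is not part of the Cactvs scripts above; it only relies on the general SDF layout (data items introduced by a `> <NAME>` header after `M  END`, records terminated by `$$$$`). The function name `iter_sd_fields` and the two-record sample text are made up for illustration.

```python
import io

def iter_sd_fields(handle):
    """Yield one {field name: value} dict per SDF record."""
    in_data, fields, current = False, {}, None
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line.startswith("$$$$"):              # record terminator
            yield {k: "\n".join(v) for k, v in fields.items()}
            in_data, fields, current = False, {}, None
        elif not in_data:
            # Skip the molfile part; data items start after "M  END"
            in_data = line.startswith("M  END")
        elif line.startswith(">") and "<" in line:
            # Header such as "> <E_CID>": the name sits between < and >
            current = line[line.index("<") + 1 : line.rindex(">")]
            fields[current] = []
        elif current is not None and line.strip():
            fields[current].append(line)          # value line of the open field
        else:
            current = None                        # blank line closes the field

# Illustrative sample: one record with both fields, one without any
SAMPLE = """aspirin


  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <E_CID>
2244

> <E_DRUGBANK_ID>
DB00945

$$$$
no_ids


  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
$$$$
"""

records = list(iter_sd_fields(io.StringIO(SAMPLE)))
without_cid = sum(1 for r in records if "E_CID" not in r)
print(len(records), "records,", without_cid, "without a CID")
```

For a real run, replace `io.StringIO(SAMPLE)` with `open('cmap_py.sdf')`; the records lacking `E_CID` are the ones that need manual curation.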

Thank you. But as you said, some cmap names do not map cleanly to a PubChem CID. This is the key point. On the PubChem FTP site there is a file, CID-Synonym-filtered, which gives the relationship between CIDs and drug synonyms, but the cmap names are sometimes special. Once all the CIDs for the cmap names are known, it's easy to obtain the structure files.

cmap name: The name given to a perturbagen (or group of closely related perturbagens) by cmap curators. For small molecules, the cmap name is always the recommended ('r') or provisional ('p') INN, if one is available. Otherwise a cmap name is selected from amongst the United States Adopted Name (USAN), the British Approved Name (BAN), the monograph titles from Martindale: The Complete Drug Reference or The Merck Index. Different salts of the same compound are given the same cmap name, unless a specific salt is the INN. For example, the cmap name for both propiomazine hydrochloride and propiomazine maleate is "propiomazine" (which is the rINN) but the cmap name for isosorbide dinitrate is "isosorbide dinitrate" since this is the rINN.

Source: http://www.broadinstitute.org/cmap/help_topics_frames.jsp
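
Because CID-Synonym-filtered is a multi-gigabyte file, one option is to stream it line by line and keep only the synonyms that match the cmap names, rather than loading it all at once. The sketch below assumes the file's two-column, tab-separated layout (CID, then synonym) and case-insensitive exact matching; the function names and the sample lines are made up for illustration, and names that still have no hit would need manual curation.

```python
import gzip

def find_cids(lines, wanted_names):
    """Return {lowercased name: sorted CIDs} for names found among the synonyms."""
    wanted = {n.strip().lower() for n in wanted_names}
    hits = {}
    for line in lines:
        cid, _, synonym = line.rstrip("\n").partition("\t")
        key = synonym.strip().lower()
        if key in wanted and cid.isdigit():
            hits.setdefault(key, set()).add(int(cid))
    return {name: sorted(cids) for name, cids in hits.items()}

def find_cids_in_file(path, wanted_names):
    """Stream a (possibly gzipped) synonym file without reading it into memory."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        return find_cids(handle, wanted_names)

# Small in-memory sample in the same "CID<TAB>synonym" layout
SAMPLE_LINES = [
    "2244\taspirin\n",
    "2244\tacetylsalicylic acid\n",
    "5904\tbenzylpenicillin\n",
]
names = ["Benzylpenicillin", "aspirin", "not-a-cmap-name"]
hits = find_cids(SAMPLE_LINES, names)
missing = sorted(n for n in names if n.lower() not in hits)
print(hits)
print("unmatched:", missing)
```

Only the set of wanted names and the hits are held in memory, so the size of the synonym file affects run time but not memory use.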


OK, so cmap should (please) simply submit to PubChem. They can add whatever they like (as long as it is useful) in the synonym or comment lines of the SID. I can't see why cmap would need any unique names anyway (except for novel structures?), but those can go in too. It's only necessary for the chemical structures to be correct (which they have probably curated anyway, or they can run a checker and/or InChIKey intersects pre-submission). Inside PubChem, the heuristics of name and synonym merging are looked after during the update of the CID. Users can then query via the SID fields, the CID fields, or both (e.g. all the INNs, USANs and BANs are in there).


I believe the cmap name should be checked carefully when looking up the CID.

For instance:

instance_id , cmap_name, catalog_name
3577, benzylpenicillin,  Benzylpenicillin sodium [69-57-8]

Searching for benzylpenicillin gives CID 5904; Benzylpenicillin sodium gives SID 23668834; and the CAS number 69-57-8 gives CID 23668834. As a result, the CID should be 23668834.

There are some similar examples, especially in Prestwick_xxx.

And in some cases, the CAS number does not refer to the same compound as the cmap_name or catalog_name.
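
A check like this can be partly automated by splitting the catalog_name into its salt-specific name and the bracketed CAS number, so that each can be looked up separately and compared with the cmap_name result. The sketch below assumes the catalog_name layout seen in the example row ("name [CAS]"); the function names are made up. The check-digit test uses the standard CAS rule (weighted digit sum modulo 10). It catches mistyped numbers only and cannot tell whether a valid CAS number belongs to the intended compound.

```python
import re

# "Benzylpenicillin sodium [69-57-8]" -> name part and CAS part
CATALOG_RE = re.compile(r"^(?P<name>.*?)\s*\[(?P<cas>\d{2,7}-\d{2}-\d)\]\s*$")

def split_catalog_name(catalog_name):
    """Return (name, cas); cas is None when no bracketed CAS number is present."""
    text = catalog_name.strip()
    match = CATALOG_RE.match(text)
    if match is None:
        return text, None
    return match.group("name"), match.group("cas")

def cas_checksum_ok(cas):
    """Validate the CAS check digit: weights 1, 2, 3... from the right of the body."""
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check

name, cas = split_catalog_name(" Benzylpenicillin sodium [69-57-8]")
print(name, cas, cas_checksum_ok(cas))
```

Rows where the cmap_name, the catalog name and the CAS number resolve to different CIDs are the ones worth inspecting by hand.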


Hi, I've found the CID-Synonym-filtered file, but it's nearly 7 GB, which is too large to handle. How do you deal with the file? Thanks.


Can you share the link?

10.1 years ago
cdsouthan ★ 1.9k

I just noticed that UniChem has 35320 LINCS compound structures loaded:

https://www.ebi.ac.uk/unichem/ucquery/sourceDetails/25

I need to look at the file to see what the mapping is, but InChIKey would be useful.
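
If both sides of a mapping carry InChIKeys, linking them is a simple join that needs no name matching. The sketch below is a generic illustration, not the UniChem file format: the identifiers (`cmap:...`, `db:...`) are hypothetical, and the keys are sample data that should be verified against the source databases. Matching on the first 14-character block of the key links structures with the same connectivity that differ in stereochemistry or isotopes; it does not link a salt to its parent compound.

```python
def connectivity_block(inchikey):
    """First block of an InChIKey (14 characters), encoding the connectivity."""
    return inchikey.split("-")[0]

def join_on_inchikey(left, right, skeleton_only=False):
    """Pair ids from two {id: InChIKey} dicts that share a key (or first block)."""
    key = connectivity_block if skeleton_only else (lambda k: k)
    index = {}
    for right_id, inchikey in right.items():
        index.setdefault(key(inchikey), []).append(right_id)
    return [
        (left_id, right_id)
        for left_id, inchikey in left.items()
        for right_id in index.get(key(inchikey), [])
    ]

# Hypothetical identifiers with sample keys
left = {
    "cmap:aspirin": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "cmap:alanine": "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",    # no stereo layer
}
right = {
    "db:aspirin": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "db:L-alanine": "QNAYBMKLOCPYGJ-REOHSZLRSA-N",    # defined stereo
}
exact = join_on_inchikey(left, right)
loose = join_on_inchikey(left, right, skeleton_only=True)
print(exact)
print(loose)
```

The exact join is the safer default; the first-block join is useful for finding candidates that a curator then confirms.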
