This is a simple scripting task for the Cactvs Cheminformatics Toolkit (free academic downloads available at https://www.xemistry.com/academic/).
The scripts below read the original CMAP Excel file (you need to save it as xlsx, since there is no table reader for the old xls format) and write an SDF file with the structure in the CTAB section and both the PubChem CID and the DrugBank ID as data fields, if they can be determined (there are failures, so your observation is correct). Since DrugBank has just completely revamped its interface and turned everything upside down, you also need the latest DrugBank ID retriever property definition, which is not yet included in the current academic packages. You can get it directly from me.
Scripted in Tcl:
set th [table read cmap_instances_02.xlsx colnames 1]
set fh [molfile open cmap_tcl.sdf w writelist "E_CID E_DRUGBANK_ID" writeflags compute]
puts "Process [table get $th nrows] table rows"
table loop $th row {
    set name [lindex $row 2]
    if {[catch {ens create name:$name} eh]} {
        puts stderr "Cannot resolve name $name"
    } else {
        molfile write $fh $eh
        ens delete $eh
    }
}
Or scripted in Python (sponsored by Vertex Inc.):
th=Table.Read('cmap_instances_02.xlsx',{'colnames':True})
fh=Molfile('cmap_py.sdf','w',{'writelist':'E_CID E_DRUGBANK_ID','writeflags':'compute'})
print('Process',th.nrows,' table rows')
th.loop(variable='row',function="""
try:
    eh=Ens('name:'+row[2])
    fh.write(eh)
    eh.delete()
except:
    print('Cannot resolve name',row[2])
""")
It would make everyone's life easier if Connectivity Map became a PubChem submitting source (anyone from that crew listening?). Then the mappings would be taken care of, and advanced analysis would become possible inside PubChem (e.g. determining exactly which compounds are in cmap and/or/not in DrugBank).
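Until such a mapping exists, a rough local tally of how many lookups succeeded can be taken from the SDF written by the scripts above. The following is a minimal sketch in plain Python (no Cactvs needed). It assumes the data fields appear in the file under the names given in writelist, i.e. with `> <E_CID>` and `> <E_DRUGBANK_ID>` header lines; that is an assumption about the output, so adjust the names if your file differs. The function name count_sdf_fields and the sample records are made up for illustration.

```python
def count_sdf_fields(lines, fields=("E_CID", "E_DRUGBANK_ID")):
    """Return (number of records, {field: records with a non-empty value}).

    `lines` is any iterable of text lines, e.g. an open file object.
    Field names are assumed to match the SDF data headers (hypothetical here).
    """
    total = 0
    counts = {f: 0 for f in fields}
    seen = set()      # fields with a value in the current record
    pending = None    # field whose value is expected on the next line
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == "$$$$":          # standard SDF record terminator
            total += 1
            for f in seen:
                counts[f] += 1
            seen = set()
            pending = None
            continue
        if pending is not None:
            if line.strip():                # non-empty first value line
                seen.add(pending)
            pending = None
            continue
        if line.startswith(">"):            # data header such as "> <E_CID>"
            for f in fields:
                if "<" + f + ">" in line:
                    pending = f
                    break
    return total, counts


# Two made-up records: the first has both IDs, the second only a CID.
SAMPLE_SDF = """compound_a
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <E_CID>
111

> <E_DRUGBANK_ID>
DB99999

$$$$
compound_b
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <E_CID>
222

$$$$
"""

total, counts = count_sdf_fields(SAMPLE_SDF.splitlines())
print("records:", total, "with CID:", counts["E_CID"],
      "with DrugBank ID:", counts["E_DRUGBANK_ID"])
```

To run it on real output, pass an open file instead of the sample, for example `count_sdf_fields(open('cmap_py.sdf'))`. A missing field and an empty field are both counted as a failed lookup. Names that could not be resolved at all never reach the SDF, so compare the record total against the row count printed by the scripts.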
Anyone from the cmap team actually following this post?
I mailed cmap-help, but no reply so far.