I have two CSV files of drug-related data. The first specifies drugs with CHEMBL identifiers, whereas the second contains DrugBank and PubChem IDs. I need to compare these two files for overlap in their drug contents. Both files contain drug names in string format, but working with those is tricky: a single row/drug often contains several synonyms, and the two files are unlikely to list the same synonyms for a particular drug, so accurate name matching seems like it will be challenging.
I'm looking for a simple way (e.g. an existing function or website) that will allow me to convert between my CHEMBL IDs in the first file, and my DrugBank & PubChem IDs in the second file. I have performed a fairly extensive search, but am surprised that I'm not finding e.g. an R or Python function, or a web-based tool, that would allow me to do this. [This site is similar to what I need, with lots of options for the "From" format, but unfortunately, no useful options for the "To" format: http://cts.fiehnlab.ucdavis.edu/conversion/batch ]. I also located this Jupyter Notebook (http://nbviewer.jupyter.org/url/git.dhimmel.com/drugbank/unichem-map.ipynb) to match DrugBank compounds to external resources using UniChem, but for my purposes, this Notebook seems far too complex for the simple conversion I'm seeking.
Any suggestions about resources that might assist with this drug ID conversion task will be much appreciated. Thanks!!
This is easily done with the Cactvs Cheminformatics Toolkit (free academic packages are available at www.xemistry.com/academic; they include both a loadable Python module and a stand-alone Python interpreter with chemistry extensions). The toolkit can decode the three IDs you are using (and many more) into structure objects, and the fastest way to compare these is by computing a structure hashcode. There is no name/synonym matching involved; this works purely on structural connectivity.
Here are some interactive commands in the Python version, comparing aspirin via its different DB IDs, and also directly computing the database IDs for structures from a different source:
There is a chemistry-aware table object which helps you with the processing of table data files. I'd be surprised if this required more than 10 lines of script code.
Alternatively, use the ID mapping provided by UniChem (https://www.ebi.ac.uk/unichem/). More than 50 source databases, including ChEMBL, DrugBank, and PubChem, are processed to provide full source-to-source mappings.
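Once you have a mapping between the two ID types, finding the overlap is just a lookup. Here is a minimal sketch, assuming you have saved a two-column, tab-separated ChEMBL-to-DrugBank mapping. The file layout, the function names `load_mapping` and `find_overlap`, and the sample rows are my own illustrative assumptions, not UniChem's actual export format, so adjust the parsing to whatever you download.

```python
import csv
import io


def load_mapping(handle):
    """Read tab-separated (chembl_id, drugbank_id) pairs into a dict of sets."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, malformed, or comment lines
        chembl_id, drugbank_id = row[0].strip(), row[1].strip()
        # One ChEMBL ID may map to several DrugBank IDs, hence a set.
        mapping.setdefault(chembl_id, set()).add(drugbank_id)
    return mapping


def find_overlap(chembl_ids, drugbank_ids, mapping):
    """Return sorted (chembl_id, drugbank_id) pairs found in both ID lists."""
    drugbank_ids = set(drugbank_ids)
    return sorted(
        (chembl_id, drugbank_id)
        for chembl_id in set(chembl_ids)
        for drugbank_id in mapping.get(chembl_id, ())
        if drugbank_id in drugbank_ids
    )


# Illustrative mapping rows (aspirin, ibuprofen, paracetamol).
# In practice, open the downloaded mapping file instead of using io.StringIO.
MAPPING_TEXT = "CHEMBL25\tDB00945\nCHEMBL521\tDB01050\nCHEMBL112\tDB00316\n"
mapping = load_mapping(io.StringIO(MAPPING_TEXT))

# Hypothetical ID columns pulled from the two CSV files.
file1_ids = ["CHEMBL25", "CHEMBL521", "CHEMBL999999"]
file2_ids = ["DB00945", "DB00316"]

overlap = find_overlap(file1_ids, file2_ids, mapping)
print(overlap)
```

The same approach should work for the PubChem IDs in the second file: load a second ChEMBL-to-PubChem mapping and run the lookup against that column.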
Many thanks, Zhilong! That is exactly what I needed!
If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.