What similarity algorithms are normally used to compare slightly different but related SMILES strings (e.g. Oc1ccc(cc1)\C=C\C(=O)c2ccc(O)cc2O vs O=C(/C=C/c1ccccc1)c2ccccc2)?
See this by Andrew Dalke.
In it, he references:
"Lingos, Finite State Machines, and Fast Similarity Searching", J. A. Grant, J. A. Haigh, B. T. Pickup, A. Nicholls, and R. A. Sayle, J. Chem. Inf. Model. 46(5) (2006) p1912-1918.
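For a feel of the LINGO idea, here is a minimal sketch in Python. It is a simplification, not the procedure from the paper: it cuts each SMILES into overlapping 4-character substrings, replaces every digit with `0` (which also touches isotope and charge numbers), and compares the two multisets with a Tanimoto-style ratio. The function names `lingos` and `lingo_tanimoto` are my own.

```python
import re
from collections import Counter


def lingos(smiles: str, q: int = 4) -> Counter:
    """Count the overlapping substrings of length q in a SMILES string."""
    # Simplifying assumption: map every digit to '0' so that different
    # ring-closure numbering does not produce different substrings.
    normalized = re.sub(r"\d", "0", smiles)
    return Counter(normalized[i:i + q] for i in range(len(normalized) - q + 1))


def lingo_tanimoto(smiles_a: str, smiles_b: str, q: int = 4) -> float:
    """Multiset Tanimoto: shared substring counts over combined counts."""
    counts_a, counts_b = lingos(smiles_a, q), lingos(smiles_b, q)
    shared = sum((counts_a & counts_b).values())    # minimum count per substring
    combined = sum((counts_a | counts_b).values())  # maximum count per substring
    return shared / combined if combined else 0.0


# The two SMILES from the question (raw strings because of the backslashes).
a = r"Oc1ccc(cc1)\C=C\C(=O)c2ccc(O)cc2O"
b = r"O=C(/C=C/c1ccccc1)c2ccccc2"
print(lingo_tanimoto(a, b))
```

Because this works on the text alone, the score depends on how each SMILES was written, so both inputs should come from the same canonicalization.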
He also looks at using zlib compression as a way to estimate the similarity of two SMILES strings.
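One common way to turn compression into a similarity measure is the normalized compression distance (NCD). Whether this is exactly what the linked post does is an assumption on my part; the sketch below only shows the general technique with Python's standard `zlib` module, and the names `compressed_size` and `ncd` are my own.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs for the text at maximum compression."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: lower means more shared structure."""
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


first = r"Oc1ccc(cc1)\C=C\C(=O)c2ccc(O)cc2O"
second = r"O=C(/C=C/c1ccccc1)c2ccccc2"
print(ncd(first, second))
```

SMILES strings are short, so the fixed overhead of the zlib format is a large part of each size. Treat the values as a rough ranking signal rather than a calibrated distance.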
Comparing SMILES strings directly only makes sense when you use canonical SMILES. More commonly, you parse the SMILES into a chemical graph and compare the actual graphs, so that it does not matter that there can be multiple SMILES for the same molecule. From there, I suggest using a fingerprint as the representation, for which you can calculate the similarity with the Tanimoto coefficient.
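The Tanimoto comparison mentioned above is simple once the fingerprints exist. Here is a minimal sketch, assuming each fingerprint is given as the set of its "on" bit positions. The two bit sets are made up for illustration; real ones would come from a toolkit such as the CDK.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient: bits set in both / bits set in either."""
    either = len(bits_a | bits_b)
    # Convention chosen here: two empty fingerprints score 0.0.
    return len(bits_a & bits_b) / either if either else 0.0


# Hypothetical fingerprints, NOT computed from real molecules.
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 9, 12, 20, 31}
print(tanimoto(fp_a, fp_b))
```

A value of 1.0 means the two fingerprints are identical, which does not guarantee that the molecules are.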
Example code using the CDK and R can be found in this vignette using the rcdk package.
To expand: there can be many SMILES strings for the same chemical structure, so it doesn't make sense to compare the strings themselves.