Dear all,
- I would be thankful for pointers to regular expressions that match SMILES, InChI, and other text-based notations of chemical structure. I started with SMILES and am now moving on to InChI.
- Do you know of a good regular expression library website for bioinformatics (other than the regexlib website)?
Here's a Perl-compatible regex (JavaScript) for validating SMILES strings beyond a length of 5:
x.trim().match(/^([^J][0-9BCOHNSOPrIFla@+\-\[\]\(\)\\=#$]{6,})$/ig)
Generic version:
/^([^J][A-Za-z0-9@+\-\[\]\(\)\\=#$]+)$/
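To make the behaviour of the generic pattern concrete, here is a minimal sketch that wraps it in a helper. The names `SMILES_LIKE` and `looksLikeSmiles` are hypothetical and only used for illustration. Note that this is a character filter, not a SMILES parser: it accepts unbalanced brackets, it only rejects "J" in the first position, and it rejects "/" stereo bonds because "/" is not in the character class.

```javascript
// Generic pattern from the post: one character that is not "J", followed by
// one or more letters, digits, or the listed SMILES punctuation characters.
const SMILES_LIKE = /^([^J][A-Za-z0-9@+\-\[\]\(\)\\=#$]+)$/;

// Hypothetical helper: true when the trimmed input passes the character filter.
function looksLikeSmiles(input) {
  return SMILES_LIKE.test(String(input).trim());
}

console.log(looksLikeSmiles("CC(=O)O")); // true  (acetic acid)
console.log(looksLikeSmiles("C C"));     // false (space is not allowed)
console.log(looksLikeSmiles("C(C"));     // true  (unbalanced, but passes the filter)
```

A filter like this is probably best used as a cheap first check before handing the string to a real parser.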
PS: It's not that I am unfamiliar with regex; it just feels senseless to reinvent the wheel over and over again.
Searching for pieces of code (SourceForge, Bitbucket, GitHub, code.google.com) is still a challenge. Am I alone in this?
PPS: The only letter that does not appear on the Periodic Table is "J".
See:
- <script src='https://gist.github.com/1312860'></script>
- http://www.google.com/codesearch#/
- http://www.cavdar.net/2008/08/01/my-top-10-source-code-search-engines/ (all of these are pretty much useless for bio/chem-informatics)
Update: added gist.
It sounds like you might be re-inventing the wheel. I agree with Michael's answer; an added benefit is that you automatically support any file type that CDK or Open Babel supports. For example, if you just want to check that the input is indeed a molecule, you could check that Open Babel is able to convert it.
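As a rough sketch of that idea in JavaScript (Node.js), the check could shell out to the `obabel` command-line tool and see whether a conversion comes back. The helper names `runObabel` and `isConvertible` are hypothetical. The sketch assumes `obabel` is on the PATH, uses its `-:` option for an inline SMILES string and `-ocan` for canonical SMILES output, and assumes that a failed conversion shows up as a non-zero exit status or empty output, which may differ between Open Babel versions.

```javascript
const { spawnSync } = require("child_process");

// Default runner (assumes the `obabel` CLI is installed): reads an inline
// SMILES string via "-:" and writes canonical SMILES via "-ocan".
// No shell is involved, so the input is passed as a single argument.
function runObabel(smiles) {
  return spawnSync("obabel", ["-:" + smiles, "-ocan"], { encoding: "utf8" });
}

// Hypothetical helper: treats the input as a molecule when the conversion
// exits cleanly and writes something to stdout. The runner is a parameter
// so the decision logic can be exercised without Open Babel installed.
function isConvertible(smiles, run = runObabel) {
  const result = run(String(smiles).trim());
  return !result.error &&
    result.status === 0 &&
    typeof result.stdout === "string" &&
    result.stdout.trim().length > 0;
}
```

Calling `isConvertible("CCO")` would then run the real tool, and swapping `-ocan` for another output format would work the same way for other notations that Open Babel supports.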
Bad news: Google Code Search is going to be closed soon: http://googleblog.blogspot.com/2011/10/fall-sweep.html