Regular Expression / Similar Code-Pieces For Bioinformatics? (E.G. Regex For SMILES, InChI, InChIKey)
3
3
13.1 years ago
Lo Sauer ▴ 160

Dear all,

  1. I would be thankful for pointers to regexes matching SMILES, InChI... (and other text-based notations of chemical structure). I started with SMILES; now I am moving on to InChI.
  2. Do you know of a good regular expression library website for bioinformatics (other than the regexlib website)?

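Here's a PREG-style regex for SMILES validation (JavaScript), intended for inputs longer than 5 characters (as written, it requires at least 7: one leading character plus six or more from the set):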

x.trim().match(/^([^J][0-9BCOHNSOPrIFla@+\-\[\]\(\)\\=#$]{6,})$/ig)

(generic:)

/^([^J][A-Za-z0-9@+\-\[\]\(\)\\=#$]+)$/
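For illustration, the two patterns above could be wrapped as small predicates. This is only a sketch: the names `looksLikeSmilesStrict` and `looksLikeSmilesGeneric` are made up here, and the `g` flag is dropped because `test()` on a global regex keeps state between calls.

```javascript
// Sketch only: the function names are hypothetical. The patterns are the
// two from the post, minus the `g` flag (test() on a global regex is
// stateful through lastIndex).
const STRICT = /^([^J][0-9BCOHNSOPrIFla@+\-\[\]\(\)\\=#$]{6,})$/i;
const GENERIC = /^([^J][A-Za-z0-9@+\-\[\]\(\)\\=#$]+)$/;

function looksLikeSmilesStrict(x) {
  return STRICT.test(x.trim());
}

function looksLikeSmilesGeneric(x) {
  return GENERIC.test(x.trim());
}

// Aspirin, ethanol, sodium chloride, and a stereo-annotated difluoroethene.
const samples = ["CC(=O)Oc1ccccc1C(=O)O", "CCO", "[Na+].[Cl-]", "F/C=C/F"];
for (const s of samples) {
  console.log(s, looksLikeSmilesStrict(s), looksLikeSmilesGeneric(s));
}
```

Neither pattern accepts `.` (disconnected fragments) or `/` (double-bond stereo), so valid SMILES such as `[Na+].[Cl-]` and `F/C=C/F` are rejected by both.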

PS: It's not that I am not versed in regex; it just feels senseless to reinvent the wheel over and over again.

Searching for pieces of code (sourceforge, bitbucket, github, code.google.com) is still a challenge. Am I alone in this?

PPS: The letter "J" does not appear in any element symbol on the Periodic Table (today the same holds for "Q").

See:

Upd: Added gist

code search chemoinformatics • 7.8k views
1

It sounds like you might be re-inventing the wheel. I agree with Michael's answer; an added benefit is that you automatically support any file type that CDK or OpenBabel supports. For example, if you just want to check that the input is indeed a molecule, you could check that OpenBabel can convert it.

0

Bad news: Google Code Search is going to be closed soon: http://googleblog.blogspot.com/2011/10/fall-sweep.html

3
13.1 years ago

Even if a regex decides that what you have looks like a, say, SMILES string, it could still be garbage from a chemical point of view, no? You could test if tools like CDK or OpenBabel can parse it, and if it's parsable, you know what it is...
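To make the point concrete, here is a small sketch showing that a character whitelist accepts strings that are not valid SMILES. The helper `hasBalancedBrackets` is hypothetical and only checks nesting of `( )` and `[ ]`; it is not a substitute for letting a toolkit parse the input.

```javascript
// Generic pattern from the question (a character whitelist, nothing more).
const GENERIC = /^([^J][A-Za-z0-9@+\-\[\]\(\)\\=#$]+)$/;

// Hypothetical helper: checks only that ( ) and [ ] are properly nested.
// This is still far from chemical validation.
function hasBalancedBrackets(s) {
  const closers = { ")": "(", "]": "[" };
  const stack = [];
  for (const ch of s) {
    if (ch === "(" || ch === "[") {
      stack.push(ch);
    } else if (ch === ")" || ch === "]") {
      // A closer must match the most recent unclosed opener.
      if (stack.pop() !== closers[ch]) return false;
    }
  }
  return stack.length === 0;
}

for (const s of ["CC(=O)O", "C((((", "C=#=#C"]) {
  console.log(s, GENERIC.test(s), hasBalancedBrackets(s));
}
```

`C((((` passes the regex but fails the bracket check, while `C=#=#C` passes both even though consecutive bond symbols are not valid SMILES. Each extra check narrows the gap a little, but only a real parser closes it.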

0

It is for input prediction. As such, I wouldn't actually use a + quantifier but give some leeway, e.g. starting with {6,} and restricting the character set to COHNSOFla...
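A rough sketch of what such input prediction might look like, extended to the other notations named in the question. Everything here is an assumption for illustration: `guessNotation` is a hypothetical name, the InChI check only looks at the version-1 prefix, and the InChIKey check only looks at the 14-10-1 layout of uppercase letters.

```javascript
// Sketch only: `guessNotation` and the constant names are hypothetical.
// InChIKey layout: 14 uppercase letters, 10 uppercase letters, 1 letter.
const INCHIKEY = /^[A-Z]{14}-[A-Z]{10}-[A-Z]$/;
// Assumption: version-1 InChI strings, starting "InChI=1/" or "InChI=1S/".
const INCHI = /^InChI=1S?\//;
// Restricted SMILES character set with the {6,} quantifier from the question.
const SMILES = /^([^J][0-9BCOHNSOPrIFla@+\-\[\]\(\)\\=#$]{6,})$/i;

function guessNotation(input) {
  const x = input.trim();
  // Most specific formats first, the loose SMILES heuristic last.
  if (INCHIKEY.test(x)) return "inchikey";
  if (INCHI.test(x)) return "inchi";
  if (SMILES.test(x)) return "smiles";
  return null;
}

console.log(guessNotation("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"));
console.log(guessNotation("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"));
console.log(guessNotation("CC(=O)Oc1ccccc1C(=O)O"));
```

The checks run from the most rigid format to the loosest, so that the permissive SMILES heuristic only sees input the other two have already declined.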

2
13.1 years ago
Iain ▴ 260

You could check out the Blue Obelisk website.

http://blueobelisk.shapado.com/

"The Blue Obelisk Exchange is the place to ask about the use and development of Open Data, Open Source, and Open Standards: how to perform tasks and solve chemical problems with these, or if an ODOSOS tools is available for some task. Or even to ask if someone can provide such a tool. The questions do not require to be about Blue Obelisk solutions itself; they can be about any ODOSOS chemistry tool, service, or database."

0

It looks interesting, with the caveat of a small user base.

0

True, but it is a very responsive group.

1
13.1 years ago

Check the summary and code from a Coding Dojo on parsing SMILES, organized in Barcelona:

It's mostly a set of tests and the infrastructure of a function, but it can get you started. In any case, I guess that unit testing can be useful when coding parsers for complex strings like SMILES.
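As a rough idea of what table-driven tests for such a function could look like (this is not the Dojo's code; `looksLikeSmiles` and `runCases` are hypothetical names, and the function under test is just the generic regex from the question):

```javascript
// Hypothetical function under test: the generic pattern from the question.
function looksLikeSmiles(s) {
  return /^([^J][A-Za-z0-9@+\-\[\]\(\)\\=#$]+)$/.test(s.trim());
}

// Each case pairs an input with the verdict we expect from the function.
const cases = [
  { input: "CCO", expected: true },
  { input: "c1ccccc1", expected: true },
  { input: "JCCO", expected: false },
  { input: "", expected: false },
];

// Returns the cases whose actual verdict differs from the expected one.
function runCases(fn, table) {
  return table.filter(({ input, expected }) => fn(input) !== expected);
}

const failures = runCases(looksLikeSmiles, cases);
console.log(failures.length === 0 ? "all cases pass" : failures);
```

Growing the table one case at a time makes it easy to see which part of the notation a parser or pattern does not yet handle.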

0

Interesting, but at first glance it seems to be more of a fun implementation than a complete-spec SMILES parser. Thanks.
