Given a presumed DNA sequence, how can I check whether it is actually a valid sequence using the Bio* toolkits (BioRuby, Biopython, BioPerl, Bioconductor, BioJava, etc.)? For example:
'actgtactgatcga'
is a valid sequence, because it contains only valid nucleotide letters. But
'actgtactgzzoootcga'
is not a valid sequence (because of the letters zzooo).
For the sake of reference, this is the way I am currently parsing in BioSmalltalk:
#dnaSequence asParser matches: 'gtgacttagcgacttagc'
Here are the IUPAC codes for possible letters in DNA sequences: http://www.bioinformatics.org/sms/iupac.html
You can just use a regex to check if any letters other than the IUPAC letters appear in your string.
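As a rough illustration of that suggestion, here is a minimal Python sketch using only the standard library. The letter set (the four bases plus the IUPAC ambiguity codes, with gap characters left out) and the function name is_valid_dna are my own assumptions, not something taken from any Bio* toolkit.

```python
import re

# Assumption: the four bases plus the IUPAC ambiguity codes;
# gap characters ('.' and '-') are deliberately excluded.
_NON_IUPAC = re.compile(r"[^ACGTRYSWKMBDHVN]")

def is_valid_dna(seq: str) -> bool:
    """Return True if seq is non-empty and has no character outside the IUPAC set."""
    # Upper-case first so that lower-case input is accepted as well.
    return bool(seq) and _NON_IUPAC.search(seq.upper()) is None

print(is_valid_dna("actgtactgatcga"))      # True
print(is_valid_dna("actgtactgzzoootcga"))  # False
```

If only unambiguous bases should pass, the character class could be narrowed to ACGT.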
Thanks, Damian. Regular expressions are a form of declarative programming, which is not appropriate for my current context. Besides, regexes are really hard to debug, it is too easy to make mistakes, you have to compile them, and matching may backtrack, etc.
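For a context where regular expressions are unwanted, the same check can probably be done with a plain set-membership test. The sketch below is in Python with the standard library only. The names IUPAC_DNA_LETTERS, invalid_letters and is_valid_dna_plain are hypothetical, and the letter set (IUPAC codes without gap characters) is an assumption.

```python
# Assumption: the four bases plus the IUPAC ambiguity codes, no gap characters.
IUPAC_DNA_LETTERS = frozenset("ACGTRYSWKMBDHVN")

def invalid_letters(seq: str) -> list:
    """Return the distinct upper-cased characters of seq that are not IUPAC DNA codes."""
    return sorted(set(seq.upper()) - IUPAC_DNA_LETTERS)

def is_valid_dna_plain(seq: str) -> bool:
    """Return True if seq is non-empty and every character is an IUPAC DNA code."""
    return bool(seq) and not invalid_letters(seq)

print(is_valid_dna_plain("actgtactgatcga"))   # True
print(invalid_letters("actgtactgzzoootcga"))  # ['O', 'Z']
```

Returning the offending letters, rather than only a boolean, makes it easier to see why a sequence was rejected.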