I have an amino acid pattern, DIGD, that I would like to capture at the RNA/DNA level. Each of these amino acids is coded for by more than one codon. Thus, to capture the pattern at the DNA/RNA level you need to match either GAU or GAC, followed by AUU, AUC or AUA, followed by GGU, GGC, GGA or GGG, and so on.
I have a quick regular expression which, given the mRNA, should capture these patterns.
(gau|gac)?(auu|auc|aua)?(ggu|ggc|gga|ggg)?(gau|gac)
But it just does not feel right! Am I missing something? Does someone have an improved approach?
Here is a second attempt:
((?:gau|gac)(?:auu|auc|aua)(?:ggu|ggc|gga|ggg)(?:gau|gac))
Example strings to match:
gauauaggagauaucguuagaggaaaagaucuauuuuaugguaauacacaugaaaguaa
gaauauauaugaaggauugucgaacaaugguguaaaagcucgcuacgaaggugauacug
gcgccacaguauggaaggcuaucacauguaaagcuaaggaagcugauaaauauuuuaga
gcgaagauacugcggcacauaaucgaaauagguugggggaucgguauuuggauug
The last string should not match.
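A minimal Python sketch of the idea described above: build the pattern from a small codon lookup instead of typing it by hand, then run it over the example strings. The names `CODONS` and `peptide_to_regex` are made up for illustration, and only the codons needed for D, I and G (standard genetic code, RNA alphabet) are listed.

```python
import re

# Codons for the residues in DIGD (standard genetic code, RNA alphabet).
# Only the three amino acids needed here are listed; extend as required.
CODONS = {
    "D": ["gau", "gac"],
    "I": ["auu", "auc", "aua"],
    "G": ["ggu", "ggc", "gga", "ggg"],
}


def peptide_to_regex(peptide):
    """Return a regex string with one capturing group per residue."""
    return "".join("(" + "|".join(CODONS[aa]) + ")" for aa in peptide)


# No '?' after the groups: every codon is required, none is optional.
pattern = re.compile(peptide_to_regex("DIGD"), re.IGNORECASE)

examples = [
    "gauauaggagauaucguuagaggaaaagaucuauuuuaugguaauacacaugaaaguaa",
    "gaauauauaugaaggauugucgaacaaugguguaaaagcucgcuacgaaggugauacug",
    "gcgccacaguauggaaggcuaucacauguaaagcuaaggaagcugauaaauauuuuaga",
    "gcgaagauacugcggcacauaaucgaaauagguugggggaucgguauuuggauug",
]

for seq in examples:
    m = pattern.search(seq)
    if m:
        # m.groups() holds the individual codons that were used.
        print(m.start(), m.group(0), m.groups())
    else:
        print("no match")
```

Note that `search` ignores the reading frame: it reports the motif wherever the twelve nucleotides occur, so the start position has to be checked against the coding frame separately if that matters.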
Try to give an example of the strings you would like to match, and of those you would like to avoid matching.
I tried
GA[uc]AU[UCA]GG[UCAG]G(AC|GU)
but it doesn't even match your first example. Are you sure of your pattern? Your pattern isn't in any of your examples.
Sorry Pierre, to keep it simple let's assume everything is in lowercase. I am more interested in knowing the correct regular expression than in performing the match.
Your example is still not complete... you should include an example of the output you expect after the regex match. And why is the last example wrong?
@biorelated I tested my pattern with the '-i' option of egrep, so upper/lower case has no importance here.
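For comparison, here is a rough Python equivalent of a case-insensitive character-class pattern. One assumption is made: the final codon is written `ga[uc]` (Asp). The trailing `G(AC|GU)` in the pattern tried above accepts only gac or ggu as the last codon, so a closing gau is rejected.

```python
import re

# Compact character-class form of the DIGD codon pattern.
# The last codon is ga[uc] (Asp); this is an assumption about the intended pattern.
# re.IGNORECASE plays the role of egrep's -i option.
compact = re.compile(r"ga[uc]au[uca]gg[ucag]ga[uc]", re.IGNORECASE)

first_example = "gauauaggagauaucguuagaggaaaagaucuauuuuaugguaauacacaugaaaguaa"

hit = compact.search(first_example)
print(hit.group(0) if hit else "no match")
```

Character classes and alternations describe the same set of codons here; the class form is just shorter to type and read.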
Note that regular expressions may have exponentially(!) degrading performance on certain inputs (unless using an engine designed to avoid it) - therefore you might want to investigate just translating to protein sequence and matching that. I learned this the hard way, here is a writeup.
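A sketch of the translate-then-match idea in Python, under these assumptions: the standard genetic code, forward strand only, and hypothetical helper names (`translate`, `find_motif_codons`). Because each amino acid corresponds to exactly three nucleotides, a hit in the protein string maps straight back to the codons that produced it, so the coding sequence is still recovered.

```python
from itertools import product

# Standard genetic code, laid out in the usual U, C, A, G order for each position.
BASES = "ucag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna, frame):
    """Translate one forward reading frame; unknown codons become 'X'."""
    rna = rna.lower()
    return "".join(
        CODON_TABLE.get(rna[i:i + 3], "X")
        for i in range(frame, len(rna) - 2, 3)
    )


def find_motif_codons(rna, motif="DIGD"):
    """Return (nucleotide_start, codon_string) for every in-frame motif hit."""
    hits = []
    for frame in range(3):
        protein = translate(rna, frame)
        start = protein.find(motif)
        while start != -1:
            nt = frame + 3 * start  # map protein index back to the RNA
            hits.append((nt, rna[nt:nt + 3 * len(motif)]))
            start = protein.find(motif, start + 1)
    return hits


print(find_motif_codons(
    "gauauaggagauaucguuagaggaaaagaucuauuuuaugguaauacacaugaaaguaa"
))
```

Plain substring search on the translated sequence has no backtracking, which sidesteps the performance concern, and scanning all three frames separately keeps each hit tied to a definite reading frame.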
Yeah. The interest is actually in the coding sequence more than the protein. I could have matched the translated sequence, but I wanted to capture the actual codons rather than the resulting peptide. Thanks for the link.