Regular Expression To Match A Codon Pattern
13.9 years ago
hadasa ★ 1.0k

I have an amino acid pattern, DIGD, that I would like to capture at the RNA/DNA level. Each of these amino acids is coded for by more than one codon. Thus, to capture the pattern at the RNA/DNA level, you need to match either GAU or GAC, followed by AUU, AUC or AUA, followed by GGU, GGC, GGA or GGG, and so on.

I have a quick regular expression which, given the mRNA, should capture these patterns:

(gau|gac)?(auu|auc|aua)?(ggu|ggc|gga|ggg)?(gau|gac)

but it just does not feel right! Am I missing something? Does someone have an improved approach?

Here is a second attempt:

((:?gau|gac)(:?auu|auc|aua)(:?ggu|ggc|gga|ggg)(:?gau|gac))

Example strings to test against:

gauauaggagauaucguuagaggaaaagaucuauuuuaugguaauacacaugaaaguaa

gaauauauaugaaggauugucgaacaaugguguaaaagcucgcuacgaaggugauacug

gcgccacaguauggaaggcuaucacauguaaagcuaaggaagcugauaaauauuuuaga

gcgaagauacugcggcacauaaucgaaauagguugggggaucgguauuuggauug

The last string should not match
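
The codon alternatives listed in the question can also be derived from the peptide motif rather than typed by hand. The following is only a minimal sketch, assuming the standard genetic code and lowercase mRNA; the names CODON_TABLE and motif_to_regex are made up for this illustration and are not part of any library.

```python
import re
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order
# at each position (assumption: no alternative translation table is needed).
BASES = "ucag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def motif_to_regex(motif):
    """Return a regex string matching any lowercase mRNA that codes for motif."""
    parts = []
    for aa in motif.upper():
        # Collect every codon that translates to this amino acid.
        codons = sorted(c for c, a in CODON_TABLE.items() if a == aa)
        if not codons:
            raise ValueError(f"unknown amino acid: {aa!r}")
        # Non-capturing group, since only the overall match is of interest.
        parts.append("(?:" + "|".join(codons) + ")")
    return "".join(parts)


pattern = re.compile(motif_to_regex("DIGD"))
print(pattern.pattern)
```

Generating the pattern this way avoids transcription slips such as typing the wrong codon for the fourth residue.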


Try to give examples of strings you would like to match, and of strings you would like to avoid matching.


I tried GA[uc]AU[UCA]GG[UCAG]G(AC|GU) but it doesn't even match your first example. Are you sure of your pattern?


Your pattern isn't in any of your examples.


Sorry Pierre, to keep it simple let's assume everything is in lowercase. I'm more interested in knowing the correct regular expression than in performing the match.


Your example is still not complete: you should also show the output you expect after the regex match. And why is the last example wrong?


@biorelated I tested my pattern with the '-i' option of egrep, so upper/lower case makes no difference here.


Note that regular expressions may have exponentially(!) degrading performance on certain inputs (unless using an engine designed to avoid it) - therefore you might want to investigate just translating to protein sequence and matching that. I learned this the hard way, here is a writeup.
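
The translate-then-match idea can be sketched roughly as follows. This is an illustration only, assuming the standard genetic code and forward reading frames; translate and find_motif are hypothetical helper names, and each hit is mapped back to a nucleotide index so the coding sequence can still be recovered.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order
# (assumption: the standard table applies to the sequences at hand).
BASES = "ucag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    rna = rna.lower()
    return "".join(
        CODON_TABLE.get(rna[i:i + 3], "X") for i in range(0, len(rna) - 2, 3)
    )


def find_motif(rna, motif="DIGD"):
    """Return sorted 0-based nucleotide starts of motif in the three forward frames."""
    hits = []
    for frame in range(3):
        protein = translate(rna[frame:])
        start = protein.find(motif)
        while start != -1:
            # Convert the residue index back to a nucleotide index.
            hits.append(frame + 3 * start)
            start = protein.find(motif, start + 1)
    return sorted(hits)


print(find_motif("agauauuggugau"))
```

Plain substring search on the translated sequence involves no backtracking, at the cost of translating each frame once.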


Yeah. The interest is actually in the coding sequence more than the protein. I could have matched the translated sequence, but I wanted to capture the actual codons rather than the resulting peptide. Thanks for the link.

13.9 years ago

Oh wait, I get it, your first and second attempts in the question don't match one another. Codon 1 and codon 4 are the same?

EDIT: You really don't need to include the whole codon in your variable match, as only one or two bases vary. The following pattern works just fine.

import re

# p = re.compile('ga(?:u|c)au(?:u|c|a)gg(?:u|c|a|g)ga(?:u|c)')
# Changed to character classes for readability
p = re.compile('ga[uc]au[uca]gg[ucag]ga[uc]')
strings = ['gauauaggagauaucguuagaggaaaagaucuauuuuaugguaauacacaugaaaguaa',
 'gaauauauaugaaggauugucgaacaaugguguaaaagcucgcuacgaaggugauacug',
 'gcgccacaguauggaaggcuaucacauguaaagcuaaggaagcugauaaauauuuuaga',
 'gcgaagauacugcggcacauaaucgaaauagguugggggaucgguauuuggauug',
 'gauauuggugau']

for string in strings:
    print(p.search(string))

Produces

<_sre.SRE_Match object at 0x24fde0>
None
None
None
<_sre.SRE_Match object at 0x24fde0>

As Giovanni says, checks for case and sequence validity would be a good addition, but are probably out of the scope of your question.
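
For completeness, the case and validity checks could look something like the sketch below. It assumes that only the letters A, C, G and U count as valid and that surrounding whitespace should be ignored; search_checked is a made-up helper name, not an existing function.

```python
import re

# Same pattern as above, but insensitive to upper/lower case.
MOTIF = re.compile(r"ga[uc]au[uca]gg[ucag]ga[uc]", re.IGNORECASE)
# Assumption: a valid sequence consists only of A, C, G and U.
VALID_RNA = re.compile(r"[acgu]+", re.IGNORECASE)


def search_checked(seq):
    """Validate seq as RNA, then search it for the DIGD codon pattern."""
    seq = seq.strip()  # tolerate leading/trailing whitespace
    if not VALID_RNA.fullmatch(seq):
        raise ValueError("sequence contains characters other than A, C, G, U")
    return MOTIF.search(seq)


print(search_checked(" GAUAUUgguGAU ") is not None)
```

A DNA sequence containing T would be rejected here; whether T should instead be treated as U is a separate decision.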


Thanks a lot. Do you know of any tools that quickly generate regexps or can be used to validate them? :)

13.9 years ago

First you have to write some examples of sequences that should match your regex, and also some sequences that should not:

>>> good_ones = ('gauauugguggu',
...              'gacauuggcggu',
...              'gacauagggggu',
...              )
>>> bad_ones = ('gaugauauugguggu',             # repeated codon
...             'auuggcgcu',                   # this case misses one codon
...             'gagauagggggu',                # wrong sequence
...             'acauuggcggu',                 # this case misses one nucleotide
...             )
>>> difficult_ones = ('GAUAUUgguggu',          # lower/upper case matters?
...                   'gacattggcggt',          # T == U?
...                   ' gacauagggggu ',        # leading spaces?
...                   'ACGAGCgacauaggggguAGCTCGATCG',  # what if the sequence is within another sequence?
...                   )

Are these examples correct? You should decide what you want to do in the latter examples: do you want to distinguish between lower and upper case? Between Us and Ts? What if the sequence has a leading or trailing space?

Then, the regex you wrote is almost correct, but you can use a non-capturing group, (?:...), since you are not interested in capturing the individual codons. Moreover, you should

>>> import re
>>> m = re.compile("^(?:gau|gac)(?:auu|auc|aua)(?:ggu|ggc|gga|ggg)(?:ggu|gac)$")
>>> for seq in good_ones: m.search(seq)
<_sre.SRE_Match object at 0x1666f38>
<_sre.SRE_Match object at 0x16665e0>
<_sre.SRE_Match object at 0x1666e68>

>>> for seq in bad_ones: m.search(seq)

>>> for seq in difficult_ones: m.search(seq)
???

Within the last set of codons, the first codon should be gau|gac, not ggu|gac :) Thanks.


I'm more interested in the correctness of the regular expression, since I want to check the indexes where the matches start :) Thanks!
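
One possible way to collect the start indexes is re.finditer. The sketch below assumes lowercase mRNA and wraps the pattern in a lookahead so that overlapping occurrences are reported as well (the motif begins and ends with an Asp codon, so two hits can share one); match_starts is a made-up name.

```python
import re

# Zero-width lookahead: the scan advances one base at a time, so a hit
# that overlaps the previous one is still found.
MOTIF = re.compile(r"(?=ga[uc]au[uca]gg[ucag]ga[uc])")


def match_starts(seq):
    """Return the 0-based start index of every DIGD-coding stretch in seq."""
    return [m.start() for m in MOTIF.finditer(seq)]


# Two occurrences sharing the middle 'gau' codon.
print(match_starts("gauauuggugauauuggugau"))  # [0, 9]
```

Without the lookahead, finditer resumes after the end of each match and would report only the first of the two overlapping hits.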

