Question

Regular Expression To Match A Codon Pattern

3

14.5 years ago

hadasa ★ 1.0k

I have an amino acid pattern, DIGD, that I would like to capture at the RNA/DNA level. Each of these amino acids is coded for by more than one codon. Thus, to capture the pattern at the DNA/RNA level, you need to match either GAU or GAC, followed by AUU, AUC, or AUA, followed by GGU, GGC, GGA, or GGG, and so on.

I have a quick regular expression which, given the mRNA, should capture these patterns.

(gau|gac)?(auu|auc|aua)?(ggu|ggc|gga|ggg)?(gau|gac)

but it just does not feel right! Am I missing something? Does someone have an improved approach?

Here is a second attempt

((:?gau|gac)(:?auu|auc|aua)(:?ggu|ggc|gga|ggg)(:?gau|gac))

Example strings to match:

gauauaggagauaucguuagaggaaaagaucuauuuuaugguaauacacaugaaaguaa

gaauauauaugaaggauugucgaacaaugguguaaaagcucgcuacgaaggugauacug

gcgccacaguauggaaggcuaucacauguaaagcuaaggaagcugauaaauauuuuaga

gcgaagauacugcggcacauaaucgaaauagguugggggaucgguauuuggauug

The last string should not match

codon • 8.4k views

ADD COMMENT • link updated 14.5 years ago by Simon Cockell 7.4k • written 14.5 years ago by hadasa ★ 1.0k

1

Try to give examples of strings you would like to match, and of strings you would like to avoid matching.

ADD REPLY • link 14.5 years ago by Giovanni M Dall'Olio 28k

1

I tried GA[uc]AU[UCA]GG[UCAG]G(AC|GU) but it doesn't even match your first example. Are you sure of your pattern?

ADD REPLY • link updated 5.9 years ago by Ram 45k • written 14.5 years ago by Pierre Lindenbaum 166k

1

Your pattern isn't in any of your examples.

ADD REPLY • link 14.5 years ago by Simon Cockell 7.4k

0

Sorry Pierre, to make it simple let's assume everything is in lowercase. I am more interested in knowing the correct regular expression than in performing the match.

ADD REPLY • link 14.5 years ago by hadasa ★ 1.0k

0

Your example is not yet complete: you should give an example of the output you expect after the regex match. And why is the last example wrong?

ADD REPLY • link 14.5 years ago by Giovanni M Dall'Olio 28k

0

@biorelated I tested my pattern with the '-i' option of egrep, so upper/lower case does not matter here.

ADD REPLY • link 14.5 years ago by Pierre Lindenbaum 166k

0

Note that regular expressions may have exponentially(!) degrading performance on certain inputs (unless using an engine designed to avoid it) - therefore you might want to investigate just translating to protein sequence and matching that. I learned this the hard way, here is a writeup.
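
A minimal sketch of the translate-then-match idea that still reports the actual codons, assuming a known reading frame and the standard genetic code. The names `CODON_TABLE` and `find_peptide_codons` are made up for this illustration; they are not from any library.

```python
from itertools import product

# Standard genetic code, built from the conventional U, C, A, G ordering.
BASES = "ucag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def find_peptide_codons(mrna, peptide, frame=0):
    """Yield (start, codons) for each in-frame occurrence of `peptide`."""
    seq = mrna.lower()
    # Split into complete codons starting at the given frame offset.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    # Unknown codons (e.g. containing 'n') become 'X' and never match.
    protein = "".join(CODON_TABLE.get(c, "X") for c in codons)
    pos = protein.find(peptide)
    while pos != -1:
        start = frame + 3 * pos  # map protein index back to nucleotides
        yield start, seq[start:start + 3 * len(peptide)]
        pos = protein.find(peptide, pos + 1)

for start, codons in find_peptide_codons("gauauaggagauaucguuagag", "DIGD"):
    print(start, codons)
```

Plain string search on the translation avoids regex backtracking entirely, and only in-frame hits are reported; to scan all three forward frames, call it with `frame=0`, `1` and `2`.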

ADD REPLY • link updated 5.9 years ago by Ram 45k • written 14.5 years ago by Istvan Albert 102k

0

Yeah. The interest is actually in the coding sequence more than the protein. I could have matched the translated sequence, but I wanted to capture the actual codons rather than the resulting peptide. Thanks for the link.

ADD REPLY • link 14.5 years ago by hadasa ★ 1.0k

Ram · Answer 1 · 2011-01-24

Oh wait, I get it, your first and second attempts in the question don't match one another. Codon 1 and codon 4 are the same?

EDIT: You really don't need to include the whole codon in your variable match, as only one or two bases vary. The following pattern works just fine.

import re

# p = re.compile('ga(?:u|c)au(?:u|c|a)gg(?:u|c|a|g)ga(?:u|c)')
# changed for readability: character classes instead of alternations
p = re.compile('ga[uc]au[uca]gg[ucag]ga[uc]')
strings = ['gauauaggagauaucguuagaggaaaagaucuauuuuaugguaauacacaugaaaguaa',
 'gaauauauaugaaggauugucgaacaaugguguaaaagcucgcuacgaaggugauacug',
 'gcgccacaguauggaaggcuaucacauguaaagcuaaggaagcugauaaauauuuuaga',
 'gcgaagauacugcggcacauaaucgaaauagguugggggaucgguauuuggauug',
 'gauauuggugau']

for string in strings:
    print(p.search(string))

Produces

<_sre.SRE_Match object at 0x24fde0>
None
None
None
<_sre.SRE_Match object at 0x24fde0>

As Giovanni says, checks for case and sequence validity would be a good addition, but are probably out of the scope of your question.
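
If you need this for more than one peptide, the pattern can be generated instead of typed by hand. A rough sketch, assuming lowercase RNA input; the `CODONS` table below only covers the three residues in DIGD and `peptide_to_regex` is a hypothetical helper name, so extend the table for other peptides.

```python
import re

# Codons (lowercase RNA) for the residues used in the question only.
CODONS = {
    "D": ["gau", "gac"],
    "I": ["auu", "auc", "aua"],
    "G": ["ggu", "ggc", "gga", "ggg"],
}

def peptide_to_regex(peptide):
    """Return a compiled regex matching any codon spelling of `peptide`."""
    # One non-capturing alternation per residue, concatenated in order.
    parts = ["(?:" + "|".join(CODONS[aa]) + ")" for aa in peptide.upper()]
    return re.compile("".join(parts))

pattern = peptide_to_regex("DIGD")
print(pattern.pattern)

# finditer gives the position and the matched codons, not just a match object.
for m in pattern.finditer("gauauaggagauaucguuagag"):
    print(m.start(), m.group())
```

Note that this, like the hand-written pattern, matches in any reading frame; if the frame matters, keep only hits where `m.start()` is a multiple of 3 relative to the start of the coding sequence.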

Ram · Answer 2 · 2011-01-24

First you have to write some examples of sequences that should match your regex, and also some sequences that should not:

>>> good_ones = ('gauauugguggu',
...              'gacauuggcggu',
...              'gacauagggggu',
...              )
>>> bad_ones = ('gaugauauugguggu',   # repeated codon
...             'auuggcgcu',         # this case misses one codon
...             'gagauagggggu',      # wrong sequence
...             'acauuggcggu',       # this case misses one nucleotide
...             )
>>> difficult_ones = ('GAUAUUgguggu',                  # lower/upper case matters?
...                   'gacattggcggt',                  # T == U?
...                   ' gacauagggggu ',                # leading spaces?
...                   'ACGAGCgacauaggggguAGCTCGATCG',  # what if the sequence is within another sequence?
...                   )

Are these examples correct? You should decide what you want to do in the last set of examples: do you want to distinguish between lower and upper case? Between Us and Ts? What if the sequence has leading or trailing spaces?

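Then, the regex you wrote is almost correct, but you can use the non-capturing group (?:...) since you are not interested in making groups. Note the order of the characters: (:?gau|gac) is a capturing group that matches an optional colon followed by gau, or gac, which is not what you meant. Moreover, you should anchor the pattern with ^ and $ if the whole string has to be exactly the four codons: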

>>> import re
>>> m = re.compile("^(?:gau|gac)(?:auu|auc|aua)(?:ggu|ggc|gga|ggg)(?:ggu|gac)$")
>>> for seq in good_ones: m.search(seq)
<_sre.SRE_Match object at 0x1666f38>
<_sre.SRE_Match object at 0x16665e0>
<_sre.SRE_Match object at 0x1666e68>

>>> for seq in bad_ones: m.search(seq)

>>> for seq in difficult_ones: m.search(seq)
???
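
One possible way to handle the difficult cases, sketched under the assumption that case, T versus U, and surrounding whitespace should all be ignored. The codon alternation is copied from the pattern above without the anchors; `normalize` and `classify` are names invented for this example.

```python
import re

# Same codon alternation as the pattern above, without the ^ and $ anchors.
CODONS = re.compile("(?:gau|gac)(?:auu|auc|aua)(?:ggu|ggc|gga|ggg)(?:ggu|gac)")

def normalize(seq):
    """Strip whitespace, lowercase, and read DNA 't' as RNA 'u'."""
    return seq.strip().lower().replace("t", "u")

def classify(seq):
    """Return 'exact', 'embedded' or 'none' for a raw input string."""
    clean = normalize(seq)
    if CODONS.fullmatch(clean):   # the whole string is the motif
        return "exact"
    if CODONS.search(clean):      # the motif occurs inside a longer sequence
        return "embedded"
    return "none"

difficult_ones = ('GAUAUUgguggu',
                  'gacattggcggt',
                  ' gacauagggggu ',
                  'ACGAGCgacauaggggguAGCTCGATCG')

for seq in difficult_ones:
    print(repr(seq), classify(seq))
```

Normalizing the input first keeps the regex itself simple; the alternative is to compile with `re.IGNORECASE` and write `[ut]` wherever a `u` appears.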