Question

Python's Regular Expressions To Find All Instances Of A Codon In A Sequence

1

11.8 years ago

nobodyknowsme57 ▴ 10

I am trying to find all instances of a specific codon in a given gene or RNA sequence using Python's regular expressions. The findall function seems able to do the job; the problem is that it needs to match codons, not any three consecutive letters, which may straddle two codons. Here's an example:

>>> seq='CTCTTACTT'
>>> import re
>>> re.findall(r'CTT',seq)
['CTT', 'CTT']

The first CTT that it finds does not correspond to a codon (CT**CTT**ACTT), since the sequence consists of only three codons: CTC, TTA, CTT.

Obviously, the most straightforward way is to use a loop, extract each codon from the sequence and compare it with CTT (the codon we are searching for), but I am looking for a smarter way of doing so.

programming python sequence • 8.7k views

ADD COMMENT • link updated 7.6 years ago by Biostar 20 • written 11.8 years ago by nobodyknowsme57 ▴ 10

score 2 · Answer 1 · 2013-01-23

11.8 years ago

Ashutosh Pandey 12k

Try [m.start() for m in re.finditer('(?=TTT)', 'CTTTTTGTA')]. Because the lookahead (?=TTT) is zero-width, it returns overlapping instances as well. In this case: [1, 2, 3]

score 1 · Answer 2 · 2013-01-23

There will certainly be a more "clever" way of doing it, but for mine, this approach with a generator function and list comprehension gets the job done while being readable:

def codons(seq, frame):
    """Generator function that yields DNA in one-codon blocks.

    Yields tuples of (codon, position relative to start).
    Note: the reading frame is 1-based; the nucleotide position is 0-based.
    """
    start = frame - 1
    while start + 3 <= len(seq):
        yield seq[start:start + 3], start
        start += 3

test = 'CTCTTACTT'
CTT_positions = [p for (c, p) in codons(test,1) if c == 'CTT']

This sets CTT_positions to [6].
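
A regex-flavoured variant of the same idea, offered as a sketch rather than a definitive implementation: re.findall with the pattern '...' cuts the sequence into consecutive, non-overlapping triplets, and enumerate recovers each codon's position. The function name codon_positions and its frame parameter are illustrative assumptions, not part of any library.

```python
import re

def codon_positions(seq, codon, frame=1):
    """Return 0-based start positions of `codon` in a 1-based reading frame.

    Hypothetical helper for illustration only.
    """
    offset = frame - 1
    # '...' matches any three characters; findall returns consecutive,
    # non-overlapping triplets and drops a trailing partial codon.
    triplets = re.findall(r'...', seq[offset:])
    return [offset + 3 * i for i, c in enumerate(triplets) if c == codon]

print(codon_positions('CTCTTACTT', 'CTT'))  # [6]
```

Because the sequence is consumed three characters at a time, an out-of-frame CTT can never be reported. The sketch assumes the sequence contains no newlines, since '.' does not match them by default.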

score 1 · Answer 3 · 2013-01-23

11.8 years ago

Pappu ★ 2.1k

With re in Python, you have to spell out each pattern explicitly. Sometimes you might need to find a pattern that allows for indels or SNPs. In that case MUMmer and http://blog.theseed.org/servers/2010/07/scan-for-matches.html will be useful.

score 0 · Answer 4 · 2013-01-23

11.8 years ago

Ashutosh Pandey 12k

[m.start() for m in re.finditer('CTT', 'CTCTTACTT')] will return the start indices of all matches, in this case [2, 6]. You can then pick the second occurrence, as it falls in the reading frame: keep only the indices that are divisible by three (0 counts as well).

Thanks
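
The filtering step described in this answer might look like the following sketch. The function name in_frame_hits and the 1-based frame parameter are assumptions added for illustration. Note that re.finditer reports non-overlapping matches only, so in repetitive stretches an in-frame hit can be hidden by an out-of-frame match that starts just before it.

```python
import re

def in_frame_hits(seq, codon, frame=1):
    """Keep only match positions that fall on a codon boundary.

    Hypothetical helper; `frame` is 1-based, returned positions are 0-based.
    """
    offset = frame - 1
    starts = [m.start() for m in re.finditer(re.escape(codon), seq)]
    # A start is in frame when its distance from the frame offset
    # is a non-negative multiple of three.
    return [s for s in starts if s >= offset and (s - offset) % 3 == 0]

print(in_frame_hits('CTCTTACTT', 'CTT'))  # [6]
```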

score 0 · Answer 5 · 2013-01-23

11.8 years ago

nobodyknowsme57 ▴ 10

Using finditer is a good idea, but it will not always work, because it only finds non-overlapping matches. Here's an example showing why:

>>> [m.start() for m in re.finditer('TTT', 'CTTTTTGTA')]
[1]

The actual position of the codon of interest here is 3, not 1, because the sequence is composed of the following codons: CTT TTT GTA
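
One way around this, sketched under the assumption that the reading frame starts at index 0: wrap the codon in a zero-width lookahead so that every overlapping candidate is reported, then keep only the starts that are multiples of three.

```python
import re

seq = 'CTTTTTGTA'

# Plain pattern: matches consume characters, so only the first TTT is reported.
plain = [m.start() for m in re.finditer('TTT', seq)]

# Lookahead pattern: zero-width, so every overlapping start is reported.
overlapping = [m.start() for m in re.finditer('(?=TTT)', seq)]

# Keep only starts on a codon boundary (frame assumed to begin at index 0).
in_frame = [s for s in overlapping if s % 3 == 0]

print(plain)        # [1]
print(overlapping)  # [1, 2, 3]
print(in_frame)     # [3]
```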

0

Something like this should really appear as a comment under the answer it refers to - the "Answers" section is intended for, well, answers to the original question.

ADD REPLY • link 11.8 years ago by David W 4.9k