Python's Regular Expressions To Find All Instances Of A Codon In A Sequence
11.9 years ago

I am trying to find all instances of a specific codon in a given gene or RNA sequence using Python's regular expressions. The findall function seems able to do the job; however, the problem is that one needs to match codons, not just any three consecutive letters, which may not form a codon in the reading frame. Here's an example:

>>> seq='CTCTTACTT'
>>> import re
>>> re.findall(r'CTT',seq)
['CTT', 'CTT']

The first CTT that it finds does not correspond to a codon (CT**CTT**ACTT), since the given sequence consists of only three codons: CTC, TTA, CTT.

Obviously, the most straightforward way is to use a loop, extract each codon from the sequence and compare it with CTT (the codon we are searching for), but I am looking for a smarter way of doing so.

programming python sequence • 8.7k views
11.9 years ago

Try [m.start() for m in re.finditer('(?=TTT)', 'CTTTTTGTA')]. Because the pattern is a zero-width lookahead, it also returns overlapping instances. In this case: [1, 2, 3]
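To get from these overlapping hits to in-frame codons, one option is to keep only the start positions that fall on a codon boundary. A minimal sketch, assuming a 0-based frame offset; the function name in_frame_positions and its parameters are made up for illustration:

```python
import re

def in_frame_positions(seq, codon, frame=0):
    """Return 0-based starts of `codon` that lie in the given reading frame.

    `frame` is a hypothetical 0-based offset (0, 1 or 2) of the first codon.
    """
    # Zero-width lookahead so that overlapping occurrences are all reported.
    pattern = re.compile('(?=' + re.escape(codon) + ')')
    return [m.start() for m in pattern.finditer(seq)
            if m.start() >= frame and (m.start() - frame) % 3 == 0]

print(in_frame_positions('CTTTTTGTA', 'TTT'))  # [3]
print(in_frame_positions('CTCTTACTT', 'CTT'))  # [6]
```

The lookahead matters here: without it, a hit that is out of frame can consume characters belonging to an in-frame hit that starts one or two positions later.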

11.9 years ago
David W 4.9k

There will certainly be a more "clever" way of doing it, but for me, this approach with a generator function and list comprehension gets the job done while being readable:

def codons(seq, frame):
    """Generator function that yields DNA in one-codon blocks.

    Yields tuples of (codon, position relative to start).
    Note: the reading frame is 1-based; the nucleotide position is 0-based.
    """
    start = frame - 1
    # Stop before a trailing partial codon.
    while start + 3 <= len(seq):
        yield seq[start:start + 3], start
        start += 3

test = 'CTCTTACTT'
CTT_positions = [p for (c, p) in codons(test,1) if c == 'CTT']

This leaves CTT_positions equal to [6].
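If you would still like to lean on re for this, the same codon-by-codon walk can be written by letting a three-character pattern chunk the sequence. This is only a rough sketch: it assumes the sequence contains no newlines (since `.` does not match them), it keeps the 1-based frame convention of the generator above, and the name codon_hits is hypothetical:

```python
import re

# '...' consumes exactly three characters per match.
TRIPLET = re.compile('...')

def codon_hits(seq, codon, frame=1):
    """Return 0-based start positions where `codon` occurs in frame."""
    offset = frame - 1
    # Consecutive non-overlapping matches starting at `offset` are exactly
    # the codons of the chosen frame; a trailing partial codon never matches.
    return [m.start() for m in TRIPLET.finditer(seq, offset)
            if m.group() == codon]

print(codon_hits('CTCTTACTT', 'CTT'))  # [6]
```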

11.9 years ago
Pappu ★ 2.1k

With re in Python, you have to specify each pattern explicitly. Sometimes you might need to find a pattern with indels or SNPs. In that case MUMmer and http://blog.theseed.org/servers/2010/07/scan-for-matches.html will be useful.

11.9 years ago

[m.start() for m in re.finditer('CTT', 'CTCTTACTT')] will return all the indices. In this case it will return [2, 6]. Now you can pick the second occurrence, as it falls in the reading frame: keep only the indices that are divisible by three (including 0).

Thanks

11.9 years ago

Using finditer is a good idea, but it will not always work, because it only finds non-overlapping matches. Here's an example showing why:

>>> [m.start() for m in re.finditer('TTT', 'CTTTTTGTA')]
[1]

The actual position of the codon of interest here is 3, not 1, because the sequence is composed of the following codons: CTT TTT GTA
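For what it's worth, the frame check can also be pushed into the pattern itself so that 3 comes back directly. The sketch below is only one possible approach and rests on assumptions not stated in this thread: the reading frame starts at index 0, the codon is three characters long, the sequence has no newlines, and any trailing partial codon is trimmed first. The helper name in_frame_regex is made up for illustration. The lookahead requires that only whole triplets follow the match, which can only hold when the match itself starts on a codon boundary:

```python
import re

def in_frame_regex(seq, codon):
    """Return 0-based starts of `codon` in frame 0, using only a regex."""
    # Trim a trailing partial codon so the length is a multiple of three.
    trimmed = seq[:len(seq) - len(seq) % 3]
    # After the codon, only complete triplets may remain up to the end.
    pattern = re.compile(re.escape(codon) + r'(?=(?:.{3})*\Z)')
    return [m.start() for m in pattern.finditer(trimmed)]

print(in_frame_regex('CTTTTTGTA', 'TTT'))  # [3]
```

On long sequences the repeated lookahead to the end of the string can get slow, so filtering the indices by divisibility is probably the more practical choice.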


Something like this should really appear as a comment under the answer it refers to - the "Answers" section is intended for, well, answers to the original question.
