Is it possible to check if a sequence has a stop codon, without translating first?
2
0
Entering edit mode
8.7 years ago
uki ▴ 40

I recently started playing around with NGS data and Biopython, so this question might come as a bit of no-brainer to some but I havent quite figured it out on my own.

I realize that it's pretty straight forward to check the AA sequence to see if there is a stop symbol, but since I filter those sequences out eventually anyways I figured maybe I can avoid doing a whole bunch of translation() calls:

for rec in read1:
    nucSeq = rec.seq
    aaSeq = Seq.translate(nucSeq)

    if '*' in str(aaSeq):
        starSeqs += 1
    else:   
        # do awesome stuff!

PS: I could obviously do a RegEx search over the sequence for the stop codons, but that's neither pretty nor efficient. I was wondering if there is a method/api in the library that does the black magic for me.

biopython sequencing • 3.5k views
ADD COMMENT
1
Entering edit mode

Just curious, would a regex search really be less efficient, and why?

ADD REPLY
0
Entering edit mode

It would be difficult to really know without testing it, but regexs are usually pretty slow (O(n)) compared to a static string match. I've seen a lot of whining about how Java's regex (which is in everything, like split()) is O(n) when people even provide hints like start/end of line anchors, etc. I mean, it's not a big issue which is what i think you're really getting at - reading the file off the disk is probably 99% of the time taken to execute...

ADD REPLY
5
Entering edit mode
8.7 years ago
jsgounot ▴ 170

You mean doing something with python like this ?

stop_codons = ["TGA", "TAG", "TAA"]
sequence = "ATGAAATGA"
sequence2 = "ATGAAATTA"

is_stop_codon = lambda x : any(x[i:i+3] in stop_codons for i in range(0,len(x),3))

print is_stop_codon(sequence)
print is_stop_codon(sequence2)
ADD COMMENT
1
Entering edit mode
8.7 years ago
John 13k

jsgounot's answer is really great, although it only works for the first frame. I came up with this, but I don't know if it would be any faster than jsgounot's.

def find_all(string):
    idx = 0
    indexes = []
    try:
        while True:
            idx = string.index('T',idx+1)
            if string[idx:idx+3] in ("TGA", "TAG", "TAA"): indexes.append(idx)
    except: return indexes

frameCounter = {1:0, 2:0, 3:0}
for idx in find_all('ACTAGCACTGGCCTGATTCGATCGACTAGCA'):
    frameCounter[(idx%3)+1] += 1

for x,y in frameCounter.items(): print 'Frame:',x,'Stops:',y

---------
output
---------
Frame: 1 Stops: 0
Frame: 2 Stops: 1
Frame: 3 Stops: 2

Ideally you would want to take the input, put it into a multidimentional array where you can stripe across the array to get the DNA in 1 of the 3 frames, then once a stop is found in 1 frame stop looking there in the future. You could also do that with splicing that uses variable jumps, but the more time you spend in python and the less in C/Fortran the slower its going to get.

ADD COMMENT

Login before adding your answer.

Traffic: 1228 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6