I am using this as a learning exercise, I realize there are existing tools to determine the end result. The goal is to gain experience with Python and nuances of syntax.
Given that I have a fasta file I am parsing, and want to find non-overlapping ORFs in all possible frames (0,1,2). My code I have written thus far is below:
def getORF(seq, treshold, start, stop):
start = ["ATG"]
stop = ["TAA","TAG","TGA"]
start_codon_index = 0
end_codon_index = 0
start_codon_found = False
for frame in range(3):
q=frame
for i in range(q, len(seq), 3):
current_codon = seq[i:i+3]
if current_codon in start and not start_codon_found:
start_codon_found = True
start_codon_index = i
e=start_codon_index+1
if current_codon in stop and start_codon_found:
end_codon_index = i
d=end_codon_index+3
length = end_codon_index - start_codon_index + 3
if length >= treshold * 3: #treshold is protein length but length is nuc
b=length//3 + 3
start_codon_found = False
print("aa %i, nuc %i, frame %i, coord %i:%i\n" % (b, length, q+1, e, d))
However, when I run the following (myfile contains a single sequence record, but goal is to use multiple fasta record) it only returns results for the first 2 frames.
for record in SeqIO.parse(myfile,'fasta',IUPAC.unambiguous_dna):
seq=record.seq
name=record.id
print("%s\n" % name)
getORF(seq,2,start,stop)
I know there are ORFs in frame 3, and I can find start and stop codons using the following scripts:
def test(seq, start, stop):
start = ["ATG"]
stop = ["TAA","TAG","TGA"]
start_codon_index = 0
end_codon_index = 0
q=2
for i in range(q, len(seq), 3):
current_codon = seq[i:i+3]
if current_codon in start:
start_codon_index = i
print("Start at %i " % start_codon_index)
def test2(seq, start, stop):
start = ["ATG"]
stop = ["TAA","TAG","TGA"]
start_codon_index = 0
end_codon_index = 0
q=2
for i in range(q, len(seq), 3):
current_codon = seq[i:i+3]
if current_codon in stop:
stop_codon_index = i
print("Stop at %i " % stop_codon_index)
for record in SeqIO.parse(myfile,'fasta',IUPAC.unambiguous_dna):
seq=record.seq
name=record.id
print("%s\n" % name)
test(seq,start,stop)
for record in SeqIO.parse(myfile,'fasta',IUPAC.unambiguous_dna):
seq=record.seq
name=record.id
print("%s\n" % name)
test2(seq,start,stop)
I can't figure out why I am getting no results for frame 2. I have even tried throwing in 4 as my range for frame, and instead I get a result for a non existent frame 4 and no results for frame 3.... Clearly I am missing something obvious! Thanks in advance for any help!
Many thanks! I knew I had to be overlooking something simple!