I am trying to read ORF from a list of sequences and identify the one with maximum length and correspond it to the sequence and the reading frame it belongs to as the output. I wrote the code and it's not quite working properly.
def startstop_codon(dna,frame):
""" TO CHECK THE PRESENCE OF START CODON AND ITS POSITION"""
start_codons=['ATG','atg'] ; stop_codons=['TAA','TAG','TGA','taa','tag','tga'] #referece list of start and stop codons
for i in range(frame,len(dna),3):
codon1=dna[i:i+3]
if codon1 in start_codons:
position1=dna.index(codon1)#getting the index of the start codon
ORF_start=position1+1
for j in range(position1,len(dna),3):
codon2=dna[j:j+3]
if codon2 in stop_codons:
position2=dna.index(codon2)#getting the index of the stop codon
ORF_stop=position2+1
break #terminating the loop when a stop codon is found
break
try:
return len(dna[position1:position2])+2
except UnboundLocalError:
None
def finding_ORF(frame):
"""TO IDENTIFY THE ORF IN THE READING FRAMES AND
IDENTIFY THE LONGEST ORF IN EACH SEQUENCE AND THE POSITION OF THE START CODON"""
sequences=[reads.seq for reads in records] #creating alist of all the sequence reads as elements of the list
ORF_lengths={}
positions={}
max_lenth_ORFs=[]
for read in range(len(records)): #loop for accessing each sequence read from the list
seq=str(sequences[read])
a=startstop_codon(seq,frame) #calling the codon reading function
if a== None: #assigning value for reads with no codons
a = 0
ORF_lengths[records[read].id]=a #creating a dictionary of ORF lengths(value) and sequence IDs(key)
max_len=max(ORF_lengths,key=ORF_lengths.get) #getting the ID of the Sequence read containing the maximum length ORF
#printing the result
print("Longest ORF in Frame %d is of length %d and its ID is %s." % (frame+1,ORF_lengths[max_len],max_len))
"I have only given the functions that do this job from the whole script"
A sample output I am getting is:
Longest ORF in Frame 1 is of length 1310 and its ID is gi|142022655|gb|EQ086233.1|229.
Longest ORF in Frame 2 is of length 1310 and its ID is gi|142022655|gb|EQ086233.1|229.
Longest ORF in Frame 3 is of length 1310 and its ID is gi|142022655|gb|EQ086233.1|229.
The lengths are not correct and for somereason, it gets only stuck in a single read as in read 229 here. I spent hours looking for the mistake, but could'nt. can someone please look it up and tell me how I can refine it. THANKS IN ADVANCE!
The problem is with
position1 = dna.index(codon1)
andposition2 = dna.index(codon2)
. If the same codon is present more than once indna
, index() method always returns its smallest/first position. That is why, you get the same ORF length all the time. You can fix this by assigningposition1 = i
andposition2 = j
.Thank you. It is working now. But the issue I am facing now is it's reading only one orf per sequence and if it finds the first one, it stops checking for more. presence of more than ORF per sequence and the second on being longer than the first is complicating the problem. any idea how to overcome this?
Please see my answer.
At a quick glance you may need to show us more of the script, where these functions are actually called. Based on your output I’d guess that your loop over your file is incorrect, rather than your ORF finding.
Since I can run a "Fasta" file, I'm entering it:
for orflen, orf in startstop_codon ('dna.example.fasta', 0): print (orflen, orf)
And the command is not reading the ORFs
This function reads just the sequence and not fasta files directly. You need to parse the fasta file and extract only the sequence before you can run the function. Hope it helps.
Can you please show the entire code for ORF reading?