ORF reading in Python
1
1
Entering edit mode
5.5 years ago
a.lathifbt ▴ 30

I am trying to read ORF from a list of sequences and identify the one with maximum length and correspond it to the sequence and the reading frame it belongs to as the output. I wrote the code and it's not quite working properly.

def startstop_codon(dna,frame):
    """ TO CHECK THE PRESENCE OF START CODON AND ITS POSITION"""
    start_codons=['ATG','atg'] ; stop_codons=['TAA','TAG','TGA','taa','tag','tga'] #referece list of start and stop codons
    for i in range(frame,len(dna),3):
        codon1=dna[i:i+3]
        if codon1 in start_codons:
            position1=dna.index(codon1)#getting the index of the start codon
            ORF_start=position1+1
            for j in range(position1,len(dna),3):
                codon2=dna[j:j+3]
                if codon2 in stop_codons:
                    position2=dna.index(codon2)#getting the index of the stop codon
                    ORF_stop=position2+1
                    break #terminating the loop when a stop codon is found
             break
    try:
        return len(dna[position1:position2])+2
    except UnboundLocalError:
        None 

def finding_ORF(frame):
    """TO IDENTIFY THE ORF IN THE READING FRAMES AND 
    IDENTIFY THE LONGEST ORF IN EACH SEQUENCE AND THE POSITION OF THE START CODON"""
    sequences=[reads.seq for reads in records] #creating alist of all the sequence reads as elements of the list
    ORF_lengths={}
    positions={}
    max_lenth_ORFs=[]
    for read in range(len(records)): #loop for accessing each sequence read from the list
        seq=str(sequences[read])
        a=startstop_codon(seq,frame) #calling the codon reading function
        if a== None: #assigning value for reads with no codons
            a = 0
        ORF_lengths[records[read].id]=a #creating a dictionary of ORF lengths(value) and sequence IDs(key)
    max_len=max(ORF_lengths,key=ORF_lengths.get) #getting the ID of the Sequence read containing the maximum length ORF
    #printing the result
    print("Longest ORF in Frame %d is of length %d and its ID is %s." % (frame+1,ORF_lengths[max_len],max_len))

"I have only given the functions that do this job from the whole script"

A sample output I am getting is:

Longest ORF in Frame 1 is of length 1310 and its ID is gi|142022655|gb|EQ086233.1|229.
Longest ORF in Frame 2 is of length 1310 and its ID is gi|142022655|gb|EQ086233.1|229.
Longest ORF in Frame 3 is of length 1310 and its ID is gi|142022655|gb|EQ086233.1|229.

The lengths are not correct and for somereason, it gets only stuck in a single read as in read 229 here. I spent hours looking for the mistake, but could'nt. can someone please look it up and tell me how I can refine it. THANKS IN ADVANCE!

gene sequence • 14k views
ADD COMMENT
1
Entering edit mode

The problem is with position1 = dna.index(codon1) and position2 = dna.index(codon2). If the same codon is present more than once in dna, index() method always returns its smallest/first position. That is why, you get the same ORF length all the time. You can fix this by assigning position1 = i and position2 = j.

ADD REPLY
0
Entering edit mode

Thank you. It is working now. But the issue I am facing now is it's reading only one orf per sequence and if it finds the first one, it stops checking for more. presence of more than ORF per sequence and the second on being longer than the first is complicating the problem. any idea how to overcome this?

ADD REPLY
0
Entering edit mode

Please see my answer.

ADD REPLY
0
Entering edit mode
  1. Please format your code properly using the code formatting options
  2. Make your life easier using the getorf program, then filter the output and only retain the longest ORF per sequence
ADD REPLY
0
Entering edit mode

At a quick glance you may need to show us more of the script, where these functions are actually called. Based on your output I’d guess that your loop over your file is incorrect, rather than your ORF finding.

ADD REPLY
0
Entering edit mode

Since I can run a "Fasta" file, I'm entering it:

for orflen, orf in startstop_codon ('dna.example.fasta', 0):           print (orflen, orf)

And the command is not reading the ORFs

ADD REPLY
1
Entering edit mode

This function reads just the sequence and not fasta files directly. You need to parse the fasta file and extract only the sequence before you can run the function. Hope it helps.

ADD REPLY
0
Entering edit mode

Can you please show the entire code for ORF reading?

ADD REPLY
7
Entering edit mode
5.5 years ago

I edited your startstop_codon() function to handle multiple ORFs in a single DNA sequence.

def startstop_codon(dna, frame):
    dna = dna.upper()
    for i in range(frame, len(dna), 3):
        codon1 = dna[i:i+3]
        if codon1 == 'ATG':
            position1 = i
            for j in range(position1, len(dna), 3):
                codon2 = dna[j:j+3]
                if codon2 in ['TAA', 'TAG', 'TGA']:
                    position2 = j
                    yield (position2-position1+3, dna[position1:position2+3])
                    break

For example, running the function the following way:

for orflen, orf in startstop_codon('ATGabcabcTGAabcATGabcTAG', 0):
    print(orflen, orf)

will give you:

(12, 'ATGABCABCTGA')
(9, 'ATGABCTAG')

or you can directly translate the above output to a list:

print(list(startstop_codon('ATGabcabcTGAabcATGabcTAG', 0)))

which will give you:

[(12, 'ATGABCABCTGA'), (9, 'ATGABCTAG')]
ADD COMMENT
0
Entering edit mode

Thank you so much. I wasn't familiar with yield function. It's helpful :)

ADD REPLY
0
Entering edit mode

Glad it helped. Does this answer your question?

ADD REPLY
0
Entering edit mode

Yes, it's helpful and it answer the question.

ADD REPLY
0
Entering edit mode

Don't forget to accept and upvote the answers which were helpful to you :)

ADD REPLY
0
Entering edit mode

Thanks for clean and easy to understand code. I am trying to write a program for finding ORFs for my multisequence FASTA file. for the above example you gave shouldn't there be one more open reading frame ATGabcabcTGAabcATGabcTAG of 24 length, that is, fromthe first start codon to the last stop codon. Thanks. Also, when you input '0' for 'frame' in the function parameters, does it mean the first starting position. how can we write this function to be iterative for all six reading frames?

ADD REPLY
0
Entering edit mode

shouldn't there be one more open reading frame ATGabcabcTGAabcATGabcTAG of 24 length

No, because there is a stop codon TGA in frame 1, an ORF is free from stop codons.

I am trying to write a program for finding ORFs for my multisequence FASTA file.

It might help to know that getOrf from the EMBOSS package already does that. Unless this is for practicing your programming skills, there is no real need to implement it again.

ADD REPLY
0
Entering edit mode

Thanks Michael. I am writing these codes for practicing, that is why not using any prewritten functions. But definitely can use getOrf to cross-check my answers.

ADD REPLY

Login before adding your answer.

Traffic: 1639 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6