Getting in frame stop codons from fasta
1
0
Entering edit mode
8.6 years ago
User000 ▴ 750

Hello, I need to find a start codon and all in frame stop codons related to that primary start codon, i.e startCodon_AGGAAG_stopCodon(1)_GAAGGTAACAGCTCTG_stopCodon(2)_ATCAAGA. This is my code, which gives me the positions of all ORF. How can I modify it to achive positions of start codon and all in frame stop codons related to that primary start codon.

from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
def getORF(sequence, treshold, start_codons, stop_codons):
    orfs = []
    for j in range(0, 3):
        start_codon_index = 0
        end_codon_index = 0
        start_codon_found = False
        for indx in range(j, len(sequence), 3):
            current_codon = sequence[indx:indx+3]
            if current_codon in start_codons and not start_codon_found:
                start_codon_found = True
                start_codon_index = indx
                s=start_codon_index+1
            if current_codon in stop_codons and start_codon_found:
                end_codon_index = indx
                e=end_codon_index+3
                length = end_codon_index - start_codon_index + 3
                if length >= treshold * 3:
                    orfs.append(s)
                    orfs.append(e)
                start_codon_found = False
    return orfs
f = open("myfile.fa","r")
o = open("out.txt","w")
start = ["ATG"]
stop = ["TAA","TAG","TGA"]
for record in SeqIO.parse(f,'fasta'):
    seq=record.seq
    #compl_seq =record.seq.reverse_complement()
    name=record.id
    orfs = getORF(seq, 30, start, stop)
    #complement_orfs = getComplementORF(compl_seq, 30, start, stop)
    print >> o, name, '+', orfs, '-', complement_orfs
python • 7.0k views
ADD COMMENT
1
Entering edit mode

Hi User000,

I'll try to give you some assistance in this, but I'm a bit confused by your example: startCodon_AGGAAG_stopCodon(1)_GAA**GGT**AAC**AGC**TCTG_stopCodon(2)_ATCAAGA Why is the part between stopCodon(1) and stopCodon(2) not in-frame? Why is there still a stretch of nucleotides after your final stopCodon?

Furthermore, how large is your input? That influences if we can do comfortably everything in memory or whether we have to think of writing memory-efficient scripts ;)

ADD REPLY
0
Entering edit mode

Hi, thanks a lot. These are my stop codons "TAA","TAG","TGA". I have a lot of artificial cds, (50K seq-s) and I would like to find a 1st start codon and all in frame stop codons related to that primary start codon..

ADD REPLY
0
Entering edit mode

Yes I see, but in your example, the second stop codon is not in frame...

So actually, you just have to search for the first start codon in the sequence. As soon as you found that, you can make in frame codons following that start codon. Then you just need to translate these codons and identify stop codons. Or am I oversimplifying things?

ADD REPLY
0
Entering edit mode

I explain better. I have CDS(those do not necessarily have a stop codon),so I add several bp artificially to my cds seq-s and want find all stop codons that are in frame with the 1st start codon. And i need positions of this start and stop codons..In the code above, it gives me start codon and stop codon (ORF), then it looks for the next start codon and stop codon. I dont know if it is explained better...

ADD REPLY
0
Entering edit mode

Might be caffeine deprivation of my brain, but it's not yet completely clear to me. Why do you add artificial nucleotides to the cds?

ADD REPLY
0
Entering edit mode

not artificial nucleotides :) I lengthen them for ie 100 bp obviously using coordinates from genome, in such a way creating kind of artificial CDS, so I am interested to see how far away are my stop codons, so I can evaluate whether to lenghen my cds o not...

ADD REPLY
0
Entering edit mode

So let me try to summarize: 0) input is fasta 1) first job: find the first start codon 2) then: find all stop codons in frame with this first start codon 3) report as: startCodon_AGGAAG_stopCodon(1)_GAAGGTAACAGCTCT_stopCodon(2)_blablabla_finalStopCodon

ADD REPLY
0
Entering edit mode

resport as name_id pos_start_fr1 pos_stop1 pos_stop2, pos_start_fr2 pos_stop1 post_stop2

ADD REPLY
0
Entering edit mode

So if I explain roughly the algorithm is like that: take a sequence and look for start codon in 3 frames (actually 6 cos I need also complementary),find 1st start codon write it,search for in frame stop codon,write it,search for the next in frame stop codon write it..smth like this, does it make sense?

ADD REPLY
0
Entering edit mode

Do you need per sequence just one start codon, or the first start codon for each of the 6 reading frames?

ADD REPLY
0
Entering edit mode

start codon for each of the 6 frames if there is. so I prefer the output to be like this (+1 pos_start codons pos_stop_codon1 pos_stop_codon2, +2 pos_start codons pos_stop_codon1 pos_stop_codon2 ),otherwise I dont need frames written, but if divided by a comma I can understand that these are different frames..

ADD REPLY
0
Entering edit mode

Hahha, this thread xD

ADD REPLY
2
Entering edit mode

I'll probably write some real code with some pseudocode tomorrow to get him started. First thing I thought yesterday was "not another ORF script"...

ADD REPLY
0
Entering edit mode

What happens if there is more than one ATG in the fasta entry? Is only the first considered the start?

In which case, could the output look something like:

ID13123
    ORF 1: None
    ORF 2:
        Start - 342
        Ends - 453,499,589
    ORF 3:
        Start - 32
        Ends - 94,399,909,4302
ID13124
    ....
ADD REPLY
0
Entering edit mode

I would consider only the first ATG as start and would like results as in the code (start1 end1, start1 end2) or (start1, end1, end2).

ADD REPLY
4
Entering edit mode
8.6 years ago

After the lovely discussion above it's time to start thinking about answers :) I wanted to write a few pointers to help you write the script, but got carried away while writing and produced it almost completely. Left a few pieces for you to complete ;)

If anything is unclear, let me know.

As you can see I splitted the function(s) in multiple very simple functions which only do one job. Because that makes writing, thinking and debugging so much easier. Code is far more readable. You start by writing just the frame work. First you need readingframes. Then you need startcodons, then you need stopcodons. Every function does his part of the chain. The function makeframes() is for you to write. If you have question about how to solve it or change the code to suit your needs, let me know!

import sys
from Bio import SeqIO
def findstart(sequence):
"""
Takes a sequence, locates the start codon, returns the rest of the sequence split by codons
"""
try: #We are not sure a start codon is present, therefore I use the try-except construct to catch potential errors without crashing the script
start = sequence.index("ATG") #the string method .index() returns the first position of "ATG" if any is present, if not it gives a ValueError
pseudoCDS = sequence[start:] #Create a new string starting from the start codon
return([pseudoCDS[i:i+3] for i in range(0, len(pseudoCDS), 3)]) #Split to codons with a nifty list comprehension, it's actually a for loop on one line
except ValueError: #In the case the startcodon is not found this block will be triggered and we can decide what to do with that exception
print("No start codon found.") #Or ignore, depending on what you want
return(0) #Since this function returns 0 when no start codon is found, this situation is currently silently ignored
def makeframes(sequence):
"""
Takes a sequence and returns 6 reading frames in a list
"""
####incomplete!
return([sequence])
def identifystops(sequence):
"""
Taking a list of codons from a sequence starting with a startcodon
identifies stopcodons in the same reading frame, converting them to "___STOP___"
"""
assert sequence[0] == "ATG" #The assert statement is extremely useful for sanity checks in your code. You use it if you are quite sure that it will be true but just to check if everything went fine and there are no bugs in the code
output = [] #An output list to contain the result
for elem in sequence: #Remember that I split up the sequence in codons, which are now used to loop over
if elem in ["TAA","TAG","TGA"]:
output.append("___STOP___") #When a stopcodon is found, replace it by a "___STOP___" in the output list
else:
output.append(elem) #If the codon is no stopcodon, just keep the codon
return("".join(output)) #Return the output, but join the list to a string so it again is a continuous sequence
def main(sequence):
"""
This is the master function, taking care of the other functions
"""
frames = makeframes(sequence)
#assert len(frames) == 6 #Assertion is not valid if function makeframes isn't written, therefore this line is written as a comment. I expect 6 sequences in the frames list so I check the length and it should be six if the makeframes function is valid
for readingframe in frames: #for every readingframe...
firststart = findstart(readingframe) #...look for a startcodon...
if firststart: #Check if return value of findstart is a list (True) or '0' (False)
output = identifystops(firststart) #...and look for stopcodons...
print(output)
#do something with the output
sequences = [str(item.seq) for item in SeqIO.parse(open(sys.argv[1]), "fasta")] #list comprehension to properly format the input the way I want it
for sequence in sequences:
main(sequence) #Execute the main function for every input sequence

ADD COMMENT
0
Entering edit mode

thank you so much! Looks super difficult for me, as I don't use a lot of things you used. About frames, may be I can create 3 frames and at the end use compl_seq =record.seq.reverse_complement() as in my code above? Also, while thinking how to do it, I discovered that there is a orf predictor that does almost what I want, I wish I could have it's script, but I guess it is written in perl.

ADD REPLY
1
Entering edit mode

Take your time to go through it and I'd be happy to explain parts and functions you don't understand. But with perl stuff I can't help you ;)

Yes, you can use the .reverse_complement() method BUT I already converted each sequence to a string in line 48 in the list comprehension (my favorite construction in python). So just using .reverse_complement() no longer works, because a string is not a seq object and doesn't have a .reverse_complement() method. But of course you can change line 48 so the object remains a seq object by removing the str() function. But you need to convert it to a string then in the makeframes() function, because the code needs strings downstream. This is because a seq object doesn't have a .index() method. Is that clear?

ADD REPLY
0
Entering edit mode

Yes, more or less! I might come back to you later with questions!THNAKS

ADD REPLY
0
Entering edit mode

Dear @Wouter, the "lovely discussion" you have mentioned was cool!

And explaining different parts of the valuable script would be a class for me, too

Thank you

ADD REPLY
1
Entering edit mode

For "educational purposes" I added comments to the code above explaining almost every line, let me know if something isn't clear.

ADD REPLY
0
Entering edit mode

Hi! So,I added makeframes function in the code above, smth like this:

for n in range(0,len(sequence),3):
    frames = sequence[n:n+3]
return([sequence])

It is giving me the result as seq-s, while I want ----- > a header, position_start_codon, pos_stop,pos_next_stop How to modify?thanks

ADD REPLY
0
Entering edit mode

That makeframes() function doesn't make sense. You are just returning sequence in a list, and your frames = sequence[n:n+3] is not doing what you think it does.

For getting the position you'll need to use the index, for example using enumerate.

ADD REPLY
0
Entering edit mode

for frame in range(0,3): return([sequence])

?

ADD REPLY
0
Entering edit mode

You only need 3 reading frames? Then why not just

def makeframes(sequence):
    return([sequence, sequence[1:], sequence[2:]])
ADD REPLY
0
Entering edit mode

I need 6, but I was thinking to use reserve_compl later on..

ADD REPLY
0
Entering edit mode

Possible, but I think it would make more sense to do it in this step and return 6 reading frames. However, you already decided that you need positions rather than sequences and you are free to write this script as you feel like :)

ADD REPLY
0
Entering edit mode

Ok, I try.. thanks you

ADD REPLY

Login before adding your answer.

Traffic: 1587 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6