Question

Trouble Finding ORFs in DNA Sequence

0

Entering edit mode

4.9 years ago

kmussar • 0

Hi,

I'm trying to find the longest ORFs in a fasta file containing a few genomic sequences. I've been working with the function in the biopython documents, but am not sure I quite understand the code. Any clarification would be helpful!

My questions:

Do I need to worry about iterating through nucleotides in groups of 3? Since this function translates into proteins, I don't think I do?
For the standard code, translation table 1), why is the amino acid L being marked as a start codon? (https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG1) I was under the impression only M (AUG/ATG) is a start codon.
The code given in the python documents (http://biopython.org/DIST/docs/tutorial/Tutorial.pdf) does not include any statements indicating that the ORF should start at a Met/AUG/ATG. Why is this?
I believe I am finding the longest ORF by finding the first M and going until reaching a *. I may be missing shorter reading frames in this example by not looking for subsequent M downstream of the first one. But I think that is okay given the question - to find the longest ORFs. Is this reasoning correct?

Code:

table = 1Output of code listed below as image. 
min_pro_len = 100
ORF_f2 = []

def find_orfs(filename):    
    for record in SeqIO.parse(filename, "fasta"):
        for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
            for frame in range(3):
                # Splits on stop codons
                for pro in nuc[frame:].translate(table).split("*"):
                    # Truncate sequence to start with the start codon
                    pro = pro[pro.find('M'):]
                    if frame == 2: 
                        # add length of sequence to list if in frame 2
                        ORF_f2.append(len(pro))
                        if len(pro) >= min_pro_len:
                            print("%s...%s - length %i, strand %i, frame %i" \
                                % (pro[:30], pro[-3:], len(pro), strand, frame))

    print(max(ORF_f2))

find_orfs('dna.example.fasta')

Output of code:

MSTFNRLHLIRQVDLFTLKLFVSAVEEQQI...PRD - length 305, strand 1, frame 2
MQPIVLGRLEIQKVVEMETSLPIEVLFPGV...TQV - length 294, strand 1, frame 2
MASSQGTVDFIVEQMAAAGTVSARKMFGEY...RRR - length 107, strand 1, frame 2
MSSFLCSRRMRDGRAHVSQNTPGRSATWHA...PAS - length 247, strand -1, frame 2
MAVLVQNVTLSLMKVSDRGTQNFSFSRPEN...IRR - length 128, strand -1, frame 2
MHRATGFERTPATHPVGGDRERIVQRPIPV...DDQ - length 535, strand 1, frame 2
MPTDDASAASIDAGGTRNAWRPGLGAHEPD...RTG - length 299, strand 1, frame 2
MRLSHERLEKDLLSKPTTLRDSITELRRLS...VHV - length 314, strand -1, frame 2

(truncated)

python biopython • 1.7k views

ADD COMMENT • link 4.9 years ago by kmussar • 0

0

Entering edit mode

GTG (V) and TTG (L) are alternative start cordons. The most common one is ATG (M), the other two are rare in eukaryotes.

You don't need to group by 3 because of the translate function as you mentioned.

This link might also help:

How to extract the longest orf?

ADD REPLY • link 4.9 years ago by Fatima ▴ 1000