How I can make an open reading frame finder with Python ?
2
0
Entering edit mode
8.7 years ago
Kevin_Smith ▴ 10

I need to write a program to find open reading frames in a DNA sequence. The program should take as input the provided sequences in FASTA format (“sequence_A.fa” and “sequence_B.fa”), and supply as output:

(1) The sizes of the potential ORFs greater than 30 amino acids from all 3 forward reading frames. (2) The translations into protein of the ORFs. (3) The ORF does not have to begin with an ATG, but should be any sequence of nucleotides that encodes a polypeptide of >30 amino acids. (4) Output a peptide each line with this format: frame #: length_of_peptide sequence_of_peptide . I have a code snippet to set up a Python dictionary of codons in a file called “codondictionary.py” which I can copy into the program.

I will appreciate very much any help for a Python script.

sequence gene python homework • 13k views
ADD COMMENT
2
Entering edit mode
  1. What have you tried? Post a link to your code.
  2. You can do all of this with biopython.
ADD REPLY
1
Entering edit mode

Some questions:

  1. Why do you not consider an ATG important? I'm afraid you'll get a bunch of false positives.
  2. Why only 3 reading frames and not 6?
  3. Essentially you are looking for stretches of 90 or more nucleotides before a stop_codon and want these translated?
  4. How big is your input data?
ADD REPLY
0
Entering edit mode

The ORF does not have to begin with an ATG, I just want to obtain any sequence of nucleotides that encodes a polypeptide of >30 amino acids. The size of the input is 1040 nucleotides. 3 reading frames is fine, I just want to see the ones in the forward direction. Thanks

ADD REPLY
7
Entering edit mode
8.7 years ago

I have my doubts about your biological question, but here you go: Save script as e.g. ORFfinder.py and execute as python ORFfinder.py yourinput.fasta

import sys
from Bio import SeqIO
from Bio import Seq

record = SeqIO.read(open(sys.argv[1]), "fasta")
#Create three reading frames in forward direction, offset 0, 1, 2
readingframes = [Seq.translate(record.seq[i:], table='Standard', stop_symbol='*', to_stop=False, cds=False) for i in range(3)]

results = []
for frame in readingframes:
    for peptide in frame.split('*'): #Split translation over stopcodons
        if len(peptide) > 30:
            results.append(peptide)

#Use PotentialORFs.txt as output, can be changed            
#Write length and translation to file
with open('PotentialORFs.txt', 'w') as output: 
    for peptide in results:
        output.write("{}\t{}\n".format(len(peptide), peptide))
ADD COMMENT
0
Entering edit mode

I got no module named Bio. What this mean? Thanks

ADD REPLY
0
Entering edit mode

Is posible to do the program without biopython ?

ADD REPLY
1
Entering edit mode

You absolutely require biopython for this code to work.

I hope you have pip installed to install python packages:

(sudo) pip install biopython

Alternatively, have a look here: http://biopython.org/wiki/Download

ADD REPLY
1
Entering edit mode

FYI, pip install --user biopython is usually preferable.

ADD REPLY
1
Entering edit mode
8.7 years ago
Daniel ★ 4.0k

Check out the "Python for Bioinformatics" book (look here for the section on protein translations) or Python for biologists here.

ADD COMMENT

Login before adding your answer.

Traffic: 1593 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6