Entering edit mode
4.9 years ago
kmussar
•
0
Hi,
I'm trying to find the longest ORFs in a fasta file containing a few genomic sequences. I've been working with the function in the biopython documents, but am not sure I quite understand the code. Any clarification would be helpful!
My questions:
- Do I need to worry about iterating through nucleotides in groups of 3? Since this function translates into proteins, I don't think I do?
- For the standard code, translation table 1), why is the amino acid L being marked as a start codon? (https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG1) I was under the impression only M (AUG/ATG) is a start codon.
- The code given in the python documents (http://biopython.org/DIST/docs/tutorial/Tutorial.pdf) does not include any statements indicating that the ORF should start at a Met/AUG/ATG. Why is this?
- I believe I am finding the longest ORF by finding the first M and going until reaching a *. I may be missing shorter reading frames in this example by not looking for subsequent M downstream of the first one. But I think that is okay given the question - to find the longest ORFs. Is this reasoning correct?
Code:
table = 1Output of code listed below as image.
min_pro_len = 100
ORF_f2 = []
def find_orfs(filename):
for record in SeqIO.parse(filename, "fasta"):
for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
for frame in range(3):
# Splits on stop codons
for pro in nuc[frame:].translate(table).split("*"):
# Truncate sequence to start with the start codon
pro = pro[pro.find('M'):]
if frame == 2:
# add length of sequence to list if in frame 2
ORF_f2.append(len(pro))
if len(pro) >= min_pro_len:
print("%s...%s - length %i, strand %i, frame %i" \
% (pro[:30], pro[-3:], len(pro), strand, frame))
print(max(ORF_f2))
find_orfs('dna.example.fasta')
Output of code:
MSTFNRLHLIRQVDLFTLKLFVSAVEEQQI...PRD - length 305, strand 1, frame 2
MQPIVLGRLEIQKVVEMETSLPIEVLFPGV...TQV - length 294, strand 1, frame 2
MASSQGTVDFIVEQMAAAGTVSARKMFGEY...RRR - length 107, strand 1, frame 2
MSSFLCSRRMRDGRAHVSQNTPGRSATWHA...PAS - length 247, strand -1, frame 2
MAVLVQNVTLSLMKVSDRGTQNFSFSRPEN...IRR - length 128, strand -1, frame 2
MHRATGFERTPATHPVGGDRERIVQRPIPV...DDQ - length 535, strand 1, frame 2
MPTDDASAASIDAGGTRNAWRPGLGAHEPD...RTG - length 299, strand 1, frame 2
MRLSHERLEKDLLSKPTTLRDSITELRRLS...VHV - length 314, strand -1, frame 2
(truncated)
GTG (V) and TTG (L) are alternative start cordons. The most common one is ATG (M), the other two are rare in eukaryotes.
You don't need to group by 3 because of the translate function as you mentioned.
This link might also help:
How to extract the longest orf?