Hi everyone, I recently started using Python with Biopython. As practice, I'm trying to find and translate the ORF of this GenBank record: NM_100684.3
However, my output does not show the correct ORF: the amino acid sequence I get differs in both composition and length.
What am I doing wrong?
This is the code I used:
>>> from Bio import SeqIO
>>> record = SeqIO.read("sequence.fasta", "fasta")
>>> table = 1
>>> min_pro_len = 100
>>> for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
...     for frame in range(3):
...         length = 3 * ((len(record)-frame) // 3)  # Multiple of three
...         for pro in nuc[frame:frame+length].translate(table).split("*"):
...             if len(pro) >= min_pro_len:
...                 print("%s...%s - length %i, strand %i, frame %i" \
...                       % (pro[:30], pro[-3:], len(pro), strand, frame))
...
YSDIDQINLNQISNLQRNLKYFITMGDSTG...NNV - length 554, strand 1, frame 2
SSPGDKGHNCKGGSASSLCPHREEHHSHNG...ILT - length 162, strand -1, frame 1
IEHQDSHDDVQPTGYKEGDPPGREGCGTAA...HNW - length 216, strand -1, frame 1
TKVTGNVQATIITPIHVSPCSVVKCEVEKK...SDA - length 122, strand -1, frame 2
That is my output, but it is not correct: none of the sequences starts with methionine. In GenBank the correct protein has 530 aa and starts with "MGDSTGEPGSSMHGVTGREQ ..."
From the docs you're following:
As it stands, all you've assessed is regions uninterrupted by stop codons; you haven't gone on to the next step of identifying start codons etc.
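One way to take that next step, as a rough sketch rather than a definitive ORF finder: trim each stop-free segment so that it begins at its first methionine. The helper name `find_orfs` is hypothetical (it is not part of Biopython), and it assumes you pass it the translated frame as a plain string, e.g. `str(nuc[frame:frame+length].translate(table))`. It also assumes the first M in a segment is the real start, which is not always true.

```python
def find_orfs(protein, min_pro_len=100):
    """Yield (start_index, peptide) for each stop-free segment of a
    translated frame, trimmed so the peptide begins at its first 'M'.

    `protein` is a plain string in which '*' marks stop codons.
    `start_index` is the amino acid offset of the 'M' within `protein`.
    Assumption: the first M in a segment is treated as the start codon.
    """
    pos = 0  # offset of the current segment within the translated frame
    for segment in protein.split("*"):
        m = segment.find("M")
        if m != -1:
            orf = segment[m:]  # drop everything before the first M
            if len(orf) >= min_pro_len:
                yield pos + m, orf
        pos += len(segment) + 1  # +1 skips the '*' that ended the segment


# Toy example: three stop-free segments, two of which contain an M.
for start, orf in find_orfs("AAMKK*QQQ*LLMGGG", min_pro_len=3):
    print(start, orf)
```

Applied to the 554-residue segment in the output above, the first M sits after the 24 residues `YSDIDQINLNQISNLQRNLKYFIT`, so trimming there leaves 530 residues starting with MGDSTG, which matches the length and start that GenBank reports.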