Question

Beginner in Python- need in assistance translating DNA into protein sequence

0

Entering edit mode

8.9 years ago

oki4 ▴ 10

Homework Problem: Write a Python program that translates a DNA sequences into a protein sequences. In other words, your script should convert all codons into amino acids. For this problem, the input file is in FASTA format with at least one sequence. You may NOT assume that the sequence length is always a multiple of 3 or that sequences will have the same length. Use the codon2aa map given in the code examples above. You will first have to transcribe the DNA sequence to RNA before being able to use it.

Program so far:

def transcribe(sequence):
    for ch in seq:
        rna_seq = seq.replace('T', 'U')
        return(rna_seq)

def translate_dna(sequence):
    codon2aa = {"AAA":"K", "AAC":"N", "AAG":"K", "AAU":"N", 
                "ACA":"T", "ACC":"T", "ACG":"T", "ACU":"T", 
                "AGA":"R", "AGC":"S", "AGG":"R", "AGU":"S", 
                "AUA":"I", "AUC":"I", "AUG":"M", "AUU":"I", 

                "CAA":"Q", "CAC":"H", "CAG":"Q", "CAU":"H", 
                "CCA":"P", "CCC":"P", "CCG":"P", "CCU":"P", 
                "CGA":"R", "CGC":"R", "CGG":"R", "CGU":"R", 
                "CUA":"L", "CUC":"L", "CUG":"L", "CUU":"L", 

                "GAA":"E", "GAC":"D", "GAG":"E", "GAU":"D", 
                "GCA":"A", "GCC":"A", "GCG":"A", "GCU":"A", 
                "GGA":"G", "GGC":"G", "GGG":"G", "GGU":"G", 
                "GUA":"V", "GUC":"V", "GUG":"V", "GUU":"V", 

                "UAA":"_", "UAC":"Y", "UAG":"_", "UAU":"T", 
                "UCA":"S", "UCC":"S", "UCG":"S", "UCU":"S", 
                "UGA":"_", "UGC":"C", "UGG":"W", "UGU":"C", 
                "UUA":"L", "UUC":"F", "UUG":"L", "UUU":"F"}

    ##the find method will give you the first index position (int) of the codon you're referring to 
    start_codon = sequence.find('ATG')

    ##You create a string of the starting nucleotide (A in this case) and rest of the sequence 
    seq_start = sequence[int(start_codon):]
    ##This selects the first nucleotide in codon/starting with with zero, selects for multiples of 3; n are multiples of 3 starting with 0
    for n in (0, len(seq_start), 3):
        if seq_start[n:n+3] in codon2aa:
            protein_seq += codon2aa[seq_start[n:n+3]]
    return protein_seq


seq = ''
with open('dna.fasta.py') as f_in:
    for line in f_in.readlines():
        if line[0] == '>':
            seq += (line.strip('>') + '     ')      
        else:
            transcribe(line)
            translate_dna(line)
            seq += (line)
print(seq)

Error message I receive: Traceback (most recent call last): File "/Users/Oyinade/Documents/bnfo135/Homework2.py", line 52, in <module> translate_dna(line) File "/Users/Oyinade/Documents/bnfo135/Homework2.py", line 42, in translate_dna return protein_seq UnboundLocalError: local variable 'protein_seq' referenced before assignment

I don't what I'm doing wrong my code, can someone let me know if I'm heading in the correct decision. Thanks.

python • 44k views

ADD COMMENT • link updated 4.8 years ago by maram.mostafa08800 • 0 • written 8.9 years ago by oki4 ▴ 10

0

Entering edit mode

The last part of my program is in the wrong format, I don't why it did that but here it is correctly formatted:

seq = ""
with open('dna.fasta.py') as f_in:
    for line in f_in.readlines():
        if line[0] == '>':
            seq += (line.strip('>') + '     ')      
        else:
            transcribe(line)
            translate_dna(line)
            seq += (line)
print(seq)

ADD REPLY • link updated 8.9 years ago by WouterDeCoster 48k • written 8.9 years ago by oki4 ▴ 10

0

Entering edit mode

Try doing this inside a Try Except block and handle the error in Except block.

I found an implementation with web interface here. DNA to Protein Converter using Python 3 and Django 2.1

ADD REPLY • link 5.6 years ago by rkking • 0

score 5 · Answer 1 · 2016-12-07

5

Entering edit mode

8.9 years ago

Andrzej Zielezinski 11k

File: dna.fasta

>seq1
ATGCTGATGATAGGTATGGGTA
GATAGATGAGAGAGATGAGAAT
>seq2
ATGCGATGATAGATG
>seq3
ATGC

File: homework.py

def transcribe(sequence):
    rna_seq = sequence.replace('T', 'U')
    return(rna_seq)

def translate_rna(sequence):
    codon2aa = {"AAA":"K", "AAC":"N", "AAG":"K", "AAU":"N", 
                "ACA":"T", "ACC":"T", "ACG":"T", "ACU":"T", 
                "AGA":"R", "AGC":"S", "AGG":"R", "AGU":"S", 
                "AUA":"I", "AUC":"I", "AUG":"M", "AUU":"I", 

                "CAA":"Q", "CAC":"H", "CAG":"Q", "CAU":"H", 
                "CCA":"P", "CCC":"P", "CCG":"P", "CCU":"P", 
                "CGA":"R", "CGC":"R", "CGG":"R", "CGU":"R", 
                "CUA":"L", "CUC":"L", "CUG":"L", "CUU":"L", 

                "GAA":"E", "GAC":"D", "GAG":"E", "GAU":"D", 
                "GCA":"A", "GCC":"A", "GCG":"A", "GCU":"A", 
                "GGA":"G", "GGC":"G", "GGG":"G", "GGU":"G", 
                "GUA":"V", "GUC":"V", "GUG":"V", "GUU":"V", 

                "UAA":"_", "UAC":"Y", "UAG":"_", "UAU":"T", 
                "UCA":"S", "UCC":"S", "UCG":"S", "UCU":"S", 
                "UGA":"_", "UGC":"C", "UGG":"W", "UGU":"C", 
                "UUA":"L", "UUC":"F", "UUG":"L", "UUU":"F"}

    protein_seq = ''
    for n in range(0, len(sequence), 3):
        if sequence[n:n+3] in codon2aa:
            protein_seq += codon2aa[sequence[n:n+3]]
    return protein_seq


ids, sequences = [], []
n = -1
with open('dna.fasta') as fh:
    for line in fh:
        line = line.strip()
        if line[0] == '>':
            id = line.split()[0][1:]
            ids.append(id)
            sequences.append('')
            n += 1
        else:
            sequences[n] += line


for id, seq in zip(ids, sequences):
    print('>'+id)
    rna = transcribe(seq)
    protein = translate_rna(rna)
    print(protein)

File: homework1.py (for WouterDeCoster)

from itertools import groupby

def transcribe(sequence):
    return sequence.replace('T', 'U')

def translate_rna(s):
    codon2aa = {"AAA":"K", "AAC":"N", "AAG":"K", "AAU":"N", 
                "ACA":"T", "ACC":"T", "ACG":"T", "ACU":"T", 
                "AGA":"R", "AGC":"S", "AGG":"R", "AGU":"S", 
                "AUA":"I", "AUC":"I", "AUG":"M", "AUU":"I", 

                "CAA":"Q", "CAC":"H", "CAG":"Q", "CAU":"H", 
                "CCA":"P", "CCC":"P", "CCG":"P", "CCU":"P", 
                "CGA":"R", "CGC":"R", "CGG":"R", "CGU":"R", 
                "CUA":"L", "CUC":"L", "CUG":"L", "CUU":"L", 

                "GAA":"E", "GAC":"D", "GAG":"E", "GAU":"D", 
                "GCA":"A", "GCC":"A", "GCG":"A", "GCU":"A", 
                "GGA":"G", "GGC":"G", "GGG":"G", "GGU":"G", 
                "GUA":"V", "GUC":"V", "GUG":"V", "GUU":"V", 

                "UAA":"_", "UAC":"Y", "UAG":"_", "UAU":"T", 
                "UCA":"S", "UCC":"S", "UCG":"S", "UCU":"S", 
                "UGA":"_", "UGC":"C", "UGG":"W", "UGU":"C", 
                "UUA":"L", "UUC":"F", "UUG":"L", "UUU":"F"}

    l = [codon2aa.get(s[n:n+3], 'X') for n in range(0, len(s), 3)]
    return "".join(l)


with open('dna.fasta') as h:
    faiter = (x[1] for x in groupby(h, lambda l: l[0] == ">"))
    for header in faiter:
        header = next(header)[1:].strip()
        seq = "".join(s.strip() for s in next(faiter))
        print(header)
        print(translate_rna(transcribe(seq)))

Result:

>seq1 
MLMIGMGR_MREMRX
>seq2 
MR__M
>seq3 
MX

ADD COMMENT • link 8.9 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

    if sequence[n:n+3] in codon2aa:
        protein_seq += codon2aa[sequence[n:n+3]]

I would do this with a try-except block, now it's a potential silent error...

And why don't you translate the sequences while looping over the input file? No need to store the data in lists...

ADD REPLY • link 8.9 years ago by WouterDeCoster 48k

0

Entering edit mode

You are right. But hey, it's a homework, right? I would never write it as I did here.

ADD REPLY • link 8.9 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Good point, agreed. This code needs more biopython.

ADD REPLY • link 8.9 years ago by WouterDeCoster 48k

0

Entering edit mode

I edited my answer. What do you think?

ADD REPLY • link 8.9 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Looks pretty and probably gets the job done nicely. If you don't mind, some nitpicking comments just because I have 5 minutes until my RNA is ready for the next step in my protocol:

I would use return(protein_seq) to be python3 - compatible.
I would parse the fasta file using Biopython SeqIO
I would use Biopython translate (which also works on DNA, avoiding the transcription step)
I would directly print the result without assigning (unnecessary assignments take time and memory). Difference will be very very small in this case but I'm being pedantic here

This block

rna = transcribe(seq)
protein = translate_rna(rna)
print(protein)

could be

print(translate_rna(transcribe(seq)))

ADD REPLY • link 8.9 years ago by WouterDeCoster 48k

2

Entering edit mode

Thanks! Good points except the first one :). I think return value is the recommended way in Python2-3 (if you are only returning 1 value). After all, return is a language construct, not a function. Even if you want to return a tuple, there's no need for any parenthesis. Interesting thing, parsing FASTA sequences using Bio.SeqIO is slow (a lot of if statements are performed).

ADD REPLY • link 8.9 years ago by Andrzej Zielezinski 11k

1

Entering edit mode

Right, I was wrong about return I see. But I don't like the way it looks without parenthesis ;-(

I can imagine that SeqIO is quite slow since it probably doesn't make assumptions about the expected fasta-format but instead checks everything carefully. After all, incorrect user-input is the biggest enemy of every tool. Perhaps your own fasta-parser is faster and fine, but to be 100% sure I would probably sacrifice a bit of execution speed (a comparison could be interesting).

I love how we got carried away after this homework question. Passionate about Python!

ADD REPLY • link 8.9 years ago by WouterDeCoster 48k

0

Entering edit mode

Is it possible to have the code to transcribe within the code to translate?

ADD REPLY • link 6.9 years ago by mito_ninja • 0

score 2 · Answer 2 · 2016-12-07

2

Entering edit mode

8.9 years ago

WouterDeCoster 48k

This is a homework question so try harder ;-) I formatted your code for readability, please use the 101010 button next time.

The error message states exactly what's wrong: local variable 'protein_seq' referenced before assignment It even gives you the line where things went wrong. Python's error messages are great, take your time to read them and you'll solve 99% of errors.

ADD COMMENT • link 8.9 years ago by WouterDeCoster 48k

0

Entering edit mode

Ok I created an variable (protein_seq) containing an empty string in the translate_dna function. When I run the program, it does not transcribe and translate the sequence. Am I calling the function incorrectly? Thanks.

seq = ""
with open('dna.fasta.py') as f_in:
    for line in f_in.readlines():
        if line[0] == '>':
            seq += (line.strip('>') + '     ')      
        else:
            transcribe(line)
            translate_dna(line)
            seq += (line)
print(seq)

ADD REPLY • link 8.9 years ago by oki4 ▴ 10

0

Entering edit mode

The transcribe and translate function are called correctly but you don't use their results.

seq = ""
with open('dna.fasta.py') as f_in:
    for line in f_in.readlines():
        if line[0] == '>':
            seq += (line.strip('>') + '     ')      
        else:
            rna = transcribe(line)
            protein = translate_dna(rna)
            seq += (protein)
print(seq)

You just kept using "line" but it wasn't changed since you didn't use what was returned by the functions

ADD REPLY • link 8.9 years ago by WouterDeCoster 48k