Beginner in Python- need in assistance translating DNA into protein sequence
2
0
Entering edit mode
8.0 years ago
oki4 ▴ 10

Homework Problem: Write a Python program that translates a DNA sequences into a protein sequences. In other words, your script should convert all codons into amino acids. For this problem, the input file is in FASTA format with at least one sequence. You may NOT assume that the sequence length is always a multiple of 3 or that sequences will have the same length. Use the codon2aa map given in the code examples above. You will first have to transcribe the DNA sequence to RNA before being able to use it.

Program so far:

def transcribe(sequence):
    for ch in seq:
        rna_seq = seq.replace('T', 'U')
        return(rna_seq)

def translate_dna(sequence):
    codon2aa = {"AAA":"K", "AAC":"N", "AAG":"K", "AAU":"N", 
                "ACA":"T", "ACC":"T", "ACG":"T", "ACU":"T", 
                "AGA":"R", "AGC":"S", "AGG":"R", "AGU":"S", 
                "AUA":"I", "AUC":"I", "AUG":"M", "AUU":"I", 

                "CAA":"Q", "CAC":"H", "CAG":"Q", "CAU":"H", 
                "CCA":"P", "CCC":"P", "CCG":"P", "CCU":"P", 
                "CGA":"R", "CGC":"R", "CGG":"R", "CGU":"R", 
                "CUA":"L", "CUC":"L", "CUG":"L", "CUU":"L", 

                "GAA":"E", "GAC":"D", "GAG":"E", "GAU":"D", 
                "GCA":"A", "GCC":"A", "GCG":"A", "GCU":"A", 
                "GGA":"G", "GGC":"G", "GGG":"G", "GGU":"G", 
                "GUA":"V", "GUC":"V", "GUG":"V", "GUU":"V", 

                "UAA":"_", "UAC":"Y", "UAG":"_", "UAU":"T", 
                "UCA":"S", "UCC":"S", "UCG":"S", "UCU":"S", 
                "UGA":"_", "UGC":"C", "UGG":"W", "UGU":"C", 
                "UUA":"L", "UUC":"F", "UUG":"L", "UUU":"F"}

    ##the find method will give you the first index position (int) of the codon you're referring to 
    start_codon = sequence.find('ATG')

    ##You create a string of the starting nucleotide (A in this case) and rest of the sequence 
    seq_start = sequence[int(start_codon):]
    ##This selects the first nucleotide in codon/starting with with zero, selects for multiples of 3; n are multiples of 3 starting with 0
    for n in (0, len(seq_start), 3):
        if seq_start[n:n+3] in codon2aa:
            protein_seq += codon2aa[seq_start[n:n+3]]
    return protein_seq


seq = ''
with open('dna.fasta.py') as f_in:
    for line in f_in.readlines():
        if line[0] == '>':
            seq += (line.strip('>') + '     ')      
        else:
            transcribe(line)
            translate_dna(line)
            seq += (line)
print(seq)

Error message I receive: Traceback (most recent call last): File "/Users/Oyinade/Documents/bnfo135/Homework2.py", line 52, in <module> translate_dna(line) File "/Users/Oyinade/Documents/bnfo135/Homework2.py", line 42, in translate_dna return protein_seq UnboundLocalError: local variable 'protein_seq' referenced before assignment

I don't what I'm doing wrong my code, can someone let me know if I'm heading in the correct decision. Thanks.

python • 43k views
ADD COMMENT
0
Entering edit mode

The last part of my program is in the wrong format, I don't why it did that but here it is correctly formatted:

seq = ""
with open('dna.fasta.py') as f_in:
    for line in f_in.readlines():
        if line[0] == '>':
            seq += (line.strip('>') + '     ')      
        else:
            transcribe(line)
            translate_dna(line)
            seq += (line)
print(seq)
ADD REPLY
0
Entering edit mode

Try doing this inside a Try Except block and handle the error in Except block.

I found an implementation with web interface here. DNA to Protein Converter using Python 3 and Django 2.1

ADD REPLY
5
Entering edit mode
8.0 years ago

File: dna.fasta

>seq1
ATGCTGATGATAGGTATGGGTA
GATAGATGAGAGAGATGAGAAT
>seq2
ATGCGATGATAGATG
>seq3
ATGC

File: homework.py

def transcribe(sequence):
    rna_seq = sequence.replace('T', 'U')
    return(rna_seq)

def translate_rna(sequence):
    codon2aa = {"AAA":"K", "AAC":"N", "AAG":"K", "AAU":"N", 
                "ACA":"T", "ACC":"T", "ACG":"T", "ACU":"T", 
                "AGA":"R", "AGC":"S", "AGG":"R", "AGU":"S", 
                "AUA":"I", "AUC":"I", "AUG":"M", "AUU":"I", 

                "CAA":"Q", "CAC":"H", "CAG":"Q", "CAU":"H", 
                "CCA":"P", "CCC":"P", "CCG":"P", "CCU":"P", 
                "CGA":"R", "CGC":"R", "CGG":"R", "CGU":"R", 
                "CUA":"L", "CUC":"L", "CUG":"L", "CUU":"L", 

                "GAA":"E", "GAC":"D", "GAG":"E", "GAU":"D", 
                "GCA":"A", "GCC":"A", "GCG":"A", "GCU":"A", 
                "GGA":"G", "GGC":"G", "GGG":"G", "GGU":"G", 
                "GUA":"V", "GUC":"V", "GUG":"V", "GUU":"V", 

                "UAA":"_", "UAC":"Y", "UAG":"_", "UAU":"T", 
                "UCA":"S", "UCC":"S", "UCG":"S", "UCU":"S", 
                "UGA":"_", "UGC":"C", "UGG":"W", "UGU":"C", 
                "UUA":"L", "UUC":"F", "UUG":"L", "UUU":"F"}

    protein_seq = ''
    for n in range(0, len(sequence), 3):
        if sequence[n:n+3] in codon2aa:
            protein_seq += codon2aa[sequence[n:n+3]]
    return protein_seq


ids, sequences = [], []
n = -1
with open('dna.fasta') as fh:
    for line in fh:
        line = line.strip()
        if line[0] == '>':
            id = line.split()[0][1:]
            ids.append(id)
            sequences.append('')
            n += 1
        else:
            sequences[n] += line


for id, seq in zip(ids, sequences):
    print('>'+id)
    rna = transcribe(seq)
    protein = translate_rna(rna)
    print(protein)

File: homework1.py (for WouterDeCoster)

from itertools import groupby

def transcribe(sequence):
    return sequence.replace('T', 'U')

def translate_rna(s):
    codon2aa = {"AAA":"K", "AAC":"N", "AAG":"K", "AAU":"N", 
                "ACA":"T", "ACC":"T", "ACG":"T", "ACU":"T", 
                "AGA":"R", "AGC":"S", "AGG":"R", "AGU":"S", 
                "AUA":"I", "AUC":"I", "AUG":"M", "AUU":"I", 

                "CAA":"Q", "CAC":"H", "CAG":"Q", "CAU":"H", 
                "CCA":"P", "CCC":"P", "CCG":"P", "CCU":"P", 
                "CGA":"R", "CGC":"R", "CGG":"R", "CGU":"R", 
                "CUA":"L", "CUC":"L", "CUG":"L", "CUU":"L", 

                "GAA":"E", "GAC":"D", "GAG":"E", "GAU":"D", 
                "GCA":"A", "GCC":"A", "GCG":"A", "GCU":"A", 
                "GGA":"G", "GGC":"G", "GGG":"G", "GGU":"G", 
                "GUA":"V", "GUC":"V", "GUG":"V", "GUU":"V", 

                "UAA":"_", "UAC":"Y", "UAG":"_", "UAU":"T", 
                "UCA":"S", "UCC":"S", "UCG":"S", "UCU":"S", 
                "UGA":"_", "UGC":"C", "UGG":"W", "UGU":"C", 
                "UUA":"L", "UUC":"F", "UUG":"L", "UUU":"F"}

    l = [codon2aa.get(s[n:n+3], 'X') for n in range(0, len(s), 3)]
    return "".join(l)


with open('dna.fasta') as h:
    faiter = (x[1] for x in groupby(h, lambda l: l[0] == ">"))
    for header in faiter:
        header = next(header)[1:].strip()
        seq = "".join(s.strip() for s in next(faiter))
        print(header)
        print(translate_rna(transcribe(seq)))

Result:

>seq1 
MLMIGMGR_MREMRX
>seq2 
MR__M
>seq3 
MX
ADD COMMENT
0
Entering edit mode
    if sequence[n:n+3] in codon2aa:
        protein_seq += codon2aa[sequence[n:n+3]]

I would do this with a try-except block, now it's a potential silent error...

And why don't you translate the sequences while looping over the input file? No need to store the data in lists...

ADD REPLY
0
Entering edit mode

You are right. But hey, it's a homework, right? I would never write it as I did here.

ADD REPLY
0
Entering edit mode

Good point, agreed. This code needs more biopython.

ADD REPLY
0
Entering edit mode

I edited my answer. What do you think?

ADD REPLY
0
Entering edit mode

Looks pretty and probably gets the job done nicely. If you don't mind, some nitpicking comments just because I have 5 minutes until my RNA is ready for the next step in my protocol:

  • I would use return(protein_seq) to be python3 - compatible.
  • I would parse the fasta file using Biopython SeqIO
  • I would use Biopython translate (which also works on DNA, avoiding the transcription step)
  • I would directly print the result without assigning (unnecessary assignments take time and memory). Difference will be very very small in this case but I'm being pedantic here

This block

rna = transcribe(seq)
protein = translate_rna(rna)
print(protein)

could be

print(translate_rna(transcribe(seq)))
ADD REPLY
2
Entering edit mode

Thanks! Good points except the first one :). I think return value is the recommended way in Python2-3 (if you are only returning 1 value). After all, return is a language construct, not a function. Even if you want to return a tuple, there's no need for any parenthesis. Interesting thing, parsing FASTA sequences using Bio.SeqIO is slow (a lot of if statements are performed).

ADD REPLY
1
Entering edit mode

Right, I was wrong about return I see. But I don't like the way it looks without parenthesis ;-(

I can imagine that SeqIO is quite slow since it probably doesn't make assumptions about the expected fasta-format but instead checks everything carefully. After all, incorrect user-input is the biggest enemy of every tool. Perhaps your own fasta-parser is faster and fine, but to be 100% sure I would probably sacrifice a bit of execution speed (a comparison could be interesting).

I love how we got carried away after this homework question. Passionate about Python!

ADD REPLY
0
Entering edit mode

Is it possible to have the code to transcribe within the code to translate?

ADD REPLY
2
Entering edit mode
8.0 years ago

This is a homework question so try harder ;-) I formatted your code for readability, please use the 101010 button next time.

The error message states exactly what's wrong: local variable 'protein_seq' referenced before assignment It even gives you the line where things went wrong. Python's error messages are great, take your time to read them and you'll solve 99% of errors.

ADD COMMENT
0
Entering edit mode

Ok I created an variable (protein_seq) containing an empty string in the translate_dna function. When I run the program, it does not transcribe and translate the sequence. Am I calling the function incorrectly? Thanks.

seq = ""
with open('dna.fasta.py') as f_in:
    for line in f_in.readlines():
        if line[0] == '>':
            seq += (line.strip('>') + '     ')      
        else:
            transcribe(line)
            translate_dna(line)
            seq += (line)
print(seq)
ADD REPLY
0
Entering edit mode

The transcribe and translate function are called correctly but you don't use their results.

seq = ""
with open('dna.fasta.py') as f_in:
    for line in f_in.readlines():
        if line[0] == '>':
            seq += (line.strip('>') + '     ')      
        else:
            rna = transcribe(line)
            protein = translate_dna(rna)
            seq += (protein)
print(seq)

You just kept using "line" but it wasn't changed since you didn't use what was returned by the functions

ADD REPLY

Login before adding your answer.

Traffic: 2074 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6