Question

How to deal with gaps during translation with biopython

5.6 years ago

sunyeping ▴ 110

I need to translate aligned DNA sequences with biopython

from Bio.Seq import Seq

seq = Seq("tt-aaaatg")
seq.translate()  # fails on the gapped codon 'tt-'

Running this script raises an error: Bio.Data.CodonTable.TranslationError: Codon 'TT-' is invalid.

Is there a way to translate 'tt-' as X, so that the whole translated sequence is 'XKM'?

This would be very useful for translating aligned sequences. For example, suppose a set of aligned sequences is stored in a pandas DataFrame named "df":

import pandas as pd

df = pd.DataFrame([['A',Seq("tt-aaaatg")],['B',Seq("tttaaaatg")],['C',Seq("tttaaaatg")]],columns=['seqName','seq'])

print(df)

The df will be shown as:

   seqName                seq
        A                 Seq("tt-aaaatg")
        B                 Seq("tttaaaatg")
        C                 Seq("tttaaaatg")

If 'tt-' could be translated as "X", then the code:

df['prot'] = pd.Series([x.translate() for x in df.seq])

would give:

  seqName                          seq           prot
0       A           (t, t, -, a, a, a, a, t, g)  (X, K, M)
1       B           (t, t, t, a, a, a, a, t, g)  (F, K, M)
2       C           (t, t, t, a, a, a, a, t, g)  (F, K, M)

However, the current Biopython cannot translate "tt-" as "X"; it just throws an error. It seems that I would have to remove all gaps from the aligned sequences, translate them, and then realign the translated protein sequences.

How do you deal with such a problem? Thank you in advance.

Is there any reason that the gaps need to be translated? There is no gap in the actual sequence, so tt-aaaatg would have a frameshift in relation to tttaaaatg and code for LK instead of FKM.

Just for the sake of it, you could try to see what the biopython translation engine does with "N" in nucleic acid sequences and if that works, you could just replace the gaps with "N".

In that case, however, there is the question of what the translation engine does with "N" at a wobble position (e.g. all four GGN codons code for glycine: would GGN be translated as X or G?).
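To make the wobble question concrete, here is a minimal plain-Python sketch of one possible rule: replace each gap with "N", expand every "N" to all four bases, and emit the amino acid only when all expansions agree, otherwise "X". The function name translate_with_n and the hard-coded standard codon table are my own choices for illustration; this shows the idea and is not Biopython's implementation.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_with_n(dna):
    """Translate DNA after turning gaps into N (hypothetical helper).

    A codon containing N yields an amino acid only if every possible
    expansion of N encodes the same residue; otherwise it yields 'X'.
    Only A, C, G, T, N and '-' are handled; other letters raise KeyError.
    """
    dna = dna.upper().replace("-", "N")
    protein = []
    # Ignore a trailing incomplete codon, if any.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        # Each position contributes either all four bases (N) or itself.
        choices = (BASES if base == "N" else base for base in codon)
        residues = {CODON_TABLE["".join(c)] for c in product(*choices)}
        protein.append(residues.pop() if len(residues) == 1 else "X")
    return "".join(protein)


print(translate_with_n("tt-aaaatg"))  # XKM (TTN could be F or L)
print(translate_with_n("ggn"))        # G   (all four GGN codons are glycine)
```

Under this rule a gap at a wobble position can still produce a definite amino acid (gg- would become G), which may or may not be what you want for an alignment.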

Comment by cschu181 ★ 2.8k, 5.6 years ago

score 2 · Answer 1 · 2019-05-04

Hello! Have you tried looking at the documentation for the translate function? I believe this resource will help you troubleshoot.

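Specifically, for the problem you described, you should be able to specify the character that represents a gap. In your case, this would be: df['prot'] = pd.Series([x.translate(gap="-") for x in df.seq]). Note, though, that as far as I know the gap argument only applies to codons made up entirely of gap characters ('---' becomes '-'); a partially gapped codon such as 'tt-' would still raise a TranslationError.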

score 0 · Answer 2 · 2024-02-26

9 months ago

fournier.berlin ▴ 10

My approach is to convert any triplet containing at least one - into ---:

from Bio.Seq import Seq

seq = 'tt-aaaatg'

# Turn every codon that contains a gap into '---' (translated as '-' in recent Biopython)
seq = Seq(''.join([seq[i:i+3] if '-' not in seq[i:i+3] else '---' for i in range(0, len(seq), 3)]))
str(seq.translate()).replace("-", "X")

which outputs 'XKM' for your specific sequence.
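One caveat with this approach is that a codon that is entirely gaps in the alignment ('---') also ends up as 'X', so whole-codon gaps and partially gapped codons become indistinguishable. A possible variation, sketched below in plain Python, masks only the partially gapped codons with 'NNN' and leaves '---' untouched. The helper name mask_partial_gap_codons is hypothetical, and the sketch assumes the sequence is in frame from its first character.

```python
def mask_partial_gap_codons(dna, gap="-"):
    """Replace codons that mix gaps and bases with N's (hypothetical helper).

    Codons consisting only of gap characters are kept as they are, so the
    alignment columns stay intact; codons without gaps are unchanged.
    """
    masked = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        # Mask only when the codon has a gap but is not a full-gap codon.
        if gap in codon and codon != gap * 3:
            codon = "N" * len(codon)
        masked.append(codon)
    return "".join(masked)


print(mask_partial_gap_codons("tt-aaaatg"))  # NNNaaaatg
print(mask_partial_gap_codons("---aaaatg"))  # ---aaaatg
```

The masked string can then be handed to Biopython, e.g. Seq(mask_partial_gap_codons(s)).translate(gap="-"); as far as I know this gives 'XKM' for 'tt-aaaatg' and '-KM' for '---aaaatg', but check it against your Biopython version.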
