I need to translate aligned DNA sequences with Biopython:
from Bio.Seq import Seq
# Bio.Alphabet (generic_dna) is not needed here and was removed in Biopython 1.78
seq = Seq("tt-aaaatg")
seq.translate()
Running this script raises the error: Bio.Data.CodonTable.TranslationError: Codon 'TT-' is invalid
Is there a way to translate 'tt-' as X, so that the whole translated sequence becomes 'XKM'?
This would be very useful for translating aligned sequences. For example, a set of aligned sequences is stored in a pandas DataFrame named "df":
import pandas as pd
df = pd.DataFrame([['A',Seq("tt-aaaatg")],['B',Seq("tttaaaatg")],['C',Seq("tttaaaatg")]],columns=['seqName','seq'])
print(df)
The df will be shown as:
seqName seq
A Seq("tt-aaaatg")
B Seq("tttaaaatg")
C Seq("tttaaaatg")
If 'tt-' could be translated as "X", then with the code:
df['prot'] = pd.Series([x.translate() for x in df.seq])
we would get:
seqName seq prot
0 A (t, t, -, a, a, a, a, t, g) (X, K, M)
1 B (t, t, t, a, a, a, a, t, g) (F, K, M)
2 C (t, t, t, a, a, a, a, t, g) (F, K, M)
However, the current Biopython cannot translate "tt-" as "X"; it just throws an error. It seems to me that I have to remove all gaps from the aligned sequences, translate them, and then realign the translated protein sequences.
How do you deal with such a problem? Thank you in advance.
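One possible workaround is to translate codon by codon and map any partially gapped codon to "X". The following is only a minimal sketch in plain Python, not Biopython's own API: the function name `translate_gapped` and its parameters are made up for illustration, and it assumes the standard genetic code (NCBI table 1) and an alignment length that is a multiple of three.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_gapped(nucleotides, gap="-", unknown="X"):
    """Translate an aligned DNA string codon by codon (hypothetical helper).

    An all-gap codon stays a gap; a partially gapped or otherwise
    unrecognised codon becomes `unknown`.
    """
    seq = str(nucleotides).upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    protein = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        if codon == gap * 3:
            protein.append(gap)
        else:
            # Codons such as 'TT-' are not in the table and fall back to `unknown`.
            protein.append(CODON_TABLE.get(codon, unknown))
    return "".join(protein)


print(translate_gapped("tt-aaaatg"))
print(translate_gapped("tttaaaatg"))
```

Because the helper calls `str()` on its input, it should also accept the Seq objects stored in the DataFrame, for example `df['prot'] = [translate_gapped(s) for s in df.seq]`. Note that it returns plain strings rather than Seq objects.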
Is there any reason that the gaps need to be translated? There is no gap in the actual sequence, so tt-aaaatg would have a frameshift in relation to tttaaaatg and code for LK instead of FKM.
Just for the sake of it, you could try to see what the Biopython translation engine does with "N" in nucleic acid sequences, and if that works, you could just replace the gaps with "N".
In that case, however, there is the question of what the translation engine does with "N" at a wobble position (e.g. all four GGN codons code for glycine - would GGN be translated as X or G?).
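To make the wobble question concrete, here is a small plain-Python sketch of how a translator that resolves ambiguity would treat gaps replaced by "N". It does not show what Biopython itself does, which should be checked against the installed version. The helper names are hypothetical and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon_with_n(codon):
    """Expand every N into all four bases (hypothetical helper).

    Returns the amino acid if all expansions agree, otherwise 'X'.
    """
    options = [BASES if base == "N" else base for base in codon.upper()]
    amino_acids = {
        CODON_TABLE.get("".join(candidate), "X")
        for candidate in product(*options)
    }
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


def translate_gaps_as_n(nucleotides):
    """Replace gaps with N, then translate codon by codon (hypothetical helper)."""
    seq = str(nucleotides).upper().replace("-", "N")
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(
        translate_codon_with_n(seq[i:i + 3]) for i in range(0, len(seq), 3)
    )


print(translate_gaps_as_n("tt-aaaatg"))  # TTN is F or L, so ambiguous
print(translate_gaps_as_n("gg-aaaatg"))  # GGN is glycine whatever N is
```

Under this scheme a gap at a fourfold-degenerate third position, such as "gg-", would come out as a real amino acid rather than X, which may or may not be what you want for an alignment.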