Reverse translate to all possible codons
4.6 years ago
kk1990 ▴ 10

Hi!

I want to reverse translate a short protein sequence (around 50 aa) into DNA, using all possible codons at every position. For example, since Glu is encoded by both GAA and GAG, I want sequence variants containing each. The result should be the collection of all possible DNA sequences that could encode this protein (regardless of codon usage).

Is there a tool like this available? If not, what is the fastest way to obtain such a file?

Thank you!

translation codon proteinsequence • 4.0k views
4.6 years ago

Interesting question - I never thought of it that way. I ended up with something like this:

import itertools

d = {
    'A': ['GCA', 'GCC', 'GCG', 'GCT'],
    'C': ['TGC', 'TGT'],
    'D': ['GAC', 'GAT'],
    'E': ['GAA', 'GAG'],
    'F': ['TTC', 'TTT'],
    'G': ['GGA', 'GGC', 'GGG', 'GGT'],
    'H': ['CAC', 'CAT'],
    'I': ['ATA', 'ATC', 'ATT'],
    'K': ['AAA', 'AAG'],
    'L': ['CTA', 'CTC', 'CTG', 'CTT', 'TTA', 'TTG'],
    'M': ['ATG'],
    'N': ['AAC', 'AAT'],
    'P': ['CCA', 'CCC', 'CCG', 'CCT'],
    'Q': ['CAA', 'CAG'],
    'R': ['AGA', 'AGG', 'CGA', 'CGC', 'CGG', 'CGT'],
    'S': ['AGC', 'AGT', 'TCA', 'TCC', 'TCG', 'TCT'],
    'T': ['ACA', 'ACC', 'ACG', 'ACT'],
    'V': ['GTA', 'GTC', 'GTG', 'GTT'],
    'W': ['TGG'],
    'Y': ['TAC', 'TAT'],
    '_': ['TAA', 'TAG', 'TGA'],
}

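def generator(protein):
    # One list of synonymous codons per residue, in sequence order
    codon_choices = [d[aa] for aa in protein]
    # The Cartesian product yields every codon combination, one at a time
    for comb in itertools.product(*codon_choices):
        yield "".join(comb)


if __name__ == '__main__':
    import sys
    # Protein sequence in one-letter code ('_' stands for a stop codon)
    protein_seq = sys.argv[1]
    for dna_seq in generator(protein_seq):
        print(dna_seq)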

Run:

python script.py MKS

Output:

ATGAAAAGC
ATGAAAAGT
ATGAAATCA
ATGAAATCC
ATGAAATCG
ATGAAATCT
ATGAAGAGC
ATGAAGAGT
ATGAAGTCA
ATGAAGTCC
ATGAAGTCG
ATGAAGTCT
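Since the question asks for a file, here is a minimal sketch of the same idea writing FASTA records instead of printing bare sequences. It assumes the standard genetic code and uses `*` rather than `_` for stop. The name `write_fasta` and the `>protein_N` header format are just illustrative choices.

```python
import itertools

# Standard genetic code in the usual TCAG ordering (TTT, TTC, TTA, TTG, TCT, ...)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Invert the table: amino acid -> alphabetically sorted list of its codons
AA_TO_CODONS = {}
for aa, codon in zip(AMINO, itertools.product(BASES, repeat=3)):
    AA_TO_CODONS.setdefault(aa, []).append("".join(codon))
for codons in AA_TO_CODONS.values():
    codons.sort()

def write_fasta(protein, handle):
    """Write every DNA variant of `protein` to an open text handle.

    Returns the number of records written. Unknown residues raise KeyError.
    """
    choices = [AA_TO_CODONS[aa] for aa in protein]
    count = 0
    for count, combo in enumerate(itertools.product(*choices), start=1):
        handle.write(f">{protein}_{count}\n{''.join(combo)}\n")
    return count
```

To use it, open an output file in text mode and pass the handle, for example `write_fasta("MKS", fh)` inside a `with open("variants.fasta", "w") as fh:` block. The records are streamed one at a time, so memory is not the limit, but disk space and run time still are for anything but short peptides.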
4.6 years ago
JC 13k

Well, it is possible: you can use a simple BioPerl/BioPython script to get all codons for each amino acid. But the main question is, do you really want that?

If you want to use the result to search a DNA dataset, it is better to translate the DNA into protein and search at the protein level.
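A rough sketch of such a protein-level search, assuming the standard genetic code and DNA that contains only A, C, G and T (anything else raises `KeyError`). The names `find_protein`, `translate` and `reverse_complement` are hypothetical helpers written for this example, not BioPerl or BioPython functions.

```python
import itertools

# Standard genetic code in the usual TCAG ordering (TTT, TTC, TTA, TTG, TCT, ...)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    # Translate complete codons only; a trailing partial codon is ignored
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def find_protein(dna, protein):
    """Return (strand, frame, residue_index) for every match in the six frames."""
    dna = dna.upper()
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            translated = translate(seq[frame:])
            pos = translated.find(protein)
            while pos != -1:
                hits.append((strand, frame, pos))
                pos = translated.find(protein, pos + 1)
    return hits

print(find_protein("CCATGAAGTCAGG", "MKS"))  # [('+', 2, 0)]
```

This finds the peptide whichever synonymous codons were used, with six translations of the target instead of one search per DNA variant.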

Besides, the number of combinations grows very quickly. Leucine, for example, is encoded by 6 codons, so a run of 50 leucines alone would generate 6**50 (roughly 8 × 10^38) combinations.
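To put a number on this before generating anything, the variant count is simply the product of the per-residue codon counts. A minimal sketch, assuming the standard genetic code. `DEGENERACY` and `count_dna_variants` are names made up for this example, and `math.prod` needs Python 3.8 or later.

```python
import math

# Number of synonymous codons per amino acid in the standard genetic code
DEGENERACY = {
    'A': 4, 'C': 2, 'D': 2, 'E': 2, 'F': 2, 'G': 4, 'H': 2, 'I': 3,
    'K': 2, 'L': 6, 'M': 1, 'N': 2, 'P': 4, 'Q': 2, 'R': 6, 'S': 6,
    'T': 4, 'V': 4, 'W': 1, 'Y': 2,
}

def count_dna_variants(protein):
    """Number of distinct DNA sequences that encode `protein`."""
    return math.prod(DEGENERACY[aa] for aa in protein)

print(count_dna_variants("MKS"))                # 12
print(count_dna_variants("L" * 50) == 6 ** 50)  # True
```

Checking this count first is a cheap way to decide whether full enumeration is feasible for a given sequence.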

4.6 years ago
Mensur Dlakic ★ 28k

I suggest you try SwiftLib. Its goal is to generate a small library of degenerate codons within a certain diversity limit, but it will probably do what you want if you increase the library size.
