Question

How To Backtranslate Multiple Protein Sequences In A File To Nucleotide Sequences Using Perl Programming

1

Entering edit mode

14.6 years ago

Payal ▴ 150

here i have written the script using associative array. but unable to get the answer.

open(FH,"seq.txt");
@seq1 = <FH>;
shift(@seq1);
$seq = join('',@seq1);
print"$seq";
%pro=('I'=>'AUU','I'=>'AUC','I'=>'AUA','L'=>'CUU','L'=>'CUC','L'=>'CUA','L'=>'CUG','L'=>'UAA',
'L'=>'UUG','V'=>'GUU','V'=>'GUC','V'=>'GUA','V'=>'GUG','F'=>'UUU','F'=>'UUC','M'=>'AUG','C'=>'UGU',
'C'=>'UGC','A'=>'GCU','A'=>'GCC','A'=>'GCA','A'=>'GCG','G'=>'GGU','G'=>'GGC','G'=>'GGA','G'=>'GGG',
'P'=>'CCU','P'=>'CCC','P'=>'CCA','P'=>'CGG','T'=>'ACU','T'=>'ACC','T'=>'ACA','T'=>'ACG','S'=>'TCT',
'S'=>'TCC','S'=>'TCA','S'=>'TCG','S'=>'AGT','S'=>'AGC','Y'=>'TAT','Y'=>'TAC',
'W'=>'TGG','Q'=>'CAA','Q'=>'CAG','N'=>'AAT','N'=>'AAC','H'=>'CAT','H'=>'CAC',
'E'=>'GGA','E'=>'GAG','D'=>'GAT','D'=>'GAC','K'=>'AAA','K'=>'AAG',
'R'=>'CGT','R'=>'CGC','R'=>'CGA','R'=>'CGG','R'=>'AGA','R'=>'AGG',
'*'=>'TAA','*'=>'TAG','*'=>'TGA');
$p='';
$ln=length($seq);
$j=$ln;
for($i=0,$k=0;$i<$ln,$k<$j;$k++)
{
$fra[$k]=substr($seq,$i,1);
$i=$i++;
}
for($k=0;$k<$j;$k++)
{
if(exists($pro{$fra[$k]}))
{
$p=$p.$pro{$fra[$k]};
}
else
{
$p=$p.x;
}
}
print"$p";

perl nucleotide sequence • 9.1k views

ADD COMMENT • link updated 14.6 years ago by Sam ▴ 20 • written 14.6 years ago by Payal ▴ 150

Ram · Answer 1 · 2010-12-29

the protein back-translation is not a straightforward thing to do, since there are several codons that code for the same amino acids. although it is fairly simple in perl to translate codons into amino acids, due to the "n to 1" nature of the genetic code assumptions have to be made. I am pretty sure you will find more information on this googling around, but I found an interesting site that explains some approximations that are to be performed on such a task. you may also see how a known protein sequence handling web tool deals with this issue, stating this in its description:

Reverse Translate accepts a protein sequence as input and uses a codon usage table to generate a DNA sequence representing the most likely non-degenerate coding sequence. A consensus sequence derived from all the possible codons for each amino acid is also returned.

by the way, this code you use tries to create several times the same hash keys, and it would result in a overwriting useless effort:

%pro=('I'=>'AUU','I'=>'AUC','I'=>'AUA','L'=>'CUU','L'=>'CUC','L'=>'CUA',
      'L'=>'CUG','L'=>'UAA','L'=>'UUG','V'=>'GUU','V'=>'GUC','V'=>'GUA',
      'V'=>'GUG','F'=>'UUU','F'=>'UUC','M'=>'AUG','C'=>'UGU','C'=>'UGC',
      'A'=>'GCU','A'=>'GCC','A'=>'GCA','A'=>'GCG','G'=>'GGU','G'=>'GGC',
      'G'=>'GGA','G'=>'GGG','P'=>'CCU','P'=>'CCC','P'=>'CCA','P'=>'CGG',
      'T'=>'ACU','T'=>'ACC','T'=>'ACA','T'=>'ACG','S'=>'TCT','S'=>'TCC',
      'S'=>'TCA','S'=>'TCG','S'=>'AGT','S'=>'AGC','Y'=>'TAT','Y'=>'TAC',
      'W'=>'TGG','Q'=>'CAA','Q'=>'CAG','N'=>'AAT','N'=>'AAC',
      'H'=>'CAT','H'=>'CAC','E'=>'GGA','E'=>'GAG','D'=>'GAT',
      'D'=>'GAC','K'=>'AAA','K'=>'AAG','R'=>'CGT','R'=>'CGC',
      'R'=>'CGA','R'=>'CGG','R'=>'AGA','R'=>'AGG','*'=>'TAA',
      '*'=>'TAG','*'=>'TGA');

Ram · Answer 2 · 2010-12-29

If you don't mind having IUPAC single character ambiguity symbols in your back translated sequence, you can do something like this:

use strict;
use warnings;
use Bio::Phylo::Matrices::Datatype;

#########################################################
# This is just an example, protein seq: 
# http://www.ncbi.nlm.nih.gov/nuccore/U53575.1
my $protseq = << 'PROTEIN';
MTNIRKNHPLMKIMNSSFIDLPTPSNISSWWNFGSLLGACLALQ
IITGLFLAMHYTADTTTAFSSVTHICRDVNYGWVIRYLHANGASMFFLCLFIHIGRGL
YYGSFTLSETWNIGIILLFTVMATAFMGYVLPWGQMSFWGATVITNLLSAIPYIGTNL
VEWIWGGFSVDKATLTRFFAFHFILPFIIATLVMVHLLFLHETGSNNPLGTSSDSDKI
PFHPYYTIKDLLGLMLLILLTMTLVLFSPDLLGDPDNYSPANPLSTPPHIKPEWYFLF
AYAILRSIPNKLGGVLALVFSILILAIIPMLHTAKQRSMVFRPLSQYLFWILTADLFI
LTWIGGQPVEHPFITIGQMASILYFSLILIMMPVVSLIENKMLKW
PROTEIN
$protseq =~ s/\n//g;    
my @protseq = split //, $protseq;
#########################################################

# notice how this works around the clashes that Jorge
# mentions: now multiple codons are assigned to the 
# same key
my %pro = (
    'I' => [ qw'AUU AUC AUA' ], 
    'L' => [ qw'CUU CUC CUA CUG UAA UUG' ],
    'V' => [ qw'GUU GUC GUA GUG' ],
    'F' => [ qw'UUU UUC' ],
    'M' => [ qw'AUG' ],
    'C' => [ qw'UGU UGC' ],
    'A' => [ qw'GCU GCC GCA GCG' ],
    'G' => [ qw'GGU GGC GGA GGG' ],
    'P' => [ qw'CCU CCC CCA CGG' ],
    'T' => [ qw'ACU ACC ACA ACG' ],
    'S' => [ qw'TCT TCC TCA TCG AGT AGC' ],
    'Y' => [ qw'TAT TAC' ],
    'W' => [ qw'TGG' ],
    'Q' => [ qw'CAA CAG' ],
    'N' => [ qw'AAT AAC' ],
    'H' => [ qw'CAT CAC' ],
    'E' => [ qw'GGA GAG' ],
    'D' => [ qw'GAT GAC' ],
    'K' => [ qw'AAA AAG' ],
    'R' => [ qw'CGT CGC CGA CGG AGA AGG' ],
    '*' => [ qw'TAA TAG TGA' ],
);

# this requires Bio::Phylo, from http://search.cpan.org/dist/Bio-Phylo
my $dna = Bio::Phylo::Matrices::Datatype->new('dna');

# this will be the end result, the backtranslated DNA sequence
my $dnaseq = '';

# we loop over the amino acids in a single sequence
for my $aa ( @protseq ) {

    # we look up the codons from that amino acid in the %pro hash
    my @codons = @{ $pro{$aa} };

    # we iterate over the codon positions
    for my $pos ( 0 .. 2 ) {

        # this will hold the nucleotide symbols for the focal codon position
        my %ambig;

        # we iterate over the focal position for all codons
        for my $codon ( @codons ) {

            # fetch the focal position of the codon
            my $nuc = substr $codon, $pos, 1;

            # we need to repace U with T because that's how Bio::Phylo
            # maps things
            $nuc = 'T' if $nuc eq 'U';

            # the value 1 of the hash doesn't matter, we just want unique keys
            $ambig{$nuc} = 1;
        }

        # @states will hold 1 to 4 unique symbols, 
        # depending on the degeneracy of the site
        my @states = keys %ambig;

        # look up the IUPAC ambiguity symbol for the set of states
        my $symbol = $dna->get_symbol_for_states(@states);

        # append symbol to our growing backtranslated sequence
        $dnaseq .= $symbol;
    }
}

# we map T back to U (because that seems to be what you're after)
$dnaseq =~ tr/T/U/;

# done!
print $dnaseq, "\n";

score 2 · Answer 3 · 2011-07-29

Back translation is not an easy subject if you intend to use the obtained DNA sequence to do DNA sequence alignment and thus if codon usage is of importance. Please look at the nice work at http://bioinfo.lifl.fr/path/.
As already stated, "do not reinvent the wheel" is a very good rule, because you will got a solution quickly and you can TRUST it because tests were done and many people already used it (and may have complain or find bugs). For processing a few sequences, use EMBOSS web portals or http://www.bioinformatics.org/sms2/index.html and look at interesting sites http://www.kazusa.or.jp/codon/.
In my case, I just wanted to get DNA sequences to append to DNA sequences of interest and all those will be translated during the real processing. So, I don't really care about codon usage. And as I need an independent code, I rewrote a short script, but I tested it against good references.
The Perl codes given here contains errors, a mix of U and T is just the top of the iceberg. Hopefully it fooled me only 5 minutes, and I started from scratch. BE VERY CAREFUL WHEN USING INTERNET CODES, even if it looks nicely formatted.

score 1 · Answer 4 · 2011-03-09

1

Entering edit mode

14.4 years ago

Jvb ▴ 50

You could write a script to call the EMBOSS backtranseq or backtransambig tools.

ADD COMMENT • link 14.4 years ago by Jvb ▴ 50