Question

Perl Script - For 3Rd Codon G Or C ?

6

Entering edit mode

14.4 years ago

Ads-Osu ▴ 60

So I am trying to write a perl script that gives me the number of Gs or Cs in the third codon position. I have started using a bioperl package but cannot get it to output the name of the sequences in the FASTA.

Also, I have been unsuccessful in extracting the third position thus far. I have got to the point of separating out three base pairs at a time. Here is my code:

#!/usr/bin/perl 
use Bio::SeqIO;
$seqio_obj = Bio::SeqIO->new( -file => "./all_ORFS.fasta" , '-format' => 'Fasta');

 while ($seq_obj = $seqio_obj->next_seq){
    #print the sequence
    $seq1 = $seq_obj->seq,"\n";
    $gc = 0;
    foreach ($seq1){
        @chunk = unpack("A3" x (length($seq1)/3), $seq1);
        while ($chunk){
            if(/\w{2}(\w)/){
                if($1 = "g"){
                    $gc + 1;
                }
            print $gc;
            }
        }
        #print join("\n",@chunk)."\n";    
    }
}

perl codon bioperl sequence translation • 4.8k views

ADD COMMENT • link updated 6.2 years ago by Ram 44k • written 14.4 years ago by Ads-Osu ▴ 60

0

Entering edit mode

Well, for starters... $1 = "g" is not the same as $1 == 'g'. I'm pretty sure that $gc + 1 is not the right syntax either. And if $chunk { isn't going to work either, because I'm pretty sure you haven't initialized $chunk. If you started your script with use strict that would be caught.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 6.2 years ago by swbarnes2 14k

Ram · Answer 1 · 2010-07-14

Good to see that you are using Bioperl for this task. Rule 1: don't reinvent the wheel. Here is one solution. It assumes that your input sequence length is a multiple of 3 and contains only codons, beginning with the start codon.

#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;

my $fasta = Bio::SeqIO->new(-file => "orf1.fa", -format => "fasta");

while(my $seq = $fasta->next_seq) {
  # should really check here that $seq is as expected
  my @output = $seq->id;
  my %codons;
  for(my $i = 3; $i <= $seq->length; $i+=3) {
    my $codon3 = $seq->subseq($i, $i);
       $codons{$codon3} += 1;
  }
  foreach my $base(sort keys %codons) {
    push(@output, $codons{$base});
  }
  print join("\t", @output), "\n";
}

First, we read in a fasta file (orf1.fa) and loop through each sequence, using the Bio::SeqIO next_seq() method. We get the ID of the sequence (that's the part directly following the ">" in the fasta file) using $seq->id and store it as the first element of array @output.

Next, we loop over the sequence 3 bases at a time, starting with the 3rd base. We use the subseq() method, specifying start = end, to extract only base 3 for each codon. We store each base as a key in the hash %codons and increment the value for that key each time we see that base.

Finally we loop through the keys of %codons, sorting in the order A, C, G, T and push the count for each base onto @output. Then we can print out the sequence ID and the four counts by joining the array elements with a tab. Testing with the coding region of this sequence saved as file orf1.fa gives this output:

nirS    33    298    222    44

Ram · Answer 2 · 2010-07-14

this line:

if($1 = "g"){

should probably be:

if($1 eq "g"){

but you can simplify by doing:

if (/\w\wg/){
    $gc += 1;
 }

Also, I dont have a feel for speed in perl, but it might be faster to just increment a number by 3 and test if substr($seq, $number, 1) eq 'g'

Also note that where your comment says something about printing, you're not actually call print.