Perl Script - For 3Rd Codon G Or C ?
2
6
Entering edit mode
14.4 years ago
Ads-Osu ▴ 60

So I am trying to write a perl script that gives me the number of Gs or Cs in the third codon position. I have started using a bioperl package but cannot get it to output the name of the sequences in the FASTA.

Also, I have been unsuccessful in extracting the third position thus far. I have got to the point of separating out three base pairs at a time. Here is my code:

#!/usr/bin/perl 
use Bio::SeqIO;
$seqio_obj = Bio::SeqIO->new( -file => "./all_ORFS.fasta" , '-format' => 'Fasta');

 while ($seq_obj = $seqio_obj->next_seq){
    #print the sequence
    $seq1 = $seq_obj->seq,"\n";
    $gc = 0;
    foreach ($seq1){
        @chunk = unpack("A3" x (length($seq1)/3), $seq1);
        while ($chunk){
            if(/\w{2}(\w)/){
                if($1 = "g"){
                    $gc + 1;
                }
            print $gc;
            }
        }
        #print join("\n",@chunk)."\n";    
    }
}
perl codon bioperl sequence translation • 4.8k views
ADD COMMENT
0
Entering edit mode

Well, for starters... $1 = "g" is not the same as $1 == 'g'. I'm pretty sure that $gc + 1 is not the right syntax either. And if $chunk { isn't going to work either, because I'm pretty sure you haven't initialized $chunk. If you started your script with use strict that would be caught.

ADD REPLY
8
Entering edit mode
14.4 years ago
Neilfws 49k

Good to see that you are using Bioperl for this task. Rule 1: don't reinvent the wheel. Here is one solution. It assumes that your input sequence length is a multiple of 3 and contains only codons, beginning with the start codon.

#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;

my $fasta = Bio::SeqIO->new(-file => "orf1.fa", -format => "fasta");

while(my $seq = $fasta->next_seq) {
  # should really check here that $seq is as expected
  my @output = $seq->id;
  my %codons;
  for(my $i = 3; $i <= $seq->length; $i+=3) {
    my $codon3 = $seq->subseq($i, $i);
       $codons{$codon3} += 1;
  }
  foreach my $base(sort keys %codons) {
    push(@output, $codons{$base});
  }
  print join("\t", @output), "\n";
}

First, we read in a fasta file (orf1.fa) and loop through each sequence, using the Bio::SeqIO next_seq() method. We get the ID of the sequence (that's the part directly following the ">" in the fasta file) using $seq->id and store it as the first element of array @output.

Next, we loop over the sequence 3 bases at a time, starting with the 3rd base. We use the subseq() method, specifying start = end, to extract only base 3 for each codon. We store each base as a key in the hash %codons and increment the value for that key each time we see that base.

Finally we loop through the keys of %codons, sorting in the order A, C, G, T and push the count for each base onto @output. Then we can print out the sequence ID and the four counts by joining the array elements with a tab. Testing with the coding region of this sequence saved as file orf1.fa gives this output:

nirS    33    298    222    44
ADD COMMENT
6
Entering edit mode
14.4 years ago
brentp 24k

this line:

if($1 = "g"){

should probably be:

if($1 eq "g"){

but you can simplify by doing:

if (/\w\wg/){
    $gc += 1;
 }

Also, I dont have a feel for speed in perl, but it might be faster to just increment a number by 3 and test if substr($seq, $number, 1) eq 'g'

Also note that where your comment says something about printing, you're not actually call print.

ADD COMMENT

Login before adding your answer.

Traffic: 1607 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6