perl script for multifasta sequence analysis
2
0
Entering edit mode
9.6 years ago

I have a perl script which gives the sum result of all the sequences present in the multifasta (.txt format) input file. please help me in modifying the code in such a way that it gives result separately for each sequence.

Input sequence:

>seq1
GADD
>seq2
MMGAAL

Observed output:

GA=2
AL= 1
MM= 1
.
.
.
.
.
Sum=4
Abs=19

Expected output:

Seq1:
GA= 1
AL = 0
.
.
.
Sum=1
Abs=21

Seq2:
GA=1
AL=1
MM=1
.
.
.
Sum=3
Abs=19

Code:

#!/usr/bin/perl -w
print "Please type the filename of the DNA sequence MMta: ";
$str= <STDIN>;
chomp $str;
unless ( open(DNAFILE, $str) )
{
print "Cannot open file \"$str\"\n\n";
exit;
}
@str = <DNAFILE>;
$str=join('',@str);
print"\n DNA:\n
$str\n";

$GA = 0;
$AL = 0;
$MM = 0;
$DE = 0;
$DV = 0;
$VD = 0;
$DW = 0;
$QD = 0;
$SD = 0;
$DD = 0;
$ED = 0;
$DY = 0;
$VE = 0;
$EN = 0;
$II = 0;
$KE = 0;
$NV = 0;
$VP = 0;
$FV = 0;
$SS = 0;
$WK = 0;
$KK = 0;
$abs=0;

while ($str =~ /GA/ig)
{$GA++}
while ($str =~ /AL/ig)
{$AL++}
while ($str =~ /MM/ig)
{$MM++}
while ($str =~ /DE/ig)
{$DE++}
while ($str =~ /DV/ig)
{$DV++}
while ($str =~ /VD/ig)
{$VD++}
while ($str =~ /DW/ig)
{$DW++}
while ($str =~ /QD/ig)
{$QD++}
while ($str =~ /SD/ig)
{$SD++}
while ($str =~ /DD/ig)
{$DD++}
while ($str =~ /ED/ig)
{$ED++}
while ($str =~ /DY/ig)
{$DY++}
while ($str =~ /VE/ig)
{$VE++}
while ($str =~ /EN/ig)
{$EN++}
while ($str =~ /II/ig)
{$II++}
while ($str =~ /KE/ig)
{$KE++}
while ($str =~ /NV/ig)
{$NV++}
while ($str =~ /VP/ig)
{$VP++}
while ($str =~ /FV/ig)
{$FV++}
while ($str =~ /SS/ig)
{$SS++}
while ($str =~ /WK/ig)
{$WK++}
while ($str =~ /KK/ig)
{$KK++}
$total= "$GA+$AL+$MM+$DE+$DV+$VD+$DW+$QD+$SD+$DD+$ED+$DY+$VE+$EN+$II+$KE+$NV+$VP+$FV+$SS+$WK+$KK";
while ($total=~ /0/ig)
{$abs++}
$sum= $GA+$AL+$MM+$DE+$DV+$VD+$DW+$QD+$SD+$DD+$ED+$DY+$VE+$EN+$II+$KE+$NV+$VP+$FV+$SS+$WK+$KK;
print "GA = $GA\n";
print "AL = $AL\n";
print "WK = $MM\n";
print "DE = $DE\n";
print "VP = $DV\n";
print "VD = $VD\n";
print "DW = $DW\n";
print "QD = $QD\n";
print "SD = $SD\n";
print "DD = $DD\n";
print "ED = $ED\n";
print "DY = $DY\n";
print "VE = $VE\n";
print "EN = $EN\n";
print "II = $II\n";
print "KE = $KE\n";
print "NV = $NV\n";
print "VP = $VP\n";
print "FV = $FV\n";
print "SS = $SS\n";
print "WK = $WK\n";
print "KK = $KK\n";
print "sum=$sum\n";
print "abs=$abs";

$outputfile = "countbase.txt";
unless ( open(COUNTBASE, ">$outputfile") ) {
print "Cannot open file \"$outputfile\" to write
to!!\n\n";
exit;
}
print COUNTBASE "$GA
$AL
$MM
$DE
$DV
$VD
$DW
$QD
$SD
$DD
$ED
$DY
$VE
$EN
$II
$KE
$NV
$VP
$FV
$SS
$WK
$KK
$sum
$abs";
close(COUNTBASE);


exit;
sequence • 4.6k views
ADD COMMENT
0
Entering edit mode

What do you mean by "sum result of all the sequences"

ADD REPLY
0
Entering edit mode
Sum count of the dipeptide i.e GA, AL.... Present in all the sequences
ADD REPLY
3
Entering edit mode
9.6 years ago
thackl ★ 3.0k

Because it's weekend. It is fully functional. I hope you can follow the code, otherwise I'm happy to explain... (Fasta parser is based on Fasta-Parser.pl)

#!/usr/bin/env perl
use warnings;
use strict;

# Usage:
#   perl dinuc.pl FILE.FA

my @dipeps_of_interest = qw(
                               GA
                               AL
                               MM
                               DE
                               DV
                               VD
                               DW
                               QD
                               SD
                               DD
                               ED
                               DY
                               VE
                               EN
                               II
                               KE
                               NV
                               VP
                               FV
                               SS
                               WK
                               KK
                       );


open(FASTA, $ARGV[0]) or die $!;

while( my $fa = next_fasta_record(\*FASTA) ){
    my %fa = fasta_record2hash($fa);

    my %dipeps = count_dipeps($fa{seq});
    # or
    # my %dipeps = count_dipeps_fast($fa{seq});

    # report dipeps counts
    print "$fa{id}\n";
    my $sum = 0;
    my $abs = 0;

    foreach (@dipeps_of_interest) {
        if (exists $dipeps{$_}) {
            $sum+= $dipeps{$_};
            print "$_\t$dipeps{$_}\n";
        }else {
            $abs++;
            print "$_\t0\n";            
        }
    }
    print "sum\t$sum\n";
    print "abs\t$abs\n";
}

close FASTA;


# count dipeps of sequence in a hash
sub count_dipeps{
    my $seq = shift;
    my %dipeps;
    while($seq =~ /(?=(\w\w))/g){ # read to chars at a time
        $dipeps{$1}++;
    }
    return %dipeps;
}

# count dipeps of sequence in a hash
# faster than regex and works on seqs >36kbp
sub count_dipeps_fast{
    my $seq = shift;
    my %dipeps;
    my $length = length($seq)-2;
    foreach(unpack("(A2X1)".$length."A2", $seq)){
        $dipeps{$_}++
    }
    return %dipeps;
}

# read a single fasta record from a file handle into a string
sub next_fasta_record{
    my $fh = shift;
    local $/="\n>"; # change record separator form "\n" to "\n>"
    my $fa = <$fh>; # read a fasta record
    return unless defined($fa); # end of file
    chomp($fa); # remove "\n>" from end of record
    $fa =~ s/^>//; # (stupid way to) fix first record in file
    return '>'.$fa;
}

# split a single fasta record into a hash
sub fasta_record2hash{
    my ($head, $seq) = split("\n", $_[0], 2); # split header and seq
    my ($id, $desc) = split(/\s/, $head, 2); # split id and desc
    $seq =~ tr/\n//d; # remove newlines
    my %fa = (
              id => $id,
              desc => $desc,
              seq => $seq,
              );
    return %fa;
}
ADD COMMENT
0
Entering edit mode

Sir,Thank you for the help.

I placed the input file (file.fa) in the same directory used the exact code as given but it is shwoing the msg

use of uninitialized value in open at C:\users\prosen\desktop\code.pl line 8.
no such file or directory at" C:\users\prosen\desktop\code.pl line 8 "

Line 8 in the code:

open(FASTA, $ARGV[0]) or die $!;

Sir please help me

ADD REPLY
0
Entering edit mode
  • Is that windows?
  • are you using strawberry perl?
  • Did you open your command prompt in the directory where the files were located? Finally did you notice this? perl dinuc.pl FILE.FA
ADD REPLY
0
Entering edit mode

Yes sir i am using windows system,

No, active per,

yes sir in the same directory,

Renamed the input file byfile.fa

ADD REPLY
0
Entering edit mode

Location and name of the fasta file don't matter, but you need to actually put the path to the fasta file in the command line call to the script:

perl dinuc.pl C:\the\path\to\my\protein.fa

Or at least the name if you first cd to the directory

perl dinuc.pl protein.fa
ADD REPLY
0
Entering edit mode

If this does not work, change

open(FASTA, $ARGV[0]) or die $!;

to

open(FASTA, "C:/path/to/my/protein.fa") or die $!;
ADD REPLY
0
Entering edit mode

Sir this code is very helpful can be used for DNA and protein both.

But sir i want to count only the 22 dipeptides., their sum and the number of di-peptide absent.......Is there any way to do so...Thanks in advance

ADD REPLY
1
Entering edit mode

Okay, still weekend ;). I modified the above code slightly. You can now specify dipeptides of interest, and sum and count are calculated

ADD REPLY
1
Entering edit mode
9.6 years ago
venu 7.1k
  • Separate each header and its associated sequence into two separate string ($id, $seq)
  • Count the occurence (GA, AL etc.) with $GA = () = $seq =~ /GA/gi;

Simple modification of this will work

ADD COMMENT
0
Entering edit mode

Sir, i used the reference code but it is showing the msg

use of uninitialized value in open at C:\users\prosen\desktop\code.pl line 7

Line 7 in the code:

open IN, $ARGV[0];

please help.

ADD REPLY

Login before adding your answer.

Traffic: 1618 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6