Question

how to compare 2 fasta sequences with perl

0

Entering edit mode

9.2 years ago

Lila M ★ 1.3k

Hello

I've started using perl a couples of month ago.

I need to parse a fasta file with 2 sequences. I can read them and export in another txt without the header. Now, I need to compare both: their length, their amino acids.. but I dont know how to do that, this is my code:

my $fasta_file = "shortAligForDist.txt";
my $outfile = "Output.txt";

open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";

my $fh;
open($fh, $fasta_file) or die "can't open $fasta_file: $!\n";

my %sequence_data;
while (read_fasta_sequence($fh, \%sequence_data)) {
    #print ">$sequence_data{header}\n$sequence_data{seq}\n\n";
    print "\n$sequence_data{seq}\n\n";
    print OUTFILE "\n$sequence_data{seq}\n\n";
}

sub read_fasta_sequence {

    my ($fh, $seq_info) = @_;
    $seq_info->{seq} = undef; # clear out previous sequence

    # put the header into place
    $seq_info->{header} = $seq_info->{next_header} if $seq_info->{next_header};

    my $file_not_empty = 0;
    while (<$fh>) {
        $file_not_empty = 1;
        next if /^\s*$/;  # skip blank lines
        chomp;

        if (/^>/) { # fasta header line
            my $h = $_;
            $h =~ s/^>//;

            if ($seq_info->{header}) {
                $seq_info->{next_header} = $h;
                return $seq_info;
            }
            else { # first time through only
                $seq_info->{header} = $h;
            }
        }
        else {
            s/\s+//;  # remove any white space
            $seq_info->{seq} .= $_;
        }
    }

    if ($file_not_empty) {
       return $seq_info;
    }
    else {
        # clean everything up
        $seq_info->{header} = $seq_info->{seq} = $seq_info->{next_header} = undef;
        return;
    }
}

perl sequence fasta • 3.5k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.2 years ago by Lila M ★ 1.3k

2

Entering edit mode

Your question is underspecified. Unless you fully describe what you want to accomplish, people can only offer vague hints. Please try harder to describe the intended input and output of the program. "Compare the amino acids of two sequences" does not really mean anything.

Also, this smells like a homework assignment. The point of homework is to figure it out for yourself, not to ask other people for the answer.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.2 years ago by Brian Bushnell 20k

0

Entering edit mode

A "homework" flag should exist to discourage answers to this kind of questions

ADD REPLY • link 9.2 years ago by Israel Barrantes ▴ 790

0

Entering edit mode

Hi is not a homework assignment. I am a PhD student and I'm trying to do that in order to perfor a program that avoid me to compare sequences in order to make my analysis more efficiency.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by Lila M ★ 1.3k

0

Entering edit mode

Use BioPerl, unless you have a valid reason to avoid it. If you stick to plain text processing, you'll stumble a lot.

ADD REPLY • link 9.2 years ago by Ram 44k

score 0 · Answer 1 · 2015-10-14

0

Entering edit mode

9.2 years ago

Ram 44k

I'd recommend you use BioPerl. Also, comparing length is easy, to compare the amino acids, you'd need to align the sequences. You can of course compare the amino acid composition if you're just doing this for practice.

For the former, looking into the length() function. For amino acid composition, you'll have to use either ~~associative arrays~~ hashes with counters or the substitution operation (s/<AA>/<blank>/g)with the length() function (which is a neat trick). These don't need BioPerl and can help you flex your barebone string manipulation skills.

ADD COMMENT • link 2.4 years ago by Ram 44k

0

Entering edit mode

Not to be picky, but it would be better to the term hash rather than associative array (with Perl). At a high level they are a similar concept but not the same and using that term with Perl is a bit dated (I think this term was used a long, long time ago). It will certainly be easier to find relevant help for term the term hash, that was my only concern.

ADD REPLY • link 9.2 years ago by SES 8.6k

0

Entering edit mode

Hmmm. I was gonna say associative array (hash), but I guess I forgot to add that. I think I caught on to the term when I was first introduced to Perl. Hash it is!

ADD REPLY • link 9.2 years ago by Ram 44k

Ram · Answer 2 · 2015-10-14

0

Entering edit mode

9.2 years ago

venu 7.1k

You can look at this script protein_stats.pl and write your own script / modify this script. It gives the percentage of each Amino acid and some other classes of Amino acids.

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.2 years ago by venu 7.1k

score 0 · Answer 3 · 2015-10-15

Comparing amino acid sequences may actually take a lot of work depending on if you want something beyond just length or composition. I recommend you check out the Bio::Tools::SeqStats module. It is very handy for these tasks. This will allow you to compare any number of different sequence features (amino acid composition, hydropathicity, molecular weight, length, etc.) with just a few lines of code.