I think I can see what you are trying to accomplish but it is confusing because the numbers in your example output are not correct (consistent with the comments above). Though, please correct me if you were trying for something else. Be aware that this code works for your pairwise comparison of sequences of the same length with no gaps, but you will need to update this code if you starting making more complicated comparisons. Your code was actually very close.
The most helpful thing I can add is to take a look at the data structure (your %matrix) to see what you would expect. Here is some code to print the matrix:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dump qw(dd);
my $seq1='ACGATCTGCAGATGCGATGAGGTAGCTGATGCTGAGAGTACGCCGAGGATC';
my $seq2='ACGATCAGCAAAAGACGGCCTTCTCATACTACGTAGTTAGCGATAGCTAAG';
my $matrix = {};
for (my $i = 0; $i < length $seq1; $i++){
$matrix->{substr($seq1,$i,1)}->{substr($seq2,$i,1)}++;
}
dd $matrix;
Running that code shows your data:
$ perl biostars72505.pl
{
A => { A => 6, C => 2, G => 3, T => 2 },
C => { A => 3, C => 5, G => 1, T => 1 },
G => { A => 4, C => 4, G => 5, T => 5 },
T => { A => 4, C => 1, G => 2, T => 3 },
}
There are many ways to do this, but I like the flattened structure from Data::Dump and I think it is the nicest method for your wrists (less typing). If this were a really large file I were parsing, I would save this data structure and read it in later. That way you don't have to keep reading the file and constructing this data. The full code is different in that you have to control the order of keys by sorting, and change the placement of the print statements.
#!/usr/bin/env perl
use strict;
use warnings;
my $seq1='ACGATCTGCAGATGCGATGAGGTAGCTGATGCTGAGAGTACGCCGAGGATC';
my $seq2='ACGATCAGCAAAAGACGGCCTTCTCATACTACGTAGTTAGCGATAGCTAAG';
my $matrix = {};
for (my $i=0; $i < length $seq1; $i++){
$matrix->{substr($seq1,$i,1)}->{substr($seq2,$i,1)}++;
}
print " A C G T\n";
for my $k (sort keys %$matrix) {
print $k;
for my $k2 (sort keys %{$matrix->{$k}}) {
print " ",$matrix->{$k}->{$k2};
}
print "\n";
}
This prints your matrix (notice it agrees with the data structure we printed above):
$ perl biostars72505.pl
A C G T
A 6 2 3 2
C 3 5 1 1
G 4 4 5 5
T 4 1 2 3
This code will be portable and fast but my advice would be to look into BioPerl if you plan to build up your methods to larger comparisons. I couldn't find a BioPerl method to do this exact calculation, though I'm sure it's feasible with a little coding. However, I did find some related methods that may be helpful:
Could you please explain why you need this?
Are you trying to find number of substitutions from seq1 to seq2? And in the output A to A is 3. Does it mean that, the number of times A remains the same from seq1 to seq2 is 3? Though, that's not the case in the above example. A -> A is definitely more than 3.
hi... im trying to count the number of matches and mismataches and to display it in a matrix 4 x 4 matrix like in the example above.
Also can you please explain how you can get A-A as 3 instead of 6?