Hi!
I'm trying to create a 20x20 Count matrix in perl for a Protein alignment. The matrix should be symmetric but I'm not getting good results. Does anyone knows how to do this right? I have wrote this so far:
Here's my Protein alignment:
>EOG50000C|AMELL_prerelease2|GB14042-PA
MA--------------------------------------LIHRVNINFGFVLNTFYCKC
IVNIKRCK----TEYNIKSQTNVKYGENITFGDMLKFRQTQAKKSVLIGV--PENYIYDM
EKYCLKHVNIKKILFYNTET----NRCFVLMELMTEKDVEIFRNMASCK----------I
NSNAHSVTPLFLY-----------------KVRNSSYNKPMTKKSIMFKPPNFDQIKES-
LKNKN--TISEQMIILHHLLKITDLDIRLRFFTAEQISYYLSRLFMNVSVIPFGSSVNGF
GQIGCDLDLLY-------------------------------RNEQKEFLDAISTIMKIC
IPGIDNIKKILEARVPIIKFFNQTTNMHCDLSSTNVIALHMSELLYTYGELDWRVKPLIC
TIRKWARNTNITKDI-SGQWITNFSLTLLILFYLQIK-----NILPSLKVIKYHIKN---
S-------------------------RRSWSND-----WKNSINYNNDSLHNLLYGFFEY
YSIFDFKMQAICIKEGKPRVKKD--SSPLYIYNPFDESLNVSKNITIFELTRLIDHFRKA
LQILIETS----------------EKDAIIKLIKINSISSNSIEMDDCK----DIEKSTE
IDNIEKDSDTFNQKES--------LNESINKINVNVLETMEKNIIK--------------
------------------------------------------------------------
---------------------------------------
>EOG50000C|PHUMA_1.2|PHUM400420-PA
M-----------------------------------------------------------
------------------------------------------------------------
---------------YKFET---NSMKYVLLEMENGEQLSNLINHSKSHGGGGGGGKRGK
FSILNSLGGHFGYFNDSIVYDHEKIIPTTTVMENYDYYYSNVNNND--DNSKTNSIRKL-
LKEYKTLDLDNQIRLLHDSYKLNDLGYRLRYMMWKQIQNLVNVLLPTHSVYPFGSAVNGL
GDVTSDLDLVVLRDDSTS--------------FRWHKNPGVDINSVQNDLSLLTNVMTKT
SVGCAKVVFIKQAKVPIIKYKHSFTGVECDVSMHQTEALKMSQILFALSNFDPRIKPLIF
FIKIWARELRLTREQ-PGPTITNFSLILLTIFFLQQKGSSHTEILPPFDLFPLRDTD---
ANDDVMT-MENTDEKLECILNNFKCHVTSSSSSSTSSPTSSPSAPNGKSTVELLLLFYKF
YSVFNFNSYGCCLNTGQIIRKNNFKNYGMYIKNPFCSIYNVSKNVSLSEVNKFQFYCQIA
ATLLEESETTISRNSTTRTAVVDESKFLLNFLNYNSYANELMAKKLSNNNNINSNVKFAD
LKSIF-------------------------------------------------------
------------------------------------------------------------
---------------------------------------
`
Here´s my Code:
use warnings;
use Data::Dumper;
my $datei = shift @ARGV;
$datei = '/Users/Maxi/Desktop/LernenPerl/alignmentpro.txt';
open(my $fastd,'<', $datei) or die "die datei $datei wurde nicht geöffnet: $!\n";
my %sequences = ();
my $header = '';
my $reverse='';
my $dnaheaders= 0;
my $length=0;
while (my $line = <$fastd>) {
chomp $line;
if ($line =~ /^>/) {
$header = $line;
$header =~ s/>//g;
}
else {
$sequences{$header} .= $line
}
}
my $seq1= $sequences{'EOG50000C|PHUMA_1.2|PHUM400420-PA'};
my $seq2= $sequences{'EOG50000C|AMELL_prerelease2|GB14042-PA'};
my $matrix={};
LINE: for (my $i=0; $i< length $seq1; $i++){
my ($substr1,$substr2) = (substr($seq1,$i,1),substr($seq2, $i,1));
next LINE if ("$substr1$substr2" =~ /-/);
if ($substr1 eq $substr2){
$matrix->{$substr1}->{$substr2}++;
}elsif ($substr1 ne $substr2){
$matrix->{$substr1}->{$substr2} += 1/2;
$matrix->{$substr2}->{$substr1} += 1/2;
}
}
print " A C D E F G H I K L M N P Q R S T V W Y\n";
foreach my $key (sort keys %$matrix) {
print $key." ";
foreach my $key2 (sort keys %{$matrix->{$key}}) {
printf("%3d ",$matrix->{$key}->{$key2});
}
print "\n"x2;
}
print Dumper $matrix;
The matrix should see like this (not for this alignment.. just an example):
4372 121 113 162 96 140 324 387 40 136 212 219 72 50 213 596 300 6 39 428
121 2748 100 70 20 188 170 88 69 48 85 679 38 26 46 129 110 10 30 59
113 100 1727 322 16 92 166 190 89 37 42 209 17 18 38 207 160 4 40 45
162 70 322 3092 8 127 732 154 68 18 36 176 11 14 67 204 120 4 24 28
96 20 16 8 680 6 6 20 6 30 30 12 10 22 6 61 54 7 16 83
140 188 92 127 6 1367 346 58 70 35 100 310 52 18 48 128 103 6 30 66
324 170 166 732 6 346 3378 138 63 40 68 411 29 13 93 212 192 6 28 104
387 88 190 154 20 58 138 5788 30 19 36 136 14 17 46 230 78 8 12 40
40 69 89 68 6 70 63 30 1059 14 39 92 16 54 34 48 40 13 96 27
136 48 37 18 30 35 40 19 14 3217 858 72 198 144 24 55 156 12 37 1300
212 85 42 36 30 100 68 36 39 858 5061 124 410 331 62 104 133 20 92 593
219 679 209 176 12 310 411 136 92 72 124 3054 48 23 84 216 172 4 32 84
72 38 17 11 10 52 29 14 16 198 410 48 923 82 12 38 64 5 24 134
50 26 18 14 22 18 13 17 54 144 331 23 82 2191 14 38 39 59 367 101
213 46 38 67 6 48 93 46 34 24 62 84 12 14 2757 147 66 2 10 54
596 129 207 204 61 128 212 230 48 55 104 216 38 38 147 2326 490 13 28 100
300 110 160 120 54 103 192 78 40 156 133 172 64 39 66 490 2808 4 31 299
6 10 4 4 7 6 6 8 13 12 20 4 5 59 2 13 4 582 48 14
39 30 40 24 16 30 28 12 96 37 92 32 24 367 10 28 31 48 1612 42
428 59 45 28 83 66 104 40 27 1300 593 84 134 101 54 100 299 14 42 3784
hi! thank you for your answer... yes i think you are right but how do i replace all the zeros with the numbers i need? i mean if i initialize the matrix like this:
and i introduce only that code in my code i get a 20x20 matrix with zeros + what i had before
thank you for your help!
One way to work around this is, instead of using hashes you could use an array (
@matrix
). And instead of having letters A - V, you could use numbers from 0-19. And use a hash called%aminos
where each key is an amino acid and value is a number. For e.g.,$aminos{"A"} = 0
,$aminos{"C"} = 1
, etc.So your if statement in
LINE:
will be something like$matrix->[$aminos{$substr1}]->[$aminos{$substr2}]++;
And you can print using the double for loop like you used in the above comment. This way you can only deal with numbers and you can still stick with your code logic.
Did you see what I did there?