Hello all,
I am trying to find the most frequent k-mers with "n" mismatches. But I'm running into a problem while creating a hash table of such k-mers.
Sample Input:ACGTTGCATGTCGCATGATGCATGAGAGCT
k-mer size:4
mismatch:1
Output: GATG ATGC ATGT
My code's logic is as follows:
1. Extract all kmers of given size
2. For each kmer, find other kmers with given mismatch range
3. Print kmers which have maximum number of fuzzy matches
I am specifically stuck on point 2 above. I am using a while
loop to iterate over all kmers. Problem is I need to count a particular kmer only once eg :
ACGT CGTT GTTG TTGC TGCA GCAT CATG ATGT TGTC GTCG TCGC CGCA GCAT CATG **ATGA** TGAT GATG ATGC TGCA GCAT CATG **ATGA** TGAG GAGA AGAG GAGC AGCT
a kmer like ATGA should be counted just once (it appears twice as I've highlighted above with ** for ease of viewing). While constructing the hash, it gets counted twice and all the fuzzy matches also get doubled. Similarly GCAT occurs thrice above, and while constructing the hash, it gets counted thrice.
I need a way to skip if the kmer already exists in the hash ( kmer being the key).
The brilliant code for fuzzy matching via subroutine was provided by Kenosis here: Perl :Fuzzy Matching Problem
My code is as under:
use 5.014;
use warnings;
use List::Util qw/max/;
$_ = "ACGTTGCATGTCGCATGATGCATGAGAGCT";
my $len_seq = length $_;
my $len_pat = 4;
my $i = 0;
my @all_kmer;
#Extract all kmers of given size
while($i <= $len_seq - $len_pat){
my $substr = substr $_, $i, $len_pat;
push @all_kmer, $substr;
$i++
}
#Initialize hash of kmer counts
my %counts;
my $j = 0;
while ($all_kmer[$j]){
foreach (@all_kmer){
push( @{$counts{$all_kmer[$j]}},$_ ) if strDiffMaxDelta ($all_kmer[$j], $_, 1);
}
$j++;
#Skip if key already exists #Need Proper Code for this
if(exists $counts{$all_kmer[$j]} ){
say "Key already exists for $all_kmer[$j]";
$j=$j+3; # Need corrections / a way to skip to next element in @all_kmer
}
}
#Count fuzzy matches for each key in hash
my @fuzzy_kmer_counts;
foreach my $key (keys %counts){
push @fuzzy_kmer_counts, scalar @{$counts{$key}};
}
#Determine maximum number of fuzzy matches
my $max = max(@fuzzy_kmer_counts);
#Print key(s) which have maximum number of fuzzy matches
foreach my $key (keys %counts){
say "Key: $key\t" if scalar @{$counts{$key}} == $max;
print OUT $key,"\t" if scalar @{$counts{$key}} == $max;
}
###########################
#Subroutine
###########################
sub strDiffMaxDelta {
my ( $s1, $s2, $maxDelta ) = @_;
my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
return $diffCount <= $maxDelta;
}
Just to be clear, is the expected output from your script the "GATG ATGC ATGT" above, or something different?
Hi Kenosis, thanks for your comment. The expected output is indeed "GATG ATGC ATGT" for the given sample input of "ACGTTGCATGTCGCATGATGCATGAGAGCT ", Each of these kmers,with 1 mismatch. has 5 fuzzy matches. Other kmers have fewer fuzzy matches.
Hi Kenosis, Michael's code snippet solves the logic where I was stuck. I had to check the existence of the key before assigning the values in arrays.