Hello all, I am looking to implement fuzzy matching in Perl using the module
use String::Approx
I need 3 mismatches. However, I am unable to implement the modifiers appropriately to get the desired outcome.
Input string:CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT
Pattern ATTCTGGA
Output 6 7 26 27
So basically, I need the locations where the pattern matches input string approximately ( 3 mismatches)
My Code is as under:
use 5.014;
use warnings;
use String::Approx qw/amatch aindex/;
$_ = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT";
my $pat = "ATTCTGGA";
my $len_in = length $_;
my $len_pat = length $pat;
my $i = 0;
my @pattern;
while($i <= $len_in-$len_pat){
my $substr = substr $_, $i, $len_pat;#Extract k-mer of length = $len_pat from position $i
my $posn = index $_,$substr, $i;#Extract start position of k-mer
push @pattern, $substr if amatch ("$pat", ["S3"], $substr); #Push into @pattern if the extracted k-mer matches $pat by 3 substitutions
push @posn, $posn if amatch ("$pat", ["S3"], $substr); #Push into @posn the position, if the extracted k-mer matches $pat by 3 substitutions
$i++
}
say "@pattern";
say "@posn";
I would be grateful if it could be pointed out where I am going wrong. Right now, I get no output at all.
Now that is a new one for me. Nice!
Many thanks Kenosis and exactly what was needed! This method is completely new for me! Would you be able to please explain this line of code
my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
. Specifically what does()
perform? I understand^
is comparing $s1 and $s2. Also, it would be great if a resource explaining\x00
could be pointed out. I really would love to understand the logic. It's a brilliant code!You're most welcome, nilan666! Am glad it worked for you.
Here's an explaination of that line:
=()=
": right expression's evaluated in list context, w/results assigned to empty array "()="; that array is evaluated in scalar context "=()" w/results assigned to$diffCount
Erm, Goatse operator.... wow. DO NOT GOOGLE THIS TERM WITHOUT THE 'OPERATOR' TERM ON THE END. Images would be deemed NSFL (Not Safe For Life)
But this is an awesome bit of code, thank you.
So if I understand correctly, using
^
, if$s1
and$s2
are identical\x00
would be true and$diffCount
would be 0; conversely when$s1
and$s2
are different,^\x00
stores the the character difference which gets assigned to$diffcount
. This is awesome stuff!Hi nilan666.
Perl's XOR on strings, e.g.,
$s1 ^ $s2
, returns \x00 for each identical character-pair at position n in those two strings. So,'Perl' ^ 'Perl'
would return '\x00\x00\x00\x00', since those two strings are identical. The regex/[^\x00]/g
globally matches only those characters in the XOR results which are not \x00. The result of globally matching the non-\x00 characters in the XOR results is assigned to$diffcount
above, as it indicates the number of different characters when the two strings are compared character-by-character.Ah now I understand it much more clearly!! I reiterate that this code is sheer brilliance! By the way, I have used your subroutine in another code, but I am stuck on a particular logical challenge of that code. I would be grateful for your time if you could have a look here :Perl: Ignoring repetitive k-mers in hash construction