Perl :Fuzzy Matching Problem
1
2
Entering edit mode
11.1 years ago
Neal ▴ 60

Hello all, I am looking to implement fuzzy matching in Perl using the module

use String::Approx

I need 3 mismatches. However, I am unable to implement the modifiers appropriately to get the desired outcome.

Input string:CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT
Pattern ATTCTGGA
Output 6 7 26 27

So basically, I need the locations where the pattern matches input string approximately ( 3 mismatches)

My Code is as under:

use 5.014;
use warnings;
use String::Approx qw/amatch aindex/;


$_ = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT";
my $pat = "ATTCTGGA";
my $len_in = length $_;
my $len_pat = length $pat;


my $i = 0;
my @pattern;


while($i <= $len_in-$len_pat){ 
    my $substr = substr $_, $i, $len_pat;#Extract k-mer of length = $len_pat from position $i
    my $posn = index  $_,$substr, $i;#Extract start position of k-mer

    push @pattern, $substr if amatch ("$pat", ["S3"], $substr); #Push into @pattern if the extracted k-mer matches $pat by 3 substitutions 
    push @posn, $posn if amatch ("$pat", ["S3"], $substr); #Push into @posn the position, if the extracted k-mer matches $pat by 3 substitutions 

    $i++
}

say "@pattern";
say "@posn";

I would be grateful if it could be pointed out where I am going wrong. Right now, I get no output at all.

perl • 6.0k views
ADD COMMENT
7
Entering edit mode
11.1 years ago
Kenosis ★ 1.3k

You can calculate the "3 mismatches" between strings using Perl's exclusive-or operator (^) between those two strings, as it returns \x00 for each matching pair of characters, and a different value for non-matching characters. Given this, consider the following solution:

use strict;
use warnings;
use 5.014;

my $seq = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT";
my $pat     = "ATTCTGGA";
my $len_in  = length $seq;
my $len_pat = length $pat;

my %posn_pattern;

for my $i ( 0 .. $len_in - $len_pat ) {

    #Extract k-mer of length = $len_pat from position $i
    my $substr = substr $seq, $i, $len_pat;

    #Save position/pattern if the extracted k-mer matches $pat by 3 or fewer substitutions
    $posn_pattern{$i} = $substr if strDiffMaxDelta( $pat, $substr, 3 );
}

say "$_ => $posn_pattern{$_}" for sort { $a <=> $b } keys %posn_pattern;

sub strDiffMaxDelta {
    my ( $s1, $s2, $maxDelta ) = @_;
    my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
    return $diffCount <= $maxDelta;
}

Results:

6 => AATCCAGA
7 => ATCCAGAA
26 => ATTTCGGG
27 => TTTCGGGA

The strDiffMaxDelta subroutine counts the non-\x00 characters which indicates the number of character differences between the two sent strings, and then returns the result of the comparison of that number against the max delta (in this case 3). Of course, you can use $diffCount == $maxDelta if you want to return strings which have exactly three mismatches--in case I've misunderstood your desired results.

Hope this helps!

ADD COMMENT
0
Entering edit mode

Now that is a new one for me. Nice!

ADD REPLY
0
Entering edit mode

Many thanks Kenosis and exactly what was needed! This method is completely new for me! Would you be able to please explain this line of code my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;. Specifically what does () perform? I understand ^ is comparing $s1 and $s2. Also, it would be great if a resource explaining \x00 could be pointed out. I really would love to understand the logic. It's a brilliant code!

ADD REPLY
1
Entering edit mode

You're most welcome, nilan666! Am glad it worked for you.

Here's an explaination of that line:

my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
                 ^         ^            ^      ^
                 |         |            |      |
                 |         |            |      + - Do so globally
                 |         |            + - Match the XOR results for non-\x00 which indicates a character-pair difference
                 |         + - XOR the two strings against each other
                 + - 'Goatse' operator "=()="
  • \x00 is the hexidecimal value for the NULL character, and is the XOR operation's result when the two strings' characters are identical.
  • Goatse operator "=()=": right expression's evaluated in list context, w/results assigned to empty array "()="; that array is evaluated in scalar context "=()" w/results assigned to $diffCount
ADD REPLY
1
Entering edit mode

Erm, Goatse operator.... wow. DO NOT GOOGLE THIS TERM WITHOUT THE 'OPERATOR' TERM ON THE END. Images would be deemed NSFL (Not Safe For Life)

ADD REPLY
0
Entering edit mode

But this is an awesome bit of code, thank you.

ADD REPLY
0
Entering edit mode

So if I understand correctly, using ^ , if $s1 and $s2 are identical \x00 would be true and $diffCount would be 0; conversely when $s1 and $s2 are different, ^\x00 stores the the character difference which gets assigned to $diffcount. This is awesome stuff!

ADD REPLY
1
Entering edit mode

Hi nilan666.

Perl's XOR on strings, e.g., $s1 ^ $s2, returns \x00 for each identical character-pair at position n in those two strings. So, 'Perl' ^ 'Perl' would return '\x00\x00\x00\x00', since those two strings are identical. The regex /[^\x00]/g globally matches only those characters in the XOR results which are not \x00. The result of globally matching the non-\x00 characters in the XOR results is assigned to $diffcount above, as it indicates the number of different characters when the two strings are compared character-by-character.

ADD REPLY
0
Entering edit mode

Ah now I understand it much more clearly!! I reiterate that this code is sheer brilliance! By the way, I have used your subroutine in another code, but I am stuck on a particular logical challenge of that code. I would be grateful for your time if you could have a look here :Perl: Ignoring repetitive k-mers in hash construction

ADD REPLY

Login before adding your answer.

Traffic: 1697 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6