Question

Error when fuzzy sequence matching using String::Approx in perl

2

Entering edit mode

10.4 years ago

Daniel ★ 4.0k

This is a bit complicated but I'm hitting a wall, so please bear with me...

I have written a perl script for searching a reference database with a ~30 base probe. I'm trying to expand it to allow a defined number of mismatches and this led me to String::approx.

The modifiers in the 'aindex' command allow you to define numbers of Insertions, Deletions and Substitutions (Ix,Dx,Sx). If I have these as I0,D0,S0 then I get exact matching as expected but if I use I0,D0,S1 then it matches significantly more than one substitution and I cant understand why

Relevant code snippet:

while (my $seq = $in->next_seq()){
        my $ref = $seq->seq;
        my $new_id = $seq->display_id;

         foreach my $pr (@seqarray){
                my $index = aindex($pr, ["I0","D0","S1"], $ref);
                if ($index != -1){
                        my $match = substr($ref, $index, 30);
                        print "$new_id\t$index\t$match\n";
                        last;
                }
        }

Defining the modifiers as ["I0","D0","S0"] I get:

Species_header    match_position   matching_sequence</code>

<code>Cryptosporidium_15318   813     TAGGGACAGTTGGGGGCATTTGTATTTAAC
Cryptosporidium_67285    820     TAGGGACAGTTGGGGGCATTTGTATTTAAC
Cryptosporidium_80642    822     TAGGGACAGTTGGGGGCATTTGTATTTAAC
Cryptosporidium_86835    822     TAGGGACAGTTGGGGGCATTTGTATTTAAC</code>

Defining the modifiers as ["I0","D0","S1"] I get:

<code>Eimeria_10865  51              TGTGTTTATTAGGTACAAAACCAACCCACT
Hepatozoon_12528    866         TAGGGACAGTTGGGGGCATTTGTATTTAAC
Cryptosporidium_15318    809    TTAATAGGGACAGTTGGGGGCATTTGTATT
Cryptosporidium_34679    842    TATTTAACAGCCAGAGGTGAAATTCTTAGA
Hepatozoon_54015    844         TAGGGACAGTTGGGGGCATTTGTATTTAAC

These matching sequences are real, and in the reference fasta file, but I don't understand how allowing one substitution is causing this. If anyone has any experience with this I would greatly appreciate it! I have seen the solution that Kenosis gave to Perl :Fuzzy Matching Problem which I might try to implement, but I really want to work out why this is causing such unusual results.

Thanks

UPDATE:

So it looks like changing [S0] to [S1] is causing a 4 character offset. This is what I get when changing nothing but the [S1]:

I think it must be the index being incorrectly defined under substitution conditions. Here's the data for Eimeria_10865

---------Zero Substitutions [S0]
probe: TAGGGACAGTTGGGGGCATTTGTATTTAACT
match: TAGGGACAGTTGGGGGCATTTGTATTTAACT

----------One Substitution [S1]
probe: TAGGGACAGTTGGGGGCATTTGTATTTAACT
match: TTAATAGGGACAGTTGGGGGCATTTGTATTT

Eukaryota;Alveolata;Apicomplexa;Coccidia;Eucoccidiorida;Eimeriorina;Eimeriidae;Eimeria;;Eimeria_10865:
TGGTTATTCTACGGCTAATATGTGCGCAATGGCCTCCTTCTCTGGAGGGGCTGTGTTTATTAGGTACAAAACCAACCCACTTTGGTGGACTCTTGGTGATTCATAGTAACCGAACAGATCGCAATTGGCTTGTTGTGGGCACGTTCGATTCCGGAGAGGGAAGCTGAGAAACGGCTACCACATCTAAGGAAGGCAGCAGGCGCGCAAATTACCCAATGAAAACAGTTTCGAGGTAGTGACGAGGAATAACAGTACAGGGCATTTTTTGCTCTGTAATTGGAGTAATGCGAATGRAAGACCCTTTCAGAGTAACAATTGGAGGGCACGTCTGGTGCCCGCAGCCGCTGTAGTTCCAGCTCCAATAGTGCACATTAGAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGTCGTGGTCATCCGGTCGCCCGTATGGGTGTGTGCCTGGTATGACCGCGGCTTTCCTCTGGTAGCCAGCCATCCGCGCTTAATTGCGTGTGTTGGTGTTGAACTAGGACTTTTACTTTGAGAAAAATAGACTGTTTCAAGCAGGCTGGTCGTCCTGAATACTTCAGCATGGAATATGAGGATAGAACCTCGGTTTTATTTTGTTGGTTTTTGGGACCAAGGTATTGATTAATAGGGACAGTTGGGGGCATTTGTATTTAACTGTCAGAGGTGAATTTCTTAGATTTGTTAAAGACGAACTACTGCGACACCATTTGCCAAGGAAGTTTTCATTAATCAAGAACGGTAGCAGGGGGTTCGAGGACGATTAGGCACCGTCGTAATCTCTACCATAAACTATGCGGACTACAGATAGGGAAATGCCTATCTTGGCTTCTCCTGCACCTCATGAGAAGTCGAAGTCTTTGGGTTCTGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAGTTGACGAAAGGGCACCACCAGGCCTGGAGCCTGCGGCTTAATTTGACTCCACACGCGGGAACTCACCAGGTCCAGACATGGGAAGGATCGACAGATTGATAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCAGTTTTTAGTTGGTGGAGTGACCTGTCTGGTGAATTTCGATAACGAACGAGACCTTAGCCTGCTTAATAGGATCAAGAACCTCGATTCTCGTATCACTTTATAGAGGGACTTTGCGTGTCCAACGCAAGGAAGTTTGAGGCAATAACAGGTCTGTGATGCCCTCAGATGTTCTGTGCTGCACACGCGCTACATTGATGCATGCAACGAGCTTTTGACCTTGGCCGACAGGTCTAGGTAATCTTTTGAGTATGCATCGTGATGGGGGTAGAATATTGCAACTATTAATCT

fuzzy-matching perl • 3.5k views

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by Daniel ★ 4.0k

0

Entering edit mode

Could you give a single reproducible example by giving the query string for Eimeria_10865 and $ref?

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by Michael 55k

0

Entering edit mode

I've added to the main post, thanks for looking.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by Daniel ★ 4.0k

1

Entering edit mode

I have played around with this mostly as an exploration of subtle bugs that may exists in 13 or so year old libraries. The behavior is quite curious.

It is the substring TAGGGA of your pattern is what seems to cause the trouble. If you remove the end of your pattern one character at a time you have to shorten it to 5 characters for it to correctly report the index of the match. If you however remove just one letter from the beginning (just the T) it will report the index correctly for the entire pattern... it is a real headscratcher

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by Istvan Albert 102k

0

Entering edit mode

10.4 years ago

SES 8.6k

This behavior may be due to the sequence lengths, and the bug report linked to in the other post suggests the same. From the documentation, "...String::Approx is not well-suited for comparing strings of different length, in other words, if you want a "fuzzy eq", see above. String::Approx is more like regular expressions or index(), it finds substrings that are close matches." In other words, you are seeing the expected behavior, if you read the documentation :). Put another way, it is as if the documentation states you should not do a comparison of A vs. B with the module because it is not designed for that, and you file a bug report saying, "hey, I did a comparison of A vs. B and I didn't get what I expected!"

Seriously though, no one reads the docs closely because hopefully a module will work as advertised (well, like you think it should). In cases like this where you are seeing strange results, that is a good time to read through the docs more closely. Also, you can see more appropriate modules referenced in the documentation for String::Approx.

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by SES 8.6k

0

Entering edit mode

I am not sure that this explains it.

The original poster's use case appears to be what the docs say:

String::Approx is more like regular expressions.

The OP is trying to find substrings that are close matches just as the docs say.

I believe that the documentation tries to discourage use cases where one would want to compute the edit distance between two strings of wildly different lengths.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by Istvan Albert 102k

0

Entering edit mode

I think this does explain it, at least in part. If you look at the bug report, those sequences are not even close in length. Also, OP says, "..searching a reference database with a ~30 base probe." Is the database composed of ~30 bp sequences? That part is not completely clear to me, but I think this is a relevant point for consideration until we know more.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by SES 8.6k

1

Entering edit mode

The problem may have started in a more complicated program but it can be reduced to taking two strings and matching one onto the other. The documentation actually states that this is how one should be using the library as a regular expression pattern matching, to find out if the pattern matches and where the match occurs. I guess it is possible that correct results are only guaranteed if the strings are of equal lengths but that would be even worse prospect regarding its utility.

Just while we are at it stating that the library is not well suited for a use case should never imply that the outcome will be incorrect, only that it may be inefficient, or maybe it can't find the best solution, or maybe it can't find the pattern at all. These are all fine outcomes.

Producing completely incorrect result that does not conform at all to the input parameters better be a bug.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by Istvan Albert 102k

0

Entering edit mode

I agree that these results are due to a bug and I edited my post. I was just trying to point out a likely cause that is explained in the documentation. I'll look at the code when I get a chance but I'm currently in the process of moving.

ADD REPLY • link 10.4 years ago by SES 8.6k

0

Entering edit mode

I see two scenarios here, either it is a bug, or it is the intended behaviour and the package is not very useful (at least not in bioinformatics) and the documentation is not clear, it says "is not well-suited for comparing strings of different length" it doesn't say "will give you wrong coordinates if you use aindex". Most relevant text matching applications that I can think of involve strings of different length. You note that this is a contradiction to the previous claim: "String::Approx is more like regular expressions or index(), it finds substrings that are close matches."?

Bug or not, the maintainer could have taken care of the report, if that doesn't happen I must assume that the package is unmaintained, or not?

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by Michael 55k

0

Entering edit mode

The wording is vague I agree, and it would clarify the issue if someone responded to the bug report and said this is not the intended use case, or this is a bug. However, I read those caveats as saying that if you want those matches, then use a more appropriate module. I didn't read it as just there are faster ways, that was not the implication that I took away.

ADD REPLY • link 10.4 years ago by SES 8.6k

Ram · Accepted Answer · 2014-07-07

2

Entering edit mode

10.4 years ago

Michael 55k

This was reported in 2010 already, status of this bug is still new. Maybe this package shouldn't be used as it is possibly not maintained.

https://rt.cpan.org/Public/Bug/Display.html?id=57213

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by Michael 55k

0

Entering edit mode

Reported by Jeremy Leipzig - that makes sense - bioinformaticians will find the corner cases whereas in natural language processing the usage is less stringent and exhaustive

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by Istvan Albert 102k

0

Entering edit mode

Thanks, I hadn't seen that. I'll put this in the "Don't touch with a barge pole" pile. I'll stick with my brute force approach.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by Daniel ★ 4.0k