This is a bit complicated but I'm hitting a wall, so please bear with me...
I have written a perl script for searching a reference database with a ~30 base probe. I'm trying to expand it to allow a defined number of mismatches and this led me to String::approx.
The modifiers in the 'aindex' command allow you to define numbers of Insertions, Deletions and Substitutions (Ix,Dx,Sx). If I have these as I0,D0,S0 then I get exact matching as expected but if I use I0,D0,S1 then it matches significantly more than one substitution and I cant understand why
Relevant code snippet:
while (my $seq = $in->next_seq()){
my $ref = $seq->seq;
my $new_id = $seq->display_id;
foreach my $pr (@seqarray){
my $index = aindex($pr, ["I0","D0","S1"], $ref);
if ($index != -1){
my $match = substr($ref, $index, 30);
print "$new_id\t$index\t$match\n";
last;
}
}
Defining the modifiers as ["I0","D0","S0"] I get:
Species_header match_position matching_sequence</code>
<code>Cryptosporidium_15318 813 TAGGGACAGTTGGGGGCATTTGTATTTAAC
Cryptosporidium_67285 820 TAGGGACAGTTGGGGGCATTTGTATTTAAC
Cryptosporidium_80642 822 TAGGGACAGTTGGGGGCATTTGTATTTAAC
Cryptosporidium_86835 822 TAGGGACAGTTGGGGGCATTTGTATTTAAC</code>
Defining the modifiers as ["I0","D0","S1"] I get:
<code>Eimeria_10865 51 TGTGTTTATTAGGTACAAAACCAACCCACT
Hepatozoon_12528 866 TAGGGACAGTTGGGGGCATTTGTATTTAAC
Cryptosporidium_15318 809 TTAATAGGGACAGTTGGGGGCATTTGTATT
Cryptosporidium_34679 842 TATTTAACAGCCAGAGGTGAAATTCTTAGA
Hepatozoon_54015 844 TAGGGACAGTTGGGGGCATTTGTATTTAAC
These matching sequences are real, and in the reference fasta file, but I don't understand how allowing one substitution is causing this. If anyone has any experience with this I would greatly appreciate it! I have seen the solution that Kenosis gave to Perl :Fuzzy Matching Problem which I might try to implement, but I really want to work out why this is causing such unusual results.
Thanks
UPDATE:
So it looks like changing [S0] to [S1] is causing a 4 character offset. This is what I get when changing nothing but the [S1]:
I think it must be the index being incorrectly defined under substitution conditions. Here's the data for Eimeria_10865
---------Zero Substitutions [S0]
probe: TAGGGACAGTTGGGGGCATTTGTATTTAACT
match: TAGGGACAGTTGGGGGCATTTGTATTTAACT
----------One Substitution [S1]
probe: TAGGGACAGTTGGGGGCATTTGTATTTAACT
match: TTAATAGGGACAGTTGGGGGCATTTGTATTT
Eukaryota;Alveolata;Apicomplexa;Coccidia;Eucoccidiorida;Eimeriorina;Eimeriidae;Eimeria;;Eimeria_10865:
TGGTTATTCTACGGCTAATATGTGCGCAATGGCCTCCTTCTCTGGAGGGGCTGTGTTTATTAGGTACAAAACCAACCCACTTTGGTGGACTCTTGGTGATTCATAGTAACCGAACAGATCGCAATTGGCTTGTTGTGGGCACGTTCGATTCCGGAGAGGGAAGCTGAGAAACGGCTACCACATCTAAGGAAGGCAGCAGGCGCGCAAATTACCCAATGAAAACAGTTTCGAGGTAGTGACGAGGAATAACAGTACAGGGCATTTTTTGCTCTGTAATTGGAGTAATGCGAATGRAAGACCCTTTCAGAGTAACAATTGGAGGGCACGTCTGGTGCCCGCAGCCGCTGTAGTTCCAGCTCCAATAGTGCACATTAGAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGTCGTGGTCATCCGGTCGCCCGTATGGGTGTGTGCCTGGTATGACCGCGGCTTTCCTCTGGTAGCCAGCCATCCGCGCTTAATTGCGTGTGTTGGTGTTGAACTAGGACTTTTACTTTGAGAAAAATAGACTGTTTCAAGCAGGCTGGTCGTCCTGAATACTTCAGCATGGAATATGAGGATAGAACCTCGGTTTTATTTTGTTGGTTTTTGGGACCAAGGTATTGATTAATAGGGACAGTTGGGGGCATTTGTATTTAACTGTCAGAGGTGAATTTCTTAGATTTGTTAAAGACGAACTACTGCGACACCATTTGCCAAGGAAGTTTTCATTAATCAAGAACGGTAGCAGGGGGTTCGAGGACGATTAGGCACCGTCGTAATCTCTACCATAAACTATGCGGACTACAGATAGGGAAATGCCTATCTTGGCTTCTCCTGCACCTCATGAGAAGTCGAAGTCTTTGGGTTCTGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAGTTGACGAAAGGGCACCACCAGGCCTGGAGCCTGCGGCTTAATTTGACTCCACACGCGGGAACTCACCAGGTCCAGACATGGGAAGGATCGACAGATTGATAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCAGTTTTTAGTTGGTGGAGTGACCTGTCTGGTGAATTTCGATAACGAACGAGACCTTAGCCTGCTTAATAGGATCAAGAACCTCGATTCTCGTATCACTTTATAGAGGGACTTTGCGTGTCCAACGCAAGGAAGTTTGAGGCAATAACAGGTCTGTGATGCCCTCAGATGTTCTGTGCTGCACACGCGCTACATTGATGCATGCAACGAGCTTTTGACCTTGGCCGACAGGTCTAGGTAATCTTTTGAGTATGCATCGTGATGGGGGTAGAATATTGCAACTATTAATCT
Could you give a single reproducible example by giving the query string for
Eimeria_10865
and$ref
?I've added to the main post, thanks for looking.
I have played around with this mostly as an exploration of subtle bugs that may exists in 13 or so year old libraries. The behavior is quite curious.
It is the substring
TAGGGA
of your pattern is what seems to cause the trouble. If you remove the end of your pattern one character at a time you have to shorten it to 5 characters for it to correctly report the index of the match. If you however remove just one letter from the beginning (just theT
) it will report the index correctly for the entire pattern... it is a real headscratcher