I am trying to demultiplex a FASTQ file using Perl, allowing up to two mismatches. Which module or regex is fastest for searching for a barcode in a sequence? A 12 bp barcode string is searched for in each sequence in the FASTQ file. I have tried this:
my $barcode = "AATTCCGGAATT";
my $line = "AAnnCCGGAATTAATTTAAATTATTATTATTCTCCCGGCGGGGCGGGCGGCGGGCGGC";
# not only at start, can be like this too
$line = "GGAAnnCCGGAATTAATTTAAATTATTATTATTCTCCCGGCGGGGCGGGCGGCGGGCGGC";
# I tried with pattern search
$line =~ /\w\wTTCCGGAATT|\wA\wTCCGGAATT|\wAT\wCCGGAATT| so on for 66 combinations/
But this approach is slow. Is there any other faster solution for mismatch search in perl? Any suggestions will be highly appreciated.
I think that's about as good as you're going to get with a regex. I'm not 100% sure, but I think that even if you were to write the regex in some more concise way, the actual number of operations the regex would have to do would not be smaller.
The only other way that I can think of which would be faster would be to write a little function that, for every substring of length($barcode) in $line, compares it letter-by-letter against the barcode, and as soon as it hits a third mismatch, aborts and moves on to the next substring. This will be faster than the regex since it doesn't have to re-check all the bases for every alternative in the alternation. Since you are essentially doing alignment though, I'd be surprised if there wasn't a better way with an existing module.
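To make the sliding-window idea concrete, here is a minimal sketch in Python (the same language as the snippet mentioned further down; it should port to Perl with substr and a loop). The function name `find_barcode` and its `max_mismatches` parameter are made up for illustration, and it assumes mismatches are substitutions only, with no insertions or deletions.

```python
def find_barcode(read, barcode, max_mismatches=2):
    """Return the offset of the first window of `read` that matches
    `barcode` with at most `max_mismatches` substitutions, or -1."""
    k = len(barcode)
    for start in range(len(read) - k + 1):
        mismatches = 0
        for a, b in zip(read[start:start + k], barcode):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches: give up on this window
        else:
            # Inner loop finished without a break, so this window is a hit.
            return start
    return -1


barcode = "AATTCCGGAATT"
line = "GGAAnnCCGGAATTAATTTAAATTATTATTATTCTCCCGGCGGGGCGGGCGGCGGGCGGC"
print(find_barcode(line, barcode))
```

The early break is what saves time: most windows fail within the first few bases, so the inner loop rarely runs over all 12 positions.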
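In fact, maybe a better way would be to make a list/set/hashtable of every variant of the barcode with up to two mismatches, and for each substring of $line just check if the substring is in that list/set/hashtable. There are 79 such variants if each mismatched position is written as a wildcard (1 exact, 12 with one wildcard, 66 with two); for a plain hash lookup each wildcard has to be expanded into the concrete bases, which gives more entries. Here is the code for python which I'm sure would be easy to port to perl: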
and used like:
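Roughly, the lookup idea can be sketched like this in Python. This is an illustration under stated assumptions, not a definitive implementation: the names `barcode_variants` and `find_barcode_by_lookup` are made up here, reads are assumed to contain only A/C/G/T/N (they are upper-cased before the lookup), and mismatches are assumed to be substitutions only.

```python
from itertools import combinations, product


def barcode_variants(barcode, max_mismatches=2, alphabet="ACGTN"):
    """Build the set of all strings that differ from `barcode`
    in at most `max_mismatches` positions (substitutions only)."""
    variants = {barcode}
    for n in range(1, max_mismatches + 1):
        for positions in combinations(range(len(barcode)), n):
            for letters in product(alphabet, repeat=n):
                seq = list(barcode)
                for pos, letter in zip(positions, letters):
                    seq[pos] = letter
                variants.add("".join(seq))  # the set removes duplicates
    return variants


def find_barcode_by_lookup(read, variants, k):
    """Return the offset of the first k-mer of `read` found in
    `variants`, or -1 if no window matches."""
    read = read.upper()  # so a lowercase 'n' in the read matches 'N'
    for start in range(len(read) - k + 1):
        if read[start:start + k] in variants:
            return start
    return -1


barcode = "AATTCCGGAATT"
variants = barcode_variants(barcode)  # build once, reuse for every read
line = "GGAAnnCCGGAATTAATTTAAATTATTATTATTCTCCCGGCGGGGCGGGCGGCGGGCGGC"
print(find_barcode_by_lookup(line, variants, len(barcode)))
```

The set is built once per barcode, so each read costs one hash lookup per window. With several barcodes, a dict mapping each variant to its barcode name would let a single lookup do the demultiplexing, provided the barcodes are far enough apart that their variant sets do not overlap.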