Fastest way to search barcode string with two mismatches in fastq file using perl?
5
4
Entering edit mode
8.6 years ago
sun.nation ▴ 140

I was trying to demultiplex fastq file using perl with two mismatchs. What module or regex is faster to search barcode in sequence? Barcode string of 12bp is searched in the sequence in fastq file. I have tried like:

my $barcode = "AATTCCGGAATT";
my $line = "AAnnCCGGAATTAATTTAAATTATTATTATTCTCCCGGCGGGGCGGGCGGCGGGCGGC";
# not only at start, can be like this too
my $line = "GGAAnnCCGGAATTAATTTAAATTATTATTATTCTCCCGGCGGGGCGGGCGGCGGGCGGC";

# I tried with pattern search 

$line =~ /\w\wTTCCGGAATT|\wA\wTCCGGAATT|\wAT\wCCGGAATT| so on for 66 combinations/

But this approach is slow. Is there any other faster solution for mismatch search in perl? Any suggestions will be highly appreciated.

perl fastq barcode demultiplex • 7.8k views
ADD COMMENT
0
Entering edit mode

I think that's about as good as you're going to get with a regex. I'm not 100% sure, but I think that even if you were to write the regex in some more concise way, the actual number of operations the regex would have to do would not be smaller.

The only other way that i can think of which would be faster would be to write a little function that, for every substring of len($barcode) in $line, compare letter-by-letter against the barcode, and if you get 3 mismatches abort to the next substring. This will be faster than the regex since it doesn't have to re-check all bases on every or. Since you are essentially doing alignment though, i'd be surprised if there wasn't a better way with an existing module.

ADD REPLY
0
Entering edit mode

In fact.... maybe a better way would be to make a list/set/hashtable of all 79 possible combinations of the barcode, and for each substring of $line just check if the substring is in that list/set/hashtable. Here is the code for python which I'm sure would be easy to port to perl:

def make_permutations(b):
    all_permutations = set([b])
    for x in range(0,len(b)):
        for y in range(0,len(b)):
            temp = list(b)
            temp[x] = 'n'
            temp[y] = 'n'
            all_permutations.add( ''.join(temp) )
    return all_permutations

def barcoded(d,b,p):
    return any([f in p for f in [d[s:e] for s,e in enumerate(range(len(b),len(d)+1))]])

and used like:

>>> b = 'AATTCCGGAATT'
>>> p = make_permutations(b)
>>> barcoded('AAnnCCGGAATTAATTTAAATTATTATTATTCTCCCGGCGGGGCGGG',b,p)
True
ADD REPLY
1
Entering edit mode
8.6 years ago
Chris Fields ★ 2.2k

I have used Text::Fuzzy for this in a demultiplexing pipeline where we wanted to allow slight mismatched in barcodes, it worked quite nicely and you can modify maximum edit distance.

EDIT: I don't think there is a really efficient way to do this using strict regex without a reasonable amount of pain and time involved.

ADD COMMENT
0
Entering edit mode

Cool module :D It is however calculating levenshtein distances (which handles insertions/deletions and all text not just DNA) so it might even be slower than the Regex since that's a pretty beefy calculation. +1 for simplicity though!

ADD REPLY
0
Entering edit mode

I found it's actually quite fast for demultiplexing runs (we've used this on microbiome samples where we split 200-300 samples). Though the module is Perl I believe the core code is C, with bindings via perl XS.

ADD REPLY
0
Entering edit mode

@Chris I am trying different ways. I have got 70M paired end reads and takes lot of time. I was wondering how long it took for you. Also, could you please post how to use the module: passing barcode and sequence in the module and getting match?

ADD REPLY
0
Entering edit mode

Why it gives same edit distance after 3 mismatches:

#!/usr/bin/perl
use warnings;
use strict;
use Text::Fuzzy;

my $barcode = "AATTCCGGAATT";
my $line1 = "AATTCCGGAATTAATTTAAATTATTATTATTCTCCCGGCTTGGGGCGGGCGGCGGGCGGC";
my $line2 = "*ATTCCGGAATTAATTTAAATTATTATTATTCTCCCGGCTTGGGGCGGGCGGCGGGCGGC";
my $line3 = "**TTCCGGAATTAATTTAAATTATTATTATTCTCCCGGCTTGGGGCGGGCGGCGGGCGGC";
my $line4 = "***TCCGGAATTAATTTAAATTATTATTATTCTCCCGGCTTGGGGCGGGCGGCGGGCGGC";
my $line5 = "****CCGGAATTAATTTAAATTATTATTATTCTCCCGGCTTGGGGCGGGCGGCGGGCGGC";
my $line6 = "*****CGGAATTAATTTAAATTATTATTATTCTCCCGGCTTGGGGCGGGCGGCGGGCGGC";
print "Barcode length:", length$barcode, "\nSequence length:", length$line1, "\n";
my $tf = Text::Fuzzy->new ($barcode);
print "Exact match distance is ", $tf->distance ($line1), "\n";
print "1 mismatch distance is ", $tf->distance ($line2), "\n";
print "2 mismatch distance is ", $tf->distance ($line3), "\n";
print "3 mismatch distance is ", $tf->distance ($line4), "\n";
print "4 mismatch distance is ", $tf->distance ($line5), "\n";
print "5 mismatch distance is ", $tf->distance ($line6), "\n";

Outputs:
Barcode length:12
Sequence length:60
Exact match distance is 48
1 mismatch distance is 49
2 mismatch distance is 50
3 mismatch distance is 51
4 mismatch distance is 51
5 mismatch distance is 51
ADD REPLY
0
Entering edit mode

Not an idea but that looks like a bug to me. You should report it :)

ADD REPLY
0
Entering edit mode

I was wondering how you get worked for demultiplexing? Did barcode and string had same lenght?

ADD REPLY
1
Entering edit mode
8.6 years ago
John 13k

I tried to re-work your full perl code as best I can (I don't know perl) to a simpler MCVE, and just accept a hard-coded path and a single barcode, and with 2 mismatches print out the number of occurrences:

Note: I took the line: my $barcode1_regex = make_barcode_fragments( $barcode1[$j] ); out of the loop it was in for your original code, because otherwise I think (but im not sure, since this is perl) that you're redefining your regex over and over and over..?

On Encode's ENCFF001LCU fastq file, which contains 32050852 reads @ 36bp, this took 38min 53sec (and gave a result of 8303 hits).

Then I wrote the same thing as the above in python using the code I wrote for you at the start of this thread, which looks like this:

This took 1min 24sec and also gave a result of 8303 hits.

So, assuming that I managed to grok your perl code properly, substring matching is around 28x faster than the regex, not slower.

ADD COMMENT
0
Entering edit mode

Wow, that was too fast. I was using hamming distance so that is the reason amybe. I want to try your method but I am having trouble porting your code into perl. If I am right, first make combination of barcode with 2 mismatch. After that, I was not able to understand other functions, could you please explain more? I did this for first part:

my $barcode = "AATTCCGGAATT";
my $line = "AAnnCCGGAATTAATTTAAATTATTATTATTCTCCCGGCGGGGCGGGCGGCGGGCGGC";

my $permutation = barcode_permutation($barcode);
print @$permutation[0..3], "\nBarcode permutations:", scalar@$permutation, "\n";

sub barcode_permutation {
    my ($in1) = @_;
    my @subpats;
    for my $i ( 0 .. length($in1) - 1 ) {
        for my $j ( $i + 1 .. length($in1) - 1 ) {
            my $subpat = join( '',
                substr( $in1, 0, $i ),
                'x', substr( $in1, $i + 1, $j - $i - 1 ),
                'x', substr( $in1, $j + 1 ),
            );
            push @subpats, $subpat;
        }
    }
    return \@subpats;
}

Outputs:

xxTTCCGGAATTxAxTCCGGAATTxATxCCGGAATTxATTxCGGAATT
Barcode permutations:66

Also, you are right that the subroutine should be outside the for loop. Thanks

ADD REPLY
1
Entering edit mode

Heheh, this is somewhat funny for me, because we can both program and know what we're trying to do, but since I can't understand perl and you python, this might be more tricky than I expected :P

The first bit that makes "p" (which is a set/hash table of all permutations of the barcode) doesn't have any variable characters like 'n' or 'x'. After the first bit is run, p looks like this: http://pastebin.com/raw/k35N9Au9

Then we open the file, skip a line (next(f)), then read a line (new_line = next(f)) Next we enter a loop that won't end until we hit the end of the file (which will cause an exception that we catch). There are other ways to loop over a file, but i'm using next() to skip lines without assigning anything to a variable to save some time. In your code you assigned 4 lines to 4 variables and then only used one, which was very clean and clear, but since you don't use those three other lines, best not to write them to memory if you want to be speedy.

Finally, all the magic happens on the one line:

yup += any([frag in p for frag in [new_line[s:e] for s,e in enumerate(xrange(len(b),len(new_line)+1))]])

which could be written out more verbosely as:

# create all subfragments of a line of DNA:
subfragments = []
for start,end in enumerate( range(len(b),len(new_line)+1) ):
    subfragments.append(new_line[start:end])

# record if any of those subfragments are in "p" 
matches = []
for subfragment in subfragments:
    if subfragment in p: matches.append(True)
    else: matches.append(False)

# if a single True is in the matches list, yup++
if any(matches): yup += 1

However, it's not quite the same as the above, because they way I wrote it was using lazy evaluation, which despite it's name is very fast. Basically, instead of actually making the whole subfragments list, and then making the whole matches list, the first feeds into the second and then the any(), such that if a subfragment is made that matches something in p, we immediately stop and add 1 to yup. We don't make any more subfragments, or check if those subfragments are in p.

This will be a big benefit when many of your reads contain the barcode, but even without it, you should be able to get to 20x faster speeds because the demo file I speedtested this on only found a match 8303 out of 32050852 times, and thus didn't make much use out of lazy evaluation.

ADD REPLY
1
Entering edit mode

Thanks for your input, I am trying to convert this into perl. What I understood so far: Make 12bp all possible permutations of array. From search line: make 12bp sub-strings like:

AACCGGAATTAATTTAAATTATTATTATTCTCCCGGCGGGGCGGGCGGCGGGCGGC
AACCGGAATTAA
ACCGGAATTAAT so on

Check these sub-string one by one, if present in array of barcode, stop and count the number of match (and go to next line) or otherwise go to next sub-string for search. But, I was not able to understand, how this is checking for two mismatches? Is the permutation made with two mismatches?

ADD REPLY
2
Entering edit mode

Yes, John's barcode hash table contains all 631 possible permutations of up to two mismatches (see the pastebin link he provided).

ADD REPLY
1
Entering edit mode

Yup, exactly - and because it's a hashtable and not a standard array, python will do a binary search through it. Basically, every time you double the size of the hashtable you only need to do 1 more search operation. So with 1 barcode and 600 permutations it will need 10 cpu operations to say if the fragment matches a barcode, but for 100 barcodes with 600+ permutations each (60,000 permutations total), you only need another 6 cpu operations more. So it scales very well and is fast-enough. I don't know how to do a binary search on a sorted list of strings, because otherwise that would be faster (no need to hash things), but if you can do that in perl that would be better still.

ADD REPLY
1
Entering edit mode

Thanks, I got how it is working but still having hard time to convert the code into perl. Is that fine with you if I ask online thread to convert this into perl?

ADD REPLY
1
Entering edit mode

Of course! :D

ADD REPLY
0
Entering edit mode

Thanks, I will post complete code and downstream results as I go through

ADD REPLY
0
Entering edit mode
8.6 years ago
Daniel ★ 4.0k

I would recommend the IUPAC.pm module which allows you to generate regex's based on IUPAC codes. I've previously used it for probe testing (Using Bioperl To Match Primer Sites. Difficulty With Iupac.Pm ), but it would work in your case too.

One example:

my $ambiseq_f = Bio::Seq->new(-seq => "$primer_f", -alphabet => 'dna');

# Create all possible non-degenerate sequences
my $iupac = Bio::Tools::IUPAC->new(-seq => $ambiseq_f);

my $regexp = $iupac->regexp();
print "\n$regexp\n";

EDIT

Sorry, just re-read and seen the mismatches. I did this too by making an array of regexes with N substitutions like this:

# make array of N substituted sequences
my $bc = 0;
while ($bc < $prlen-1){
    my $s1 = substr($probe, 0, $bc);
    my $s2 = substr($probe, ($bc + 1), -1);
    push (@degenarray, ($s1 . "N" . $s2));
    $bc++;
}

# Create all possible non-degenerate sequences
foreach my $degpr (@degenarray){
    my $ambiseq = Bio::Seq->new(-seq => "$degpr", -alphabet => 'dna');
    $iupac = Bio::Tools::IUPAC->new(-seq => $ambiseq);
    while (my $P = $iupac->next_seq()){
        push (@seqarray, $P->seq);
    }

}

Note: I haven't looked at this code in 3 years so not 100% sure everything is cool!

ADD COMMENT
0
Entering edit mode
8.6 years ago

Wouldn't it be faster to extract the desired test sequence substring (assuming it's at the start, as in your example), determine the hamming distance from your barcode using the XOR (^) operator, and keep the ones where h.d. < 3? Haven't tested it, but it's probably more appropriate for your application than regexes.

[Edit] Hamming distance can be determined by:

($sequence ^ $barcode) =~ tr/\001-\255//;
ADD COMMENT
0
Entering edit mode

barcode can be anywhere, not only at the start.

ADD REPLY
0
Entering edit mode

Okay, that wasn't specified in the original post. I suppose you could iterate over the sequence substrings, but matching a batch of regexes is probably faster.

Assuming your error rates are typically low, it's may be even faster to try a two-pass approach; first with a perfect regex match (which would capture the majority of desired reads), then the unmatched reads with the mismatch regexes. But it really depends on what fraction of the reads contain the barcode.

ADD REPLY
0
Entering edit mode

I tried this approach: 100% match-> 1 mismatch -> 2 mismatch , not much improvement in speed but took longer than only two mismatch regex.

ADD REPLY
0
Entering edit mode
8.6 years ago

This works for your example (I added a sequence with an internal match):

#!/usr/bin/perl
use warnings;

my $barcode = 'AATTCCGGAATT';
my @sequence = qw(AATTCCGGAATTAATTTAAATTATTATTATTCTCCCGGCTTGGGGCGGGCGGCGGGCGGC 
*ATTCCGGAATTAATTTAAATTATTATTATTCTCCCGGCTTGGGGCGGGCGGCGGGCGGC 
**TTCCGGAATTAATTTAAATTATTATTATTCTCCCGGCTTGGGGCGGGCGGCGGGCGGC 
***TCCGGAATTAATTTAAATTATTATTATTCTCCCGGCTTGGGGCGGGCGGCGGGCGGC
****CCGGAATTAATTTAAATTATTATTATTCTCCCGGCTTGGGGCGGGCGGCGGGCGGC 
*****CGGAATTAATTTAAATTATTATTATTCTCCCGGCTTGGGGCGGGCGGCGGGCGGC 
********************AATTCCGGAATT****************************);
my $seqstring;
my $seqlength;
my $barlength;
my $teststring;
my $hamdist;

foreach $seqstring (@sequence) {
    $seqlength = length $seqstring;
    $barlength = length $barcode;
    for $i (0..$seqlength-$barlength) {
        $teststring = substr($seqstring,$i,$barlength);
        $hamdist = ($teststring ^ $barcode) =~ tr/\001-\255//;
        if ($hamdist <= 2) {
            print "$seqstring\n";
            last;
        }
    }
}

The output is:

AATTCCGGAATTAATTTAAATTATTATTATTCTCCCGGCTTGGGGCGGGCGGCGGGCGGC
*ATTCCGGAATTAATTTAAATTATTATTATTCTCCCGGCTTGGGGCGGGCGGCGGGCGGC
**TTCCGGAATTAATTTAAATTATTATTATTCTCCCGGCTTGGGGCGGGCGGCGGGCGGC
********************AATTCCGGAATT****************************
ADD COMMENT
0
Entering edit mode

Note; if you decide to compare this approach to the multi-regex method, please benchmark and report back on the relative speeds.

ADD REPLY
0
Entering edit mode

I tried this way earlier, it is slower than the regex:

# $search is $line and $find is barcode
sub check_hd_mismatch {
    my ( $search, $find ) = @_;

    my $dis;
    for my $offset ( 0 .. length($search) - length($find) ) {
        my $hd = hd( $find, substr( $search, $offset, length($find) ) );
        if ( $hd <= 2 ) {
            $dis = $hd;
            # my $matched2 = substr( $search2, $offset2, length($find2) );
            last;
        }
    }
    return $dis;
}

sub hd {
    return ( $_[0] ^ $_[1] ) =~ tr/\001-\255//;
}

exit;
ADD REPLY
0
Entering edit mode

Thanks for checking. I assumed iteration over the substrings would be slower than regex matching; now I know.

ADD REPLY
1
Entering edit mode

If you paste all the perl needed to count the barcode in a given fastq file with 2 mismatches, we could speedtest. Personally, i'm confident that the regex is slower than substring matching.

ADD REPLY
1
Entering edit mode

This is the code I am using now for both regex and hd, just modified based on subroutine used

#!/usr/bin/perl
use strict;
use warnings;

$| = 1;

open( IN_P1, "big_file.fastq" ) or die "File not found";

my ( @sample_file_names, @barcode1 );
open( BC_FILE, "to_search.txt" ) or die "No barcode file";
my @barcode_file_content = <BC_FILE>;

foreach (@barcode_file_content) {
    chomp $_;
    $_ =~ s/\r//;
    $_ =~ s/\n//;

    #print $_;
    my @elements = split( "(\t|,| )", $_ );
    push @sample_file_names, $elements[0];
    push @barcode1,   $elements[2];
}

# open FH
my @fh_array_R1;
foreach (@sample_file_names) {
    chomp $_;
    local *OUT_R1;
    open( OUT_R1, ">", "$_\.fq" ) or die "cannot write file";
    push @fh_array_R1, *OUT_R1;
}

# unknown barcode file
open( UNKNOWN_R1, ">unknown-barcode_SE.fq" ) or die "cannot create unknown-r1 file";

while ( defined( my $firstp1 = <IN_P1> ) ) {

    my $p1_first_line  = $firstp1;
    my $p1_second_line = <IN_P1>;
    my $p1_third_line  = <IN_P1>;
    my $p1_fourth_line = <IN_P1>;

    chomp( $p1_first_line, $p1_second_line, $p1_third_line, $p1_fourth_line, );

    my $matched_R1 = "$p1_first_line\n$p1_second_line\n$p1_third_line\n$p1_fourth_line\n";

    for ( my $j = 0 ; $j < scalar @barcode1 ; $j++ ) {
        chomp $barcode1[$j];

        my $barcode1_regex = make_barcode_fragments( $barcode1[$j] );

        if ( $p1_second_line =~ /$barcode1_regex/i ) {
            # keep if matched
            print { $fh_array_R1[$j] } $matched_R1;
            last;
        }
        else {
            #print to unknown;
            print UNKNOWN_R1 $matched_R1;
        }
    }
}

# make two mismatch patterm of barcode
sub make_barcode_fragments {
    my ($in1) = @_;
    my @subpats;
    for my $i ( 0 .. length($in1) - 1 ) {
        for my $j ( $i + 1 .. length($in1) - 1 ) {
            my $subpat = join( '',
                substr( $in1, 0, $i ),
                '\\w', substr( $in1, $i + 1, $j - $i - 1 ),
                '\\w', substr( $in1, $j + 1 ),
            );
            push @subpats, $subpat;
        }
    }
    my $pat = join( '|', @subpats );

    #print $pat;
    return "$pat";
}
exit;

I tried with 100k reads and 100 barcodes: time taken: 3 sec 100% match regex, 3 min 2 mismatch regex, around 4 min hd and 10 sec for Text::Fuzzy but there is something wrong while calculating edit distance because >99% reads went to first barcode and all other were empty

ADD REPLY

Login before adding your answer.

Traffic: 2033 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6