I have paired-end sequencing reads (Illumina) in .fastq format (I have 2 files, so not interleaved) that I want to demultiplex. They contain 2 samples which each have 2 barcodes, one for the forward read and one for the reverse read (so 4 barcodes in total). A typical read looks like this:
first_part_primer - barcode - last_part_primer - rest_of_sequence
The primer sequence before the barcode can have different lengths or sometimes it is missing completely.
I have tried using sabre with the following command:
sabre pe -f file1.fq -r file2.fq -b barcode.txt -u unmatched_file1.fq -w unmatched_file2.fq -c
where the barcode.txt is formatted as is suggested in the sabre documentation. This works fine when the barcodes are at the beginning of the reads, but when the barcodes are somewhere in the middle of the reads, it doesn't recognize them. This seems to be the case for most demultiplexing tools I have seen. I have also found demultiplex, but this gives errors during installation.
My question is how to demultiplex the reads in this case? Please let me know if you have any suggestions or solutions.
Edit
This is an example of how the reads look
@A00153:690:HJVN5DSXY:1:1101:1470:1000 1:N:0:GAACGTGA+AACTGGTG AGGTCAGTCACATGGTTAGGACGCAGATAGACAACGAAAACGAACGGGATAAAATATTTAACTTGCGGGACGGATTCAGCTCTCACTACGACCAGCACTACCTAAGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAACGTGAA + F:FFFF:FFFFF:F:,,F:,FF:FFFFFFFFFFFF:FFFF,FFFF,:,FFFFFFFFFFFFF,FFFFF:,F,FFFFFFFF,FFFFFFFFF,FFFF,:F,F::FF,::,F:FFF,FFFFFFF,F::FF,:,FFFFFFF:FFF,F:FF,FF::
@A00153:690:HJVN5DSXY:1:1101:3568:1000 2:N:0:GAACGTGA+AACTGGTG AGGTCAGTCACATGGTTAGGACGCAGCGAGTAAACGAAAACGAACGGGATAAATACGGTAATCGAAAACCGATACGATCGGCATAGAAAAGGTTGACAAGGAAATTGACGAATTGAAGCAGAAACTGGAAAACTTGGTAAAACAAGAAGC + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF
@A00153:690:HJVN5DSXY:1:1101:4056:1000 2:N:0:GAACGTGA+AACTGGTG AGGTCAGTCACATGGTTAGGACGCAGCGAGTAAACGAAAACGAACGGGATAAATACGGTAATCGAAAACCGATACGATCCGGTCGGGTTAAAGTCGAAATCGGACGGGAACCGGTATTTTTGTTCGGTAAAATCACACATGGCTACGAAC + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
The first example is repeated below where the italics indicate the primer sequence and the bold text the barcode sequence.
AGGTCAGTCACATGGTTAGGACGCA GATAGACA ACGAAAACGAACGGGATAAA ATATTTAACTTGCGGGACGGATTCAGCTCTCACTACGACCAGCACTACCTAAGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAACGTGAA
Take a look at
bbduk.sh
: A: Split reads (fastq) x read barcode sequence (no barcode in RG)You have a tricky dataset but
bbduk.sh
should work. In case you can't figure it out you will need to provide some example sequences.Thanks for your comment! I have added some example reads which might clarify my question. I have tried the suggestion from the post you mentioned, but without much luck either. I seem to end up with pretty much all my reads removed, probably because this as well does not recognize barcode in the middle of the reads.
I have been trying to create a code myself, because I don't think the problem itself is that hard. Simply searching for a specific pattern (i.e. the barcode) in each sequence and based on that divide the reads into two categories (i.e. sample 1 or sample 2). But given the size of the files (~50Gb) this is not straightforward (at least not on my computer). Currently I try to write a python code that generates a list for each read to which sample it should belong based on the barcode. Then I use a bash code to divide the reads in two files, one for each sample, based on the output of the python code. (For those who wonder why I am using both python and bash, I am not so familiar with bash that I can write the entire code there and python is not really suitable to deal with data of this size).
Searching for a single patterns is easy. Searching for one-off errors of that pattern, or where the pattern might be partially shifted out of the read is more complicated.
Personally, I'd use trimming software already out there to do that, instead of re-inventing the wheel.
gregoryvb : Instead of adding the solution to your original post (as you did now) you should move it to a new answer. Just reading through it sounds pretty convoluted but if it seems to work then that is what matters. Your solution also searches only for the forward strand sequence (and that may be what you want/need).
I agree this solution isn't the most elegant. But at least it gives me the output that I expect and can work with. It indeed only looks for forward strand sequences which is fine for me now. But if one would need reverse strands as well, you can simply add the reverse sequences of the barcodes to the first step, right?
That is fine. Once you are sure it produces the output you need you can accept your own answer (green checkmark) to provide closure to this thread.