Question

Demultiplexing with barcode at random location in sequence

1

Entering edit mode

4.0 years ago

gregoryvb ▴ 10

I have paired-end sequencing reads (Illumina) in .fastq format (I have 2 files, so not interleaved) that I want to demultiplex. They contain 2 samples which each have 2 barcodes, one for the forward read and one for the reverse read (so 4 barcodes in total). A typical read looks like this:

first_part_primer - barcode - last_part_primer - rest_of_sequence

The primer sequence before the barcode can have different lengths or sometimes it is missing completely.

I have tried using sabre with the following command:

sabre pe -f file1.fq -r file2.fq -b barcode.txt -u unmatched_file1.fq -w unmatched_file2.fq -c

where the barcode.txt is formatted as is suggested in the sabre documentation. This works fine when the barcodes are at the beginning of the reads, but when the barcodes are somewhere in the middle of the reads, it doesn't recognize them. This seems to be the case for most demultiplexing tools I have seen. I have also found demultiplex, but this gives errors during installation.

My question is how to demultiplex the reads in this case? Please let me know if you have any suggestions or solutions.

Edit

This is an example of how the reads look

@A00153:690:HJVN5DSXY:1:1101:1470:1000 1:N:0:GAACGTGA+AACTGGTG AGGTCAGTCACATGGTTAGGACGCAGATAGACAACGAAAACGAACGGGATAAAATATTTAACTTGCGGGACGGATTCAGCTCTCACTACGACCAGCACTACCTAAGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAACGTGAA + F:FFFF:FFFFF:F:,,F:,FF:FFFFFFFFFFFF:FFFF,FFFF,:,FFFFFFFFFFFFF,FFFFF:,F,FFFFFFFF,FFFFFFFFF,FFFF,:F,F::FF,::,F:FFF,FFFFFFF,F::FF,:,FFFFFFF:FFF,F:FF,FF::

@A00153:690:HJVN5DSXY:1:1101:3568:1000 2:N:0:GAACGTGA+AACTGGTG AGGTCAGTCACATGGTTAGGACGCAGCGAGTAAACGAAAACGAACGGGATAAATACGGTAATCGAAAACCGATACGATCGGCATAGAAAAGGTTGACAAGGAAATTGACGAATTGAAGCAGAAACTGGAAAACTTGGTAAAACAAGAAGC + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF

@A00153:690:HJVN5DSXY:1:1101:4056:1000 2:N:0:GAACGTGA+AACTGGTG AGGTCAGTCACATGGTTAGGACGCAGCGAGTAAACGAAAACGAACGGGATAAATACGGTAATCGAAAACCGATACGATCCGGTCGGGTTAAAGTCGAAATCGGACGGGAACCGGTATTTTTGTTCGGTAAAATCACACATGGCTACGAAC + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

The first example is repeated below where the italics indicate the primer sequence and the bold text the barcode sequence.

AGGTCAGTCACATGGTTAGGACGCA GATAGACA ACGAAAACGAACGGGATAAA ATATTTAACTTGCGGGACGGATTCAGCTCTCACTACGACCAGCACTACCTAAGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAACGTGAA

demultiplexing • 2.5k views

ADD COMMENT • link 4.0 years ago by gregoryvb ▴ 10

0

Entering edit mode

Take a look at bbduk.sh: A: Split reads (fastq) x read barcode sequence (no barcode in RG)

You have a tricky dataset but bbduk.sh should work. In case you can't figure it out you will need to provide some example sequences.

ADD REPLY • link 4.0 years ago by GenoMax 147k

0

Entering edit mode

Thanks for your comment! I have added some example reads which might clarify my question. I have tried the suggestion from the post you mentioned, but without much luck either. I seem to end up with pretty much all my reads removed, probably because this as well does not recognize barcode in the middle of the reads.

ADD REPLY • link 4.0 years ago by gregoryvb ▴ 10

0

Entering edit mode

I have been trying to create a code myself, because I don't think the problem itself is that hard. Simply searching for a specific pattern (i.e. the barcode) in each sequence and based on that divide the reads into two categories (i.e. sample 1 or sample 2). But given the size of the files (~50Gb) this is not straightforward (at least not on my computer). Currently I try to write a python code that generates a list for each read to which sample it should belong based on the barcode. Then I use a bash code to divide the reads in two files, one for each sample, based on the output of the python code. (For those who wonder why I am using both python and bash, I am not so familiar with bash that I can write the entire code there and python is not really suitable to deal with data of this size).

ADD REPLY • link 4.0 years ago by gregoryvb ▴ 10

0

Entering edit mode

Searching for a single patterns is easy. Searching for one-off errors of that pattern, or where the pattern might be partially shifted out of the read is more complicated.

Personally, I'd use trimming software already out there to do that, instead of re-inventing the wheel.

ADD REPLY • link 4.0 years ago by swbarnes2 14k

0

Entering edit mode

gregoryvb : Instead of adding the solution to your original post (as you did now) you should move it to a new answer. Just reading through it sounds pretty convoluted but if it seems to work then that is what matters. Your solution also searches only for the forward strand sequence (and that may be what you want/need).

ADD REPLY • link 4.0 years ago by GenoMax 147k

0

Entering edit mode

I agree this solution isn't the most elegant. But at least it gives me the output that I expect and can work with. It indeed only looks for forward strand sequences which is fine for me now. But if one would need reverse strands as well, you can simply add the reverse sequences of the barcodes to the first step, right?

ADD REPLY • link 4.0 years ago by gregoryvb ▴ 10

0

Entering edit mode

That is fine. Once you are sure it produces the output you need you can accept your own answer (green checkmark) to provide closure to this thread.

ADD REPLY • link 4.0 years ago by GenoMax 147k

2

Entering edit mode

4.0 years ago

GenoMax 147k

Using your example above. Get a few bases of the primer and add them to the literal with the barcode. Should work like this. In this example I am just printing the output to screen. You will send it to a file by replacing stdout.fq with a file name.

$ bbduk.sh -Xmx2g in=test.fq literal=TGGTTAGGACGCAGATAGACA outm=stdout.fq k=15 restrictleft=50

Input is being processed as unpaired
Started output streams: 0.007 seconds.
@A00153:690:HJVN5DSXY:1:1101:1470:1000 1:N:0:GAACGTGA+AACTGGTG
AGGTCAGTCACATGGTTAGGACGCAGATAGACAACGAAAACGAACGGGATAAAATATTTAACTTGCGGGACGGATTCAGCTCTCACTACGACCAGCACTACCTAAGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAACGTGAA
+
F:FFFF:FFFFF:F:,,F:,FF:FFFFFFFFFFFF:FFFF,FFFF,:,FFFFFFFFFFFFF,FFFFF:,F,FFFFFFFF,FFFFFFFFF,FFFF,:F,F::FF,::,F:FFF,FFFFFFF,F::FF,:,FFFFFFF:FFF,F:FF,FF::
Processing time:        0.003 seconds.

Input:                      3 reads         450 bases.
Contaminants:               1 reads (33.33%)    150 bases (33.33%)
Total Removed:              1 reads (33.33%)    150 bases (33.33%)
Result:                     2 reads (66.67%)    300 bases (66.67%)

Time:                           0.014 seconds.
Reads Processed:           3    0.22k reads/sec
Bases Processed:         450    0.03m bases/sec

ADD COMMENT • link 4.0 years ago by GenoMax 147k

0

Entering edit mode

Thanks for your answer! Your example indeed works, as long as the primer sequences are always there, but the thing is that the primer sequences were not always fully sequenced, so sometimes I am missing a large part of the primer sequence before the barcode. But I can try this anyways and see how many reads were discarded because of the missing primer sequences.

ADD REPLY • link 4.0 years ago by gregoryvb ▴ 10

0

Entering edit mode

primer sequences were not always fully sequenced

You can adjust restrictleft=50parameter and see if that helps with that.

ADD REPLY • link 4.0 years ago by GenoMax 147k

0

Entering edit mode

Sorry for my delayed reply. I have tried this and still encounter the situation that there are too many reads lost. I suppose trimming is not accurate enough in this situation. I have created a solution based on the bbmap package that you recommended for bbduk (see the accepted answer). Thanks again for your efforts!

ADD REPLY • link 4.0 years ago by gregoryvb ▴ 10

score 1 · Accepted Answer · 2020-12-11

Inspired by the comments of both @GenoMax and @swbarnes2, I have found a solution, though possibly not the most efficient one. It is a combination of bash codes and bbsplitpairs.sh (part of the bbduk package). It uses the following steps:

grep --no-group-separator -h -A 2 -B 1 -E 'barcode1|barcode2' $file1 $file2 > $outputfile This searches for all lines in both fastq files that contain either of the two barcodes. When a line contains one of the barcodes, it stores that line (the actual sequence), the previous line (the header)(-B 1) and the next two lines (the dummy line with the + and the quality score of the read)(-A 2) to an output file. The reads from both input files are stored in a single output file (the output file is thus interleaved now).
cat $outputfile | paste - - - - | sort -k1,1 -S 1G | tr '\t' '\n' > $outputfilesorted The disadvantage of the first step is that the read pairs are not ordered. All the downstream processes expect that read pairs are stored together like this:
- readpair 1, forward read
- readpair 1, reverse read
- readpair 2, forward read
- readpair 2, reverse read
- etc.

To accomplish this, the reads are sorted based on their headers so that the reads that form a pair are subsequent reads. This is stored in another output file that should have the same size as the output file from the first step.

bash bbsplitpairs.sh -Xmx1g in=$outputfilesorted out=$outputpair outs=$outputsing fint After step 2 the fastq files are correctly formatted, but there are still singletons in the files (reads that do not have a paired partner because the paired partner either has a barcode that belongs to another sample or does not contain any barcodes at all). Those singleton reads have to be removed from the files in order for the downstream processes to work correctly. For this I use the bbsplitpairs tools from the bbmap package.

These three steps are repeated for all samples that are present in the data with the appropriate barcodes.

When I check the output files, the reads indeed seem to be properly split and combined.