Question

What is the fastest way to extract reads with a specific barcode from a FASTQ file?

5.9 years ago

b10hazard ▴ 30

Illumina's bcl2fastq tool generates fastqs for barcodes that were not specified in the sample sheet. The files are named:

Undetermined_S0_L001_R1_001.fastq.gz
Undetermined_S0_L001_R2_001.fastq.gz
Undetermined_S0_L001_I1_001.fastq.gz

So there is one fastq each for read1, read2, and the barcode read (index1), and the records are in the same order in all three. My question is: what is the fastest way to get the reads for a specific barcode out of these files? The best thing I can come up with is to iterate through them using Python and check the index fastq for the barcode I want. Pseudocode would be something like:

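barcode_of_interest = 'AGAGAGAG'
reads_of_interest = list()
# gzipreader is a placeholder for any reader that yields one FASTQ record at a time;
# the comparison is meant to be against the sequence line of the index record.
for read1, read2, index1 in zip(gzipreader('Undetermined_S0_L001_R1_001.fastq.gz'),
                                gzipreader('Undetermined_S0_L001_R2_001.fastq.gz'),
                                gzipreader('Undetermined_S0_L001_I1_001.fastq.gz')):
    if index1 == barcode_of_interest:
        reads_of_interest.append((read1, read2))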
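
For reference, a runnable version of that pseudocode might look like the sketch below. It assumes plain 4-line FASTQ records (no wrapped sequence lines) and that the three files hold the same number of records in the same order. The function names `read_fastq` and `extract_barcode` are made up for illustration, and the matching pairs are written straight to output files rather than collected in a list, so memory use stays flat.

```python
import gzip
from itertools import islice


def read_fastq(path):
    """Yield one FASTQ record at a time as a list of 4 lines."""
    with gzip.open(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))
            if not record:
                return
            yield record


def extract_barcode(r1_path, r2_path, i1_path, barcode, out1_path, out2_path):
    """Write read pairs whose index read equals `barcode`; return how many were kept."""
    kept = 0
    with gzip.open(out1_path, "wt") as out1, gzip.open(out2_path, "wt") as out2:
        records = zip(read_fastq(r1_path), read_fastq(r2_path), read_fastq(i1_path))
        for read1, read2, index1 in records:
            # The second line of each 4-line record is the sequence.
            if index1[1].strip() == barcode:
                out1.writelines(read1)
                out2.writelines(read2)
                kept += 1
    return kept
```

Calling `extract_barcode('Undetermined_S0_L001_R1_001.fastq.gz', 'Undetermined_S0_L001_R2_001.fastq.gz', 'Undetermined_S0_L001_I1_001.fastq.gz', 'AGAGAGAG', 'AGAGAGAG_R1.fastq.gz', 'AGAGAGAG_R2.fastq.gz')` would write the matching pairs to the two output files (those output names are placeholders). It still reads every record once, so it is a baseline rather than a fast solution.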

This could work, but what if I wanted to do this faster? Is there any way to index the read1 and read2 files in advance and use the positions in the index fastq to make extracting specified barcodes faster? Does faidx do this? Or is there any other tool out there that can do this faster than Python?

bcl2fastq next-gen fastq demultiplexing illumina • 3.1k views

There is also a previously posted solution that uses the deML program: "A: Demultiplexing Illumina data".

5.9 years ago by GenoMax 151k

score 1 · Answer 1 · 2019-06-20

Use demuxbyname.sh from the BBMap suite.

$ demuxbyname.sh

Written by Brian Bushnell
Last modified May 1, 2019

Description:  Demultiplexes sequences into multiple files based on their names,
substrings of their names, or prefixes or suffixes of their names.
Opposite of muxbyname.
This will crash if the number of open file handles is too high (typically over 200 or so, depending on the system).
In that case, please use demuxbyname2.sh which is slightly slower but only writes to 1 file at a time.

Usage:
demuxbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string...>

Something along the lines of:

$ demuxbyname.sh in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT,TAAGGCGA,... outu=filename
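
To make the idea of matching on read names concrete, here is a minimal Python sketch of suffix matching on the header line. It is not BBMap's implementation. It assumes the bcl2fastq headers end with the index sequence (for example `... 1:N:0:AGAGAGAG`), which is the usual Illumina layout but worth confirming on your own files, and the function name `filter_by_header_barcode` is hypothetical.

```python
import gzip
from itertools import islice


def filter_by_header_barcode(in_path, out_path, barcode):
    """Copy records whose header line ends with ':' + barcode; return the count."""
    kept = 0
    suffix = ":" + barcode
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        while True:
            # Assumes plain 4-line FASTQ records.
            record = list(islice(src, 4))
            if not record:
                break
            if record[0].rstrip().endswith(suffix):
                dst.writelines(record)
                kept += 1
    return kept
```

Because the barcode is read from the header, this needs only the R1 and R2 files and not the separate I1 file. Running it once per read file with the same barcode should keep the pairs in sync, provided both files carry the same index in their headers.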