Illumina's bcl2fastq tool generates fastqs for barcodes that were not specified in the sample sheet. The files are named:
- Undetermined_S0_L001_R1_001.fastq.gz
- Undetermined_S0_L001_R2_001.fastq.gz
- Undetermined_S0_L001_I1_001.fastq.gz
So there is a fastq for read 1, read 2, and the barcode read (index 1), and all three files are in the same order. My question is: what is the fastest way to pull the reads for a specific barcode out of these files? The best thing I can come up with is to iterate through them in Python and check the index fastq for the barcode I want. The code would be something like:
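import gzip
def gzipreader(path):
    # Yield FASTQ records as 4-line tuples: (header, sequence, '+', qualities).
    with gzip.open(path, 'rt') as fh:
        yield from zip(*[iter(fh)] * 4)
barcode_of_interest = 'AGAGAGAG'
reads_of_interest = list()
for read1, read2, index1 in zip(gzipreader('Undetermined_S0_L001_R1_001.fastq.gz'), gzipreader('Undetermined_S0_L001_R2_001.fastq.gz'), gzipreader('Undetermined_S0_L001_I1_001.fastq.gz')):
    if index1[1].strip() == barcode_of_interest:  # index1[1] is the sequence line
        reads_of_interest.append((read1, read2))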
This could work, but what if I wanted to do this faster? Is there any way to index the read 1 and read 2 files in advance and use the positions in the index fastq to make extracting specified barcodes faster? Does faidx do this? Or is there any other tool out there that can do this faster than Python?
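To make the "index in advance" idea concrete, here is a rough sketch of what I have in mind, not a benchmarked solution. It reads the index fastq once and records which record numbers carry each barcode, then pulls those records out of a read file. The function names (fastq_records, build_barcode_index, extract_records) are made up for illustration. It stores record numbers rather than byte offsets because, as far as I know, a plain gzip stream cannot be seeked cheaply, so extraction is still a sequential pass over each read file.

```python
import gzip
from collections import defaultdict
from itertools import islice


def fastq_records(path):
    """Yield each record of a gzipped FASTQ as a tuple of 4 stripped lines."""
    with gzip.open(path, "rt") as handle:
        while True:
            record = tuple(line.rstrip("\n") for line in islice(handle, 4))
            if len(record) < 4:  # end of file (or a truncated final record)
                return
            yield record


def build_barcode_index(index_fastq):
    """Map each barcode sequence to the 0-based record numbers that carry it."""
    positions = defaultdict(list)
    for number, record in enumerate(fastq_records(index_fastq)):
        positions[record[1]].append(number)  # record[1] is the sequence line
    return positions


def extract_records(fastq, wanted_numbers):
    """Return the records of `fastq` whose record number is in `wanted_numbers`."""
    wanted = set(wanted_numbers)
    return [rec for n, rec in enumerate(fastq_records(fastq)) if n in wanted]


# Intended use (hypothetical, with the file names from this post):
#   index = build_barcode_index("Undetermined_S0_L001_I1_001.fastq.gz")
#   hits = index["AGAGAGAG"]
#   read1 = extract_records("Undetermined_S0_L001_R1_001.fastq.gz", hits)
#   read2 = extract_records("Undetermined_S0_L001_R2_001.fastq.gz", hits)
```

The index only pays off if I want several barcodes from the same run, since it can be built once and reused; for a single barcode it is the same amount of reading as the loop above.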
There is also a previously posted solution here that uses the deML program: "A: Demultiplexing Illumina data".