Hi,
I have a multiplexed FASTQ file that contains reads like the following:
@HISEQ:55:H76W4HIWA:1:1101:3414:2138 1:N:0:BC1:BC2:BC3
TTCCCCCAGTAGCGGCGAGCGAACGGGGAGCAGCCCAGAGCCTG
+
FBFFFFFIIFFIIIIIIIIIIFFIFFFFFFFFFFFFBBBBB<B7
@HISEQ:55:H76W4HIWA:1:1101:6230:2144 1:N:0:BC1:BC2:BC3
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
FFFFFFFIIIIIIIIFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
I have quasi paired-end sequencing data, but the second read only contains two barcodes (BC2 and BC3). I therefore transferred BC2 and BC3 from read 2 to the header of read 1, together with BC1, which is part of the read 1 sequence. I want to demultiplex this file by the barcodes (e.g. "BC1:BC2") in the identifier line. The barcodes are known, but I need to allow one mismatch each for BC1 and BC2 when demultiplexing. I tried fastq-grep, but unfortunately it does not allow mismatches. Do you have any suggestions?
I would appreciate any kind of help. Thank you.
P.S. I can also change the delimiters between the barcodes.
You can demultiplex FASTQ files while allowing mismatches in the barcodes with the tool TagDust 2, but by design it will not let you control the exact number of mismatches. (This is why I am posting this as a comment rather than as an answer.) You can find a benchmark comparing it with other tools in its publication.
Doesn't having the barcodes in the ID line also mean that the data has already been demultiplexed and the barcode information is not actually present in the reads? When the Casava pipeline (which produced this data) is run, you have the option of specifying the number of mismatches.
The reads are multiplexed. I will edit my post to make my problem a little clearer.
If you give some details about your experiment, it would be easier to tell whether your data has already been demultiplexed. Usually, if it's Illumina data, the Casava pipeline will already have been run on it. Confirm with your sequencing facility.
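If none of the existing tools fit, the header-based approach described in the question can be sketched in a few lines of Python. This is only a minimal sketch, not a tested pipeline. It assumes that each header ends in ":BC1:BC2:BC3" with real barcode sequences in place of the placeholders, that records are exactly four lines each, and that a barcode counts as matched only if exactly one known barcode is closest to it within the allowed number of mismatches. All function names, the output prefix and the "unassigned" bin are made up for illustration.

```python
from itertools import islice


def hamming(a, b):
    """Count mismatching positions; barcodes of unequal length never match."""
    if len(a) != len(b):
        return float("inf")
    return sum(x != y for x, y in zip(a, b))


def closest(observed, known, max_mismatches=1):
    """Return the single best known barcode within max_mismatches, else None."""
    scored = sorted((hamming(observed, k), k) for k in known)
    if not scored or scored[0][0] > max_mismatches:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # two known barcodes are equally close: ambiguous
    return scored[0][1]


def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples from a 4-line-per-record FASTQ."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return
        yield tuple(record)


def assign(header, known_bc1, known_bc2, max_mismatches=1):
    """Map a header ending in ':BC1:BC2:BC3' to a (bc1, bc2) key, or None."""
    bc1, bc2, _bc3 = header.rsplit(":", 3)[1:]
    m1 = closest(bc1, known_bc1, max_mismatches)
    m2 = closest(bc2, known_bc2, max_mismatches)
    return (m1, m2) if m1 and m2 else None


def demultiplex(in_path, known_bc1, known_bc2, prefix="demux_"):
    """Write one FASTQ per (bc1, bc2) pair, plus one for unassigned reads."""
    outputs = {}
    try:
        with open(in_path) as handle:
            for rec in read_fastq(handle):
                key = assign(rec[0], known_bc1, known_bc2)
                name = "_".join(key) if key else "unassigned"
                if name not in outputs:
                    outputs[name] = open(f"{prefix}{name}.fastq", "w")
                outputs[name].write("\n".join(rec) + "\n")
    finally:
        for out in outputs.values():
            out.close()
    return sorted(outputs)
```

You would call it as something like demultiplex("reads.fastq", ["ACGT", ...], ["TTGA", ...]) with your own barcode lists. It keeps one output file open per barcode pair, so with very many combinations you may hit the operating system's open-file limit. Allowing one mismatch is also only safe if your known barcodes differ from each other at two or more positions.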