Dear all,
I have paired-end fastq data generated with Illumina bcl2fastq v2.19 and sequenced on a NovaSeq. The i5 index is 7 bp long, the i7 8 bp long.
R1.fastq.gz contains R1 101bp reads:
@A00154:125:HGKTMDMXX:1:1101:10420:1000 1:N:0:AACTGAGG+ATGCGTC
R2.fastq.gz contains the 6bp UMI sequence:
@A00154:125:HGKTMDMXX:1:1101:10420:1000 2:N:0:AACTGAGG+ATGCGTC
R3.fastq.gz contains R2 101bp reads:
@A00154:125:HGKTMDMXX:1:1101:10420:1000 3:N:0:AACTGAGG+ATGCGTC
In a downstream analysis I want to use UMI-tools for deduplication. However, for that I need the UMI to be part of the read name:
@Instrument:RunID:FlowCellID:Lane:Tile:X:Y:UMI ReadNum:FilterFlag:0:IndexSequence or SampleNumber
There are tools to add a UMI to the read name when the UMI is present in the read itself. But in my case, the UMI is in a separate fastq. How could this be achieved?
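One possible approach is a small script that walks the read fastq and the UMI fastq in lockstep and appends each UMI to the matching read ID. The following is only a minimal sketch, not a tested pipeline: the function names and output file names are made up for illustration, and it assumes both files list the same reads in the same order (which the headers above suggest). It joins the UMI with "_", which as far as I know is the separator UMI-tools looks for by default; pass sep=":" if you want the colon-style name shown above and tell UMI-tools about the separator accordingly.

```python
import gzip
from itertools import zip_longest


def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples from an open text handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def tag_header(header, umi, sep="_"):
    """Append the UMI to the read ID (the part before the first space)."""
    read_id, _, comment = header.partition(" ")
    tagged = f"{read_id}{sep}{umi}"
    return f"{tagged} {comment}" if comment else tagged


def add_umi(read_path, umi_path, out_path, sep="_"):
    """Write a copy of read_path whose read names carry the UMI from umi_path."""
    with gzip.open(read_path, "rt") as reads, \
         gzip.open(umi_path, "rt") as umis, \
         gzip.open(out_path, "wt") as out:
        for read, umi in zip_longest(read_fastq(reads), read_fastq(umis)):
            if read is None or umi is None:
                raise ValueError("files contain different numbers of records")
            # Read IDs (before the first space) must agree between the files.
            if read[0].split(" ")[0] != umi[0].split(" ")[0]:
                raise ValueError(f"read ID mismatch: {read[0]} vs {umi[0]}")
            header = tag_header(read[0], umi[1], sep)
            out.write(f"{header}\n{read[1]}\n{read[2]}\n{read[3]}\n")


# Example calls (output names are hypothetical):
# add_umi("R1.fastq.gz", "R2.fastq.gz", "R1.umi.fastq.gz")
# add_umi("R3.fastq.gz", "R2.fastq.gz", "R3.umi.fastq.gz")
```

Run it once for R1.fastq.gz and once for R3.fastq.gz, each time against the UMI file R2.fastq.gz, so both mates end up with the same UMI in their names.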
Looking at the bcl2fastq manual, I have no idea how they made the UMI its own fastq. But bcl2fastq will trim the UMI off of the beginning of the read and put it in the read name if
is in the sample sheet under "settings"
That's what we tried at first. However, according to Illumina tech support, we couldn't do this because we were sequencing with dual indexes and the UMI was only in the i7. The option that you describe only works when you're sequencing with a single index.
I'm also curious, what bases mask did you use for the demultiplexing to get these three fastqs?