I'm working on the SEQC2 liquid biopsy dataset (https://www.nature.com/articles/s41587-021-00857-z; a detailed description of the data can be found in https://www.nature.com/articles/s41597-022-01276-8). While looking at the Burning Rock sequencing data (SRR13200965, https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR13200965&display=reads), I noticed that the read identifiers look like this:
gnl|SRA|SRR13200965.1.1A00463:54:H7JGKDMXX:1:1101:1344:1000:JEYSRQSD+VSTREIPV
They describe their read processing as follows:
... After demultiplex and moving 6-bp UMI to the sequence header using bcl2fastq v2.20 (Illumina) ...
So my question is: how do I get the UMI? I guess the 'JEYSRQSD+VSTREIPV' part of the read identifier is relevant, but it neither consists of bases nor has length 6. How should I interpret this data? Thanks.
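The observation about the last field can be checked with a few lines of Python. This is only a sketch: it assumes the UMI-like part is the final colon-separated field of the identifier, and the helper names (last_field_parts, is_dna) are made up for illustration.

```python
# Read identifier copied from the SRA run browser for SRR13200965.
read_id = "gnl|SRA|SRR13200965.1.1A00463:54:H7JGKDMXX:1:1101:1344:1000:JEYSRQSD+VSTREIPV"


def last_field_parts(identifier):
    """Return the '+'-separated parts of the final colon-delimited field."""
    return identifier.rsplit(":", 1)[-1].split("+")


def is_dna(s):
    """True if s is non-empty and contains only A, C, G, T or N."""
    return bool(s) and set(s) <= set("ACGTN")


parts = last_field_parts(read_id)
for part in parts:
    # Print each part with its length and whether it looks like a base sequence.
    print(part, len(part), is_dna(part))
```

Both parts are 8 characters long and contain letters other than A, C, G, T and N, so they cannot be read directly as a 6-bp UMI.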
That's likely the raw read, and the UMI is likely contained in the sequence itself. So to get the UMI, repeat their approach and use bcl2fastq.
Thanks for the reply. I'm a bit confused: isn't bcl2fastq a tool for converting BCL files to FASTQ? I guess it cannot do things like 'moving the first 6 bases to the read header' when starting from a FASTQ file.
True. You could potentially use UMI-tools (umi_tools extract) to do this.
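For illustration, here is a minimal Python sketch of what "moving the first 6 bases to the read header" means for a FASTQ record. It assumes the UMI occupies the first 6 bases of the read, which this thread has not confirmed for SRR13200965. The function name move_umi_to_header and the underscore separator are choices made for this example, not something taken from the paper.

```python
UMI_LEN = 6  # assumption: the "6-bp UMI" sits at the 5' end of the read


def move_umi_to_header(lines, umi_len=UMI_LEN):
    """Yield (header, seq, plus, qual) with the first umi_len bases moved
    from the sequence into the read name."""
    # Group the input into 4-line FASTQ records; a trailing partial record is dropped.
    for header, seq, plus, qual in zip(*[iter(lines)] * 4):
        header, seq, plus, qual = (x.rstrip("\n") for x in (header, seq, plus, qual))
        # Split the read name from any comment that follows the first space.
        name, _, comment = header.partition(" ")
        umi = seq[:umi_len]
        new_header = f"{name}_{umi}" + (f" {comment}" if comment else "")
        # Trim the UMI bases and their quality values from the record.
        yield new_header, seq[umi_len:], plus, qual[umi_len:]


# Toy record (made up, not from the dataset).
example = ["@read1 1:N:0:ACGT\n", "ACGTTGCCCAAA\n", "+\n", "IIIIIIFFFFFF\n"]
result = list(move_umi_to_header(example))
for record in result:
    print("\n".join(record))
```

On real data, a maintained tool such as UMI-tools is the safer choice, since it also handles paired-end reads and compressed files; the sketch only shows the idea.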