I'm struggling to find a way to prepend the string "UMI_" to the 8th field of every header in my FASTQ files, editing the gzipped files in place. I used bcl2fastq's UMI trimming tool to trim the first 23 bp of read 2, which contain the QIAseq 12 bp random UMI sequence followed by the 11 bp common ligation sequence. I'm also wondering about trimming the latter 11 bp, as it may cause issues with read collapsing if they are not all identical.
I'm starting with:
@NB551615:187:HWHW5BGXF:1:12379:1239:ATGCACAGCCTGAAACGAGTCCG 1:N:0:ATCTCAGG+CGGAGAGA
going to this:
@NB551615:187:HWHW5BGXF:1:12379:1239:UMI_ATGCACAGCCTGAAACGAGTCCG 1:N:0:ATCTCAGG+CGGAGAGA
I've looked into using awk and sed, but awk is messy about printing the other lines it is not supposed to edit, and sed has trouble finding the 8th field with the UMI sequence.
Thanks, Ivan
What is the ultimate goal here? Perhaps we can suggest an alternate solution.
The goal is to prep the FASTQs for the bcbio-nextgen UMI pipeline, which requires that the UMI tag be added: https://bcbio-nextgen.readthedocs.io/en/latest/contents/somatic_variants.html (see the section on UMIs).
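If awk and sed are getting awkward, here is a minimal Python sketch of the header edit described above. It is not a bcbio or bcl2fastq tool: the names tag_header, tag_fastq_gz and umi_len are made up for illustration. It assumes plain 4-line FASTQ records and that the UMI is the last colon-separated field of the read name (the part before the first space), as in the example header. A gzip stream cannot really be modified in place, so the sketch writes a temporary file next to the original and then swaps it in. The optional umi_len argument keeps only the first N bases of the UMI (for example 12, to drop the 11 bp common sequence); whether that trimming is wanted is for you to decide.

```python
import gzip
import os
import tempfile
from typing import Optional


def tag_header(header: str, prefix: str = "UMI_", umi_len: Optional[int] = None) -> str:
    """Prefix the last colon-separated field of the read name with `prefix`.

    Assumption: the UMI is the last field of the read name, before the
    first space. If `umi_len` is given, only that many leading bases of
    the UMI are kept. Already-tagged headers are returned unchanged.
    """
    header = header.rstrip("\n")
    name, sep, comment = header.partition(" ")
    head, colon, umi = name.rpartition(":")
    if not colon or umi.startswith(prefix):
        return header
    if umi_len is not None:
        umi = umi[:umi_len]
    return f"{head}:{prefix}{umi}{sep}{comment}"


def tag_fastq_gz(path: str, prefix: str = "UMI_", umi_len: Optional[int] = None) -> None:
    """Rewrite a gzipped FASTQ so every header carries the tagged UMI.

    Assumption: strict 4-line records, so every 4th line (0, 4, 8, ...)
    is a header. The output goes to a temporary file in the same
    directory, which then replaces the original.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(suffix=".fastq.gz", dir=directory)
    os.close(fd)
    try:
        with gzip.open(path, "rt") as src, gzip.open(tmp_path, "wt", newline="\n") as dst:
            for i, line in enumerate(src):
                if i % 4 == 0:
                    line = tag_header(line, prefix, umi_len) + "\n"
                dst.write(line)
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the original untouched and clean up the partial output.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling tag_fastq_gz("sample_R1.fastq.gz") (a hypothetical file name) would rewrite that file. Try it on a copy first, since the original is replaced once the rewrite succeeds.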