Question

Adding a UMI tag to the UMI sequences in FASTQ files

0

4.4 years ago

idedios ▴ 30

I'm struggling to find a way to prepend the string "UMI_" to the 8th field (the UMI sequence) of every header in my FASTQ files, and to edit the gzipped files in place. I used bcl2fastq's UMI trimming to trim the first 23 bp of read 2, which contain the QIAseq 12 bp random UMI sequence followed by the 11 bp common ligation sequence. I'm also wondering about trimming the latter 11 bp, as they may cause issues with read collapsing if they are not all identical.

I'm starting with:

@NB551615:187:HWHW5BGXF:1:12379:1239:ATGCACAGCCTGAAACGAGTCCG 1:N:0:ATCTCAGG+CGGAGAGA

going to this:

@NB551615:187:HWHW5BGXF:1:12379:1239:UMI_ATGCACAGCCTGAAACGAGTCCG 1:N:0:ATCTCAGG+CGGAGAGA

I've looked into using awk and sed, but awk is messy about printing the other lines it is not supposed to edit, and sed has trouble finding the 8th field with the UMI sequence.

Thanks, Ivan

fastq • 2.7k views

ADD COMMENT • link updated 4.4 years ago by i.sudbery 20k • written 4.4 years ago by idedios ▴ 30

0

What is the ultimate goal here? Perhaps we can suggest an alternate solution.

ADD REPLY • link 4.4 years ago by GenoMax 148k

0

The goal is to prep the FASTQs for the bcbio-nextgen UMI pipeline, which requires that the UMI tag be added. See the section on UMIs: https://bcbio-nextgen.readthedocs.io/en/latest/contents/somatic_variants.html

ADD REPLY • link 4.4 years ago by idedios ▴ 30

1

4.4 years ago

swbarnes2 14k

You can probably tell bcl2fastq to ignore those 11 bases. Do you really need it to say UMI in the read name? Some tools, like umi_tools, will not be expecting that.

ADD COMMENT • link 4.4 years ago by swbarnes2 14k

0

umi_tools doesn't need it to say UMI, but saying UMI wouldn't be a problem.

ADD REPLY • link 4.4 years ago by i.sudbery 20k

0

Looking back, it would have been better to have bcl2fastq ignore those 11 bases, and this is what I can do for future runs, especially since for a lot of reads most of those bases are called as Ns and have very low Phred scores.
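
For FASTQs that already exist, the ligation bases could in principle be dropped from the header after the fact. This is only a rough sketch, assuming standard four-line FASTQ records and a header UMI that is the last colon-separated field of the read ID (optionally already prefixed with UMI_). The function name trim_header_umi and the keep variable are made up for illustration; it only rewrites the read name and does not touch the read sequences.

```shell
# Hypothetical helper: reads FASTQ on stdin, writes FASTQ on stdout,
# keeping only the first `keep` bases of the UMI in each read name.
trim_header_umi() {
    awk -v keep=12 '
    NR % 4 == 1 {                       # header lines only
        sp   = index($0, " ")           # first space separates ID from comment
        id   = sp ? substr($0, 1, sp - 1) : $0
        rest = sp ? substr($0, sp) : ""
        pos  = match(id, /:(UMI_)?[ACGTN]+$/)
        if (pos) {
            umi = substr(id, pos + 1)
            tag = ""
            if (substr(umi, 1, 4) == "UMI_") { tag = "UMI_"; umi = substr(umi, 5) }
            id = substr(id, 1, pos) tag substr(umi, 1, keep)
        }
        print id rest
        next
    }
    { print }                           # sequence, "+", and quality lines unchanged
    '
}

# Example usage (file names are placeholders):
#   zcat myreads.fastq.gz | trim_header_umi | gzip > myreads_umi12.fastq.gz
```

The value 12 comes from the 12 bp random UMI described in the question; headers that do not end in a base string are passed through untouched.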

ADD REPLY • link 4.4 years ago by idedios ▴ 30

score 2 · Accepted Answer · 2020-07-23

2

4.4 years ago

i.sudbery 20k

It's not possible to edit the gzipped files in place. You can try:

zcat myreads.fastq.gz \
    |  sed -E 's/([0-9]+):([ATGC]{11})/\1:UMI_\2/' \
    | gzip > myreads_processed.fastq.gz
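
One caveat with the one-liner above is that sed applies the substitution to every line, so in principle it could also hit a quality line that happens to contain digits, a colon, and base letters. A minimal sketch of an alternative, assuming standard four-line FASTQ records and a UMI that is the last colon-separated field of the read ID: awk touches only every fourth line and leaves the rest alone. The function name add_umi_tag and the file names are made up for illustration.

```shell
# Hypothetical helper: reads FASTQ on stdin, writes tagged FASTQ on stdout.
add_umi_tag() {
    awk '
    NR % 4 == 1 {                       # header lines only
        sp   = index($0, " ")           # first space separates ID from comment
        id   = sp ? substr($0, 1, sp - 1) : $0
        rest = sp ? substr($0, sp) : ""
        # Last field made up only of bases (N allowed); headers that are
        # already tagged with UMI_ do not match and are left as they are.
        pos  = match(id, /:[ACGTN]+$/)
        if (pos) id = substr(id, 1, pos) "UMI_" substr(id, pos + 1)
        print id rest
        next
    }
    { print }                           # sequence, "+", and quality lines unchanged
    '
}

# Example usage (file names are placeholders). The gzipped file still cannot
# be edited in place, but writing to a temporary file and renaming it over
# the original gives the same end result:
#   zcat myreads.fastq.gz | add_umi_tag | gzip > myreads.tmp.fastq.gz \
#       && mv myreads.tmp.fastq.gz myreads.fastq.gz
```

Because already-tagged headers are skipped, running it twice on the same file should not produce UMI_UMI_ prefixes.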

ADD COMMENT • link 4.4 years ago by i.sudbery 20k

0

Thanks! Works like a charm

ADD REPLY • link 4.4 years ago by idedios ▴ 30

0

Sorry to necro this thread, but it looks like not every read in my FASTQs is getting the UMI_ added with this. Below are a couple of reads that show this inconsistency.

@NB551615:187:HWHW5BGXF:1:11101:25141:1161:GCCTCAATATNG 1:N:0:ATCGCAGG+CGGAGAGA
GGATGCCTCTCATGAATGGCCTTATTGGATCCAGTCCTCATCTCCCACATAATTCTTTGCCACCTGGAAGCGGACTGGGAACTTTCTCTGCAATTGCACAATCCTCTTATCCTGATGCCAGGTACAAGCCTTATTTTCTATGGAACCTTA
+
AA/AAEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEAEEEEEEEEEEEEEAE<AEEEEE<EEEEEEEEEEEEEEEEEEE/EEEEEAEEEEEEEEEEEE</EEAAEEAEEEE/EE<EEE/EEAAEEEA<AEAEA/EEEE
@NB551615:187:HWHW5BGXF:1:11101:17990:1164:UMI_GGGGAGATGAAT 1:N:0:ATCTCAGG+CGGAGAGA
CTCACTGACGTCGAAGGCTGCCTTCAGTGCCTGGATGTCCGTGGCCACACCGGACACGCGGTAGATGCCCACCTCCTCCATGCCTCGGCGCTCGATCTCCTCCACGCACTGGCGCACGATGTAGGGCACCTTGGACCTCTCTCTCCTGTGG
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAEAEEAEAEEE

EDIT: I found that the cause was N bases in the UMI sequences, so I edited the sed command to match ATGCN bases.

ADD REPLY • link 4.4 years ago by idedios ▴ 30

1

That sed command won't match if there's an N in the 11 bases, of course.

ADD REPLY • link 4.4 years ago by swbarnes2 14k

1

Yeah, I guess you can add N to the regex in sed:

zcat myreads.fastq.gz \
    |  sed -E 's/([0-9]+):([ATGCN]{11})/\1:UMI_\2/' \
    | gzip > myreads_processed.fastq.gz
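
To check that no header slipped through this time, something like the following might help. It is a sketch, assuming four-line FASTQ records and that a correctly tagged read ID ends in :UMI_ followed by bases; count_untagged is a made-up name.

```shell
# Hypothetical helper: reads FASTQ on stdin and prints the number of
# header lines whose read ID does not end in ":UMI_<bases>".
count_untagged() {
    awk '
    NR % 4 == 1 {
        split($0, words, " ")           # words[1] is the read ID
        if (words[1] !~ /:UMI_[ACGTN]+$/) untagged++
    }
    END { print untagged + 0 }          # + 0 prints 0 when nothing was counted
    '
}

# Example usage (file name is a placeholder):
#   zcat myreads_processed.fastq.gz | count_untagged
```

A result of 0 would mean every read name carries the tag; anything else is the number of reads the substitution missed.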

ADD REPLY • link 4.4 years ago by i.sudbery 20k