Adding a UMI tag to the UMI sequences in FASTQ files
4.4 years ago
idedios ▴ 30

I'm struggling to find a way to prepend the string "UMI_" to the 8th field of every header in my FASTQ files, ideally editing the gzipped files in place. I used bcl2fastq's UMI trimming tool to trim the first 23 bp of read 2, which contain the QIAseq 12 bp random UMI sequence followed by the 11 bp common ligation sequence. I'm also wondering about trimming the latter 11 bp, as they may cause issues with read collapsing if they are not all identical.

I'm starting with:

@NB551615:187:HWHW5BGXF:1:12379:1239:ATGCACAGCCTGAAACGAGTCCG 1:N:0:ATCTCAGG+CGGAGAGA

going to this:

@NB551615:187:HWHW5BGXF:1:12379:1239:UMI_ATGCACAGCCTGAAACGAGTCCG 1:N:0:ATCTCAGG+CGGAGAGA

I've looked into using awk and sed, but with awk I end up with messy handling of the lines that are not supposed to be edited, and with sed I have trouble targeting the 8th field, which holds the UMI sequence.

Thanks, Ivan

What is the ultimate goal here? Perhaps we can suggest an alternate solution.

The goal is to prep the FASTQs for the bcbio-nextgen UMI pipeline, which requires the UMI tag to be added. See the section on UMIs at https://bcbio-nextgen.readthedocs.io/en/latest/contents/somatic_variants.html

4.4 years ago

It's not possible to edit the gzipped files in place. You can try:

zcat myreads.fastq.gz \
    |  sed -E 's/([0-9]+):([ATGC]{11})/\1:UMI_\2/' \
    | gzip > myreads_processed.fastq.gz
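
If you would rather not depend on matching the bases with a regex, an awk variant can touch only the header lines and tag whatever sits in the last colon-separated field of the read name. This is only a sketch: it assumes strict 4-line FASTQ records (no wrapped sequences) and that the UMI is the last field before the first space, as in the headers shown in the question. `add_umi_tag` is a made-up helper name.

```shell
# Hypothetical helper: prefix the last colon-separated field of each
# read name with UMI_. Reads FASTQ on stdin, writes FASTQ on stdout.
# Assumes 4-line records and that the UMI is the last field of the name.
add_umi_tag() {
    awk '
        NR % 4 == 1 {
            # Split the header into the read name and the optional comment.
            sp = index($0, " ")
            name = (sp ? substr($0, 1, sp - 1) : $0)
            rest = (sp ? substr($0, sp) : "")
            # Tag the last colon-separated field unless it is already tagged.
            n = split(name, f, ":")
            if (n > 1 && f[n] !~ /^UMI_/) {
                f[n] = "UMI_" f[n]
                name = f[1]
                for (i = 2; i <= n; i++) name = name ":" f[i]
            }
            print name rest
            next
        }
        { print }   # sequence, "+" and quality lines pass through unchanged
    '
}

# Usage, with the same file names as above:
# zcat myreads.fastq.gz | add_umi_tag | gzip > myreads_processed.fastq.gz
```

Because it never looks at the bases themselves, this version is not affected by N calls in the UMI, and running it twice does not add a second tag.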

Thanks! Works like a charm

Sorry to necro this thread, but it looks like not every read in my FASTQs is getting the UMI_ added with this. Below are a couple of reads that show this inconsistency.

@NB551615:187:HWHW5BGXF:1:11101:25141:1161:GCCTCAATATNG 1:N:0:ATCGCAGG+CGGAGAGA
GGATGCCTCTCATGAATGGCCTTATTGGATCCAGTCCTCATCTCCCACATAATTCTTTGCCACCTGGAAGCGGACTGGGAACTTTCTCTGCAATTGCACAATCCTCTTATCCTGATGCCAGGTACAAGCCTTATTTTCTATGGAACCTTA
+
AA/AAEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEAEEEEEEEEEEEEEAE<AEEEEE<EEEEEEEEEEEEEEEEEEE/EEEEEAEEEEEEEEEEEE</EEAAEEAEEEE/EE<EEE/EEAAEEEA<AEAEA/EEEE
@NB551615:187:HWHW5BGXF:1:11101:17990:1164:UMI_GGGGAGATGAAT 1:N:0:ATCTCAGG+CGGAGAGA
CTCACTGACGTCGAAGGCTGCCTTCAGTGCCTGGATGTCCGTGGCCACACCGGACACGCGGTAGATGCCCACCTCCTCCATGCCTCGGCGCTCGATCTCCTCCACGCACTGGCGCACGATGTAGGGCACCTTGGACCTCTCTCTCCTGTGG
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAEAEEAEAEEE

EDIT: The cause was N bases in the UMI sequences, so I edited the sed command to match [ATGCN] instead of [ATGC].

That sed command won't match if there's an N in the 11 bases, of course.

Yeah, I guess you can add N to the regex in sed:

zcat myreads.fastq.gz \
    |  sed -E 's/([0-9]+):([ATGCN]{11})/\1:UMI_\2/' \
    | gzip > myreads_processed.fastq.gz
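
To confirm that every header was tagged after rerunning, you could count the read names that still lack the prefix. A minimal sketch, again assuming 4-line FASTQ records and that the tagged UMI is the last colon-separated field of the read name; `count_untagged` is a made-up helper name.

```shell
# Hypothetical helper: count header lines whose read name does not end
# in ":UMI_<bases>". Reads FASTQ on stdin and prints a single number.
count_untagged() {
    # $1 is the read name, i.e. everything before the first space.
    awk 'NR % 4 == 1 && $1 !~ /:UMI_[ACGTN]+$/ { n++ } END { print n + 0 }'
}

# Usage:
# zcat myreads_processed.fastq.gz | count_untagged
```

A result of 0 means every read name carries the tag; anything else is the number of reads that were skipped.
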
4.4 years ago

You can probably tell bcl2fastq to ignore those 11 bases. Do you really need it to say UMI in the read name? Some tools, like umi_tools, will not be expecting that.

UMI tools doesn't need it to say UMI, but saying UMI wouldn't be a problem.

Looking back, it would have been better to have bcl2fastq ignore those 11 bases, and that is what I can do for future runs, especially since for a lot of reads most of those bases are called as N and have very low Phred scores.
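
For FASTQs that already exist, one way to drop the 11 common bases after the fact is to cut the header field down to the UMI itself. A rough sketch, assuming 4-line records and that the UMI is the last colon-separated field of the read name; `trim_umi_field` is a made-up helper name, and the default length of 12 comes from the QIAseq layout described in the question.

```shell
# Hypothetical helper: keep only the first N bases (default 12) of the
# last colon-separated field of each read name, preserving a UMI_ prefix.
# Assumes 4-line FASTQ records; reads stdin, writes stdout.
trim_umi_field() {
    awk -v umi_len="${1:-12}" '
        NR % 4 == 1 {
            sp = index($0, " ")
            name = (sp ? substr($0, 1, sp - 1) : $0)
            rest = (sp ? substr($0, sp) : "")
            n = split(name, f, ":")
            if (n > 1) {
                tag = ""
                umi = f[n]
                # Set an existing UMI_ prefix aside so it is not counted as bases.
                if (umi ~ /^UMI_/) { tag = "UMI_"; umi = substr(umi, 5) }
                f[n] = tag substr(umi, 1, umi_len)
                name = f[1]
                for (i = 2; i <= n; i++) name = name ":" f[i]
            }
            print name rest
            next
        }
        { print }   # other lines are left untouched
    '
}

# Usage:
# zcat myreads_processed.fastq.gz | trim_umi_field 12 | gzip > myreads_umi12.fastq.gz
```

This only shortens the tag in the read name; the read sequences and qualities are not changed.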
