Question

awk or/and cut command for reformatting sequence header identifiers of fastq file

0

Entering edit mode

5.8 years ago

qinglong ▴ 10

Hi there,

I have error-corrected reads with header names changed (four lines per one sequence):

@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 BH:changed:3
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 BH:changed:3
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<

I need simplified headers (four lines per one sequence):

@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<

Please noted the ASCII_BASE 33 format (@ is also available in the quality profile) for my Illumina reads.

Thanks!

fastq header identifiers reformat • 3.7k views

ADD COMMENT • link updated 5.8 years ago by c_dampier ▴ 60 • written 5.8 years ago by qinglong ▴ 10

1

Entering edit mode

5.8 years ago

cpad0112 21k

with gnu-sed ( I copy pasted OP sequence thrice to check the code):

$ sed  '3~4 s/.*/+/; 1~4 s/[^ ]*$//' test.fq

@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<
@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<
@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<

ADD COMMENT • link 5.8 years ago by cpad0112 21k

score 3 · Accepted Answer · 2019-03-02

3

Entering edit mode

5.8 years ago

c_dampier ▴ 60

I think this (edited per h.mon comments) awk code will give you the desired output:

awk 'NR%4==1 {$3=""} NR%4==3 {$0=substr($0,1,1)} {print $0}' <my.fq>

The first rule finds each 4th line starting with line 1, and its associated action changes the third field (separated by " ") to empty.

The next rule finds each 4th line starting with line 3, and its associated action takes a sub-string of the line, starting at position 1, and keeps 1 character.

The third (implicit) rule finds every line, and its associated action prints the full line. Note that this final action prints the lines after they have been acted upon by the first two rules.

Original (unsafe) code for reference:

awk '/^@/ {$3=""} /^+/ {$0=substr($0,1,1)} {print $0}' <my.fq>

ADD COMMENT • link 5.8 years ago by c_dampier ▴ 60

2

Entering edit mode

@ and + are part of the character set used to encode quality, so they are not safe to use for header / comment line determination. A better solution is to parse based on line numbers, as most fastq files place sequences in groups of four lines. The original fastq specification allow for sequence and quality line wrapping, but this is nearly nonexistent, and the four-line per record fastq became the de facto standard.

ADD REPLY • link 5.8 years ago by h.mon 35k

0

Entering edit mode

Thank you, h.mon. I think the edited code will be safer.

ADD REPLY • link 5.8 years ago by c_dampier ▴ 60

1

Entering edit mode

Thanks Christopher! Very clean and clear!

ADD REPLY • link 5.8 years ago by qinglong ▴ 10

0

Entering edit mode

You're welcome, qinglong. I apologize for potentially leading you astray with my original code. Glad h.mon was there to save us.

ADD REPLY • link 5.8 years ago by c_dampier ▴ 60

0

Entering edit mode

I spent half day to figure out a awk command for this purpose; but luckily to have your help.

One stupid thing I met is that awk may not work for compressed files, have to decompress files then manipulate the files. Just a note for others who may be also using this command as.

ADD REPLY • link 5.8 years ago by qinglong ▴ 10

1

Entering edit mode

You can pipe several commands into one, avoiding the need to create intermediary files:

zcat file.fastq.gz | awk 'NR%4==1 {$3=""} NR%4==3 {$0=substr($0,1,1)} {print $0}' | gzip > file_edited.fastq.gz

ADD REPLY • link 5.8 years ago by h.mon 35k

0

Entering edit mode

Yes, you are right; that is exactly what I have done when I was constructing my in-house pipe.

ADD REPLY • link 5.8 years ago by qinglong ▴ 10