awk or/and cut command for reformatting sequence header identifiers of fastq file
2
0
Entering edit mode
5.8 years ago
qinglong ▴ 10

Hi there,

I have error-corrected reads with header names changed (four lines per one sequence):

@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 BH:changed:3
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 BH:changed:3
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<

I need simplified headers (four lines per one sequence):

@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<

Please noted the ASCII_BASE 33 format (@ is also available in the quality profile) for my Illumina reads.

Thanks!

fastq header identifiers reformat • 3.7k views
ADD COMMENT
3
Entering edit mode
5.8 years ago
c_dampier ▴ 60

I think this (edited per h.mon comments) awk code will give you the desired output:

awk 'NR%4==1 {$3=""} NR%4==3 {$0=substr($0,1,1)} {print $0}' <my.fq>

The first rule finds each 4th line starting with line 1, and its associated action changes the third field (separated by " ") to empty.

The next rule finds each 4th line starting with line 3, and its associated action takes a sub-string of the line, starting at position 1, and keeps 1 character.

The third (implicit) rule finds every line, and its associated action prints the full line. Note that this final action prints the lines after they have been acted upon by the first two rules.

Original (unsafe) code for reference:

awk '/^@/ {$3=""} /^+/ {$0=substr($0,1,1)} {print $0}' <my.fq>
ADD COMMENT
2
Entering edit mode

@ and + are part of the character set used to encode quality, so they are not safe to use for header / comment line determination. A better solution is to parse based on line numbers, as most fastq files place sequences in groups of four lines. The original fastq specification allow for sequence and quality line wrapping, but this is nearly nonexistent, and the four-line per record fastq became the de facto standard.

ADD REPLY
0
Entering edit mode

Thank you, h.mon. I think the edited code will be safer.

ADD REPLY
1
Entering edit mode

Thanks Christopher! Very clean and clear!

ADD REPLY
0
Entering edit mode

You're welcome, qinglong. I apologize for potentially leading you astray with my original code. Glad h.mon was there to save us.

ADD REPLY
0
Entering edit mode

I spent half day to figure out a awk command for this purpose; but luckily to have your help.

One stupid thing I met is that awk may not work for compressed files, have to decompress files then manipulate the files. Just a note for others who may be also using this command as.

ADD REPLY
1
Entering edit mode

You can pipe several commands into one, avoiding the need to create intermediary files:

zcat file.fastq.gz | awk 'NR%4==1 {$3=""} NR%4==3 {$0=substr($0,1,1)} {print $0}' | gzip > file_edited.fastq.gz
ADD REPLY
0
Entering edit mode

Yes, you are right; that is exactly what I have done when I was constructing my in-house pipe.

ADD REPLY
1
Entering edit mode
5.8 years ago

with gnu-sed ( I copy pasted OP sequence thrice to check the code):

$ sed  '3~4 s/.*/+/; 1~4 s/[^ ]*$//' test.fq

@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<
@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<
@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<
ADD COMMENT

Login before adding your answer.

Traffic: 2132 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6