Hi there,
I have error-corrected reads whose header names were changed (four lines per sequence):
@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 BH:changed:3
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0 BH:changed:3
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<
I need simplified headers (four lines per sequence):
@E00490:475:HVNFVCCXY:5:1101:7710:1836 1:N:0
GGCATGGGCATCGAACTGGCGGTGTAAGGGTTGGGGCTTTGGC
+
<@FFFJJJJJJJJJFFJJJJJJJJFJAJJFFJFJJJJJJF<JFFJJFJJ-<
Please note that my Illumina reads use the ASCII_BASE 33 quality format, so @ can also appear in the quality line.
Thanks!
@ and + are part of the character set used to encode quality, so they are not safe to use for detecting header / comment lines. A better solution is to parse based on line numbers, as most fastq files place sequences in groups of four lines. The original fastq specification allows sequence and quality lines to be wrapped, but this is nearly nonexistent in practice, and the four-line-per-record fastq became the de facto standard.
Thank you, h.mon. I think the edited code will be safer.
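For anyone following along, the line-number approach described above might look something like the minimal sketch below. It is not the code originally posted in this thread. The function name simplify_fastq and the demo record are made up for illustration, and the sketch assumes strict four-line records with the fields to keep (read ID and "1:N:0") being the first two space-separated fields of the header.

```shell
# simplify_fastq is a hypothetical helper name, not part of any tool.
# Assumes strict four-line fastq records read from standard input.
simplify_fastq() {
  awk '
    # Line 1 of each record: keep only the first two header fields.
    NR % 4 == 1 { print (NF >= 2 ? $1 " " $2 : $1); next }
    # Line 3 of each record: replace the repeated header with a bare "+".
    NR % 4 == 3 { print "+"; next }
    # Lines 2 and 4 (sequence and quality) pass through unchanged.
    { print }
  '
}

# Made-up demo record whose quality line starts with "@" on purpose,
# which would confuse any approach that matches on the first character.
printf '%s\n' \
  '@read1 1:N:0 BH:changed:3' \
  'ACGT' \
  '+read1 1:N:0 BH:changed:3' \
  '@+JJ' | simplify_fastq
# Prints:
# @read1 1:N:0
# ACGT
# +
# @+JJ
```

On a real file this would be used as simplify_fastq < input.fastq > output.fastq, where both file names are placeholders.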
Thanks Christopher! Very clean and clear!
You're welcome, qinglong. I apologize for potentially leading you astray with my original code. Glad h.mon was there to save us.
I spent half a day trying to figure out an awk command for this purpose, so I am lucky to have your help.
One silly thing I ran into is that awk does not work directly on compressed files; you have to decompress the files first and then manipulate them. Just a note for others who may also be using this command.
You can pipe several commands into one, avoiding the need to create intermediary files:
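As a rough illustration of that idea (not necessarily the exact command from the original reply), the sketch below decompresses, rewrites, and recompresses in a single pipeline. The temporary directory, file names, and demo record are made up, and it assumes gzip-compressed input with strict four-line records.

```shell
# Made-up demo input, written to a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '@read1 1:N:0 BH:changed:3' \
  'ACGT' \
  '+read1 1:N:0 BH:changed:3' \
  '@+JJ' | gzip > "$tmpdir/in.fastq.gz"

# Decompress to stdout, simplify headers with awk, recompress.
# No uncompressed intermediate file is written to disk.
gzip -dc "$tmpdir/in.fastq.gz" |
  awk '
    NR % 4 == 1 { print (NF >= 2 ? $1 " " $2 : $1); next }  # header line
    NR % 4 == 3 { print "+"; next }                         # separator line
    { print }                                               # sequence, quality
  ' |
  gzip > "$tmpdir/out.fastq.gz"

# Inspect the result without leaving a decompressed copy behind.
gzip -dc "$tmpdir/out.fastq.gz"
```

The same pattern should work with other compressors that can read and write standard streams.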
Yes, you are right; that is exactly what I did when I was constructing my in-house pipeline.