Quick One Liner For Fastq Header Renaming
Asked 11.6 years ago by Josh Herr 5.8k

My annual question here: I received 17 large (each between 5.27 and 7.89 GB) FASTQ files of transcriptome data from a collaborator. What they sent me has no header information for any of the sequences in the files. Here are the first two (truncated) sequences of the first file:

@No name
CAAGCGTGTCACCTATACCCCTCCGCCGGGGCAAAA
+
????????DDBDDDDDFFFFF9CEEHCECHHHBFHFH
@No name
CCAACTGCTGTTCACACGGAACCTTTCCCCACTTCAG
+
BBBDDBDDDDDFFFFEFIIIIHIIIHHIIHHHHIIIF

Before I QC and do any downstream analyses, I would like to give each sequence a number so I can reference it later (e.g., to tell controls and treatments apart). I want something like this:

@1_1
CAAGCGTGTCACCTATACCCCTCCGCCGGGGCAAAA
+
????????DDBDDDDDFFFFF9CEEHCECHHHBFHFH
@1_2
CCAACTGCTGTTCACACGGAACCTTTCCCCACTTCAG
+
BBBDDBDDDDDFFFFEFIIIIHIIIHHIIHHHHIIIF

etc., etc., for the millions of sequences... but I need to make sure every sequence gets an individual header name that corresponds to that particular sequence, and when I rename the next file its names must be unique to that file.

This shouldn't be that hard, but I have googled around on the forum and elsewhere and can't find a solution. I tried grep on the files and got a memory error. I also tried bioawk (with the -c flag) on the giant gzipped files, but the renaming was erratic.

Admittedly, I really need to work on my awk skills. Can anyone provide a quick awk one-liner to rename these headers in the unzipped files using minimal memory? Thanks so much for your help.

Tags: fastq, awk
Answer, 11.6 years ago:

Pierre's answer is correct, but I think there is a simpler Awk solution:

zcat file.fastq.gz | awk '{print (NR%4 == 1) ? "@1_" ++i : $0}' | gzip -c > another.fastq.gz

With Awk you can use conditional expressions (expressions with three operands, as in C). A ? B : C reads: if A is true, then B, else C. Here we print the result of the conditional expression to standard output. Awk variables do not need to be declared; an uninitialized variable starts at 0. Pierre's trick of printing 1+i++ so the first name starts at 1 can be simplified by using pre-increment (++i) instead of post-increment (i++).
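
The question also asks for names that stay unique across all 17 files. A minimal sketch of one way to do that (assuming bash and GNU awk; the renamed_N.fastq.gz output naming is just an illustration) is to loop over the files and pass a per-file prefix into awk with -v:

# Use the loop index n as the file-specific prefix, so the first file
# gets @1_1, @1_2, ... and the second file gets @2_1, @2_2, ...
# The counter i restarts at 0 for each file because each file gets
# its own awk process.
n=0
for f in *.fastq.gz; do
    n=$((n + 1))
    zcat "$f" | awk -v p="$n" '{ print ((NR % 4 == 1) ? "@" p "_" ++i : $0) }' | gzip -c > "renamed_${n}.fastq.gz"
done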

Comment:

thank you for improving my awk :-)

Comment:

Thanks for improving my awk too!

Answer by Pierre, 11.6 years ago:

NR is the current record (line) number. $0 is the whole current line.

gunzip -c file.fastq.gz |  awk '{if(NR%4==1) $0=sprintf("@1_%d",(1+i++)); print;}' | gzip -c > another.fastq.gz
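
To spot-check the result without decompressing the whole file, you can peek at the first couple of renamed records (8 lines = 2 FASTQ entries; another.fastq.gz is the output file from the command above):

zcat another.fastq.gz | head -8
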
Comment:

Thanks! This worked like a charm!

Comment:

6.4 years later still helpful, thanks :)

Answer, 11.6 years ago:

With Biopieces (www.biopieces.org):

read_fastq -i in.fq | add_ident -k SEQ_NAME -p 1_ | write_fastq -o out.fq -x
Comment:

Thanks for this and biopieces!

Answer by bioinfo ▴ 840, 11.6 years ago:

I guess we can do it with the fastx toolkit as well. Here it is:

$ fastx_renamer -h
usage: fastx_renamer [-n TYPE] [-h] [-z] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.10 by A. Gordon (gordon@cshl.edu)

   [-n TYPE]    = rename type:
          SEQ - use the nucleotides sequence as the name.
          COUNT - use simply counter as the name.
   [-h]         = This helpful help screen.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
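
Going by the help text above, a run that numbers every read and gzips the output would look something like this (a sketch, not tested; note that COUNT mode uses a plain running counter, so it will not add the per-file 1_ prefix the question asks for):

# Rename each read to a simple counter and compress the output
fastx_renamer -n COUNT -i in.fastq -o renamed.fastq.gz -z
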
Comment:

Thanks! I was not aware of this.

Answer by ATpoint 85k, 5.2 years ago:

Very simple with seqtk rename, see https://github.com/lh3/seqtk
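
For example (a sketch; check the usage message of seqtk rename on your version, but it takes the input file, which may be gzipped, and an optional name prefix):

# Rename reads as 1_1, 1_2, ... and recompress
seqtk rename in.fastq.gz 1_ | gzip -c > renamed.fastq.gz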

Comment:

Thanks for pointing that one out for posterity! When I asked the question 6.4 years ago the seqtk tool did not exist!
