Question

What is the most optimal way to count the nucleotide bases of each fastq file in my directory using UNIX commands?

0

8.9 years ago

Tom ▴ 20

I have a bunch of fastq files, and I need a one-line UNIX command that prints the number of nucleotides in EACH file (a per-file count, as wc would give), not the total across all files. The output should look like this:

321903 1.fastq

314156 2.fastq

13515 3.fastq

...

and so on.

So far I have

cat *.fastq | awk 'NR%4 == 2 {print $0}'| tr -d '\n' | wc -c

but that doesn't work: it prints a single total for all files. I can't find an answer this specific anywhere.

UNIX fastq bash • 5.8k views

ADD COMMENT • link updated 3.1 years ago by Alex Reynolds 36k • written 8.9 years ago by Tom ▴ 20

0

I guess you just need to run the command on individual files, in a loop, instead of *.fastq, to get the counts per sample. Other than that I don't see anything wrong.
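
A minimal sketch of such a loop, reusing the pipeline from the question. The function name count_bases is only illustrative, and the sketch assumes every FASTQ record is exactly four lines (no wrapped sequences):

```shell
# Hypothetical helper: print "<base count> <file name>" for each FASTQ file given.
# Assumes 4-line records, so the sequence is always line 2 of each group of 4.
count_bases() {
    for f in "$@"; do
        # Keep the sequence lines, delete the newlines, count what is left.
        # The final tr removes the padding some wc implementations add.
        n=$(awk 'NR % 4 == 2' "$f" | tr -d '\n' | wc -c | tr -d ' ')
        printf '%s %s\n' "$n" "$f"
    done
}

# Usage: count_bases *.fastq
```

Running the pipeline once per file, rather than on the output of cat, is what keeps the counts separate.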

ADD REPLY • link 8.9 years ago by GouthamAtla 12k

0

I am using this command to find nucleotide sequences:

*.fastq; do echo -n ${file}; grep -o [actgnACTGN] $file | wc -l; done;

however, I am getting this error:

-bash: syntax error near unexpected token `do'

Please advise how to resolve this issue.

ADD REPLY • link 3.1 years ago by Fizzah ▴ 30

0

That is not a valid for loop. You have no "for" in it. Just copy any of the code suggestions here properly.

ADD REPLY • link 3.1 years ago by ATpoint 85k

2

8.9 years ago

Alex Reynolds 36k

Another option is to use Unix find:

$ find *.fastq -exec sh -c "awk 'NR%4==2' {} | tr -d '\n' | wc -c | sed -e 's/^ *//' | tr -d '\n'; echo '\t{}';" \;

Sample output:

568832   hla.example.illumina.0.1.fastq
568832   hla.example.illumina.0.2.fastq
3102624  hla.example.iontorrent.0.1.fastq
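
A variant of the same idea, as a sketch rather than a drop-in replacement: embedding {} inside the sh -c string and relying on echo to expand '\t' both depend on the find and sh implementations in use, so this version passes the file name as a positional argument and uses printf instead. The function name count_bases_find is hypothetical, and unlike the command above it also descends into subdirectories of the directory it is given:

```shell
# Hypothetical wrapper: count bases in every *.fastq file under directory $1.
count_bases_find() {
    # The file name reaches the inner shell as its own $1, so names with
    # spaces or quotes are not re-parsed as shell code.
    find "$1" -type f -name '*.fastq' -exec sh -c '
        n=$(awk "NR % 4 == 2" "$1" | tr -d "\n" | wc -c | tr -d " ")
        printf "%s\t%s\n" "$n" "$1"
    ' sh {} \;
}

# Usage: count_bases_find .
```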

ADD COMMENT • link 8.9 years ago by Alex Reynolds 36k

0

Can anyone explain the following command in detail?

$ find *.fastq -exec sh -c "awk 'NR%4==2' {} | tr -d '\n' | wc -c | sed -e 's/^ *//' | tr -d '\n'; echo '\t{}';" \;

ADD REPLY • link 3.1 years ago by Fizzah ▴ 30

1

The shell expands *.fastq to the matching file names, and find visits each of those files.
On each file, it runs a short series of commands, which are given inside the double quotation marks (").
The awk command takes the second line of every four lines (the sequence line of every FASTQ record in the file specified by {}) and pipes those lines to a series of additional commands:
The first tr command strips the newlines from the sequence lines, joining them into one long string.
The wc command returns the number of characters in that string, i.e. the total number of bases in the file.
The sed command strips leading space characters from the character count.
The second tr command strips the newline from the result from sed.
The echo command prints a tab and the filename from find after the character count.
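
To see what each stage contributes, the pipeline can be run piece by piece on a tiny file. This is only an illustrative sketch: the two reads below are made up, and the temporary file exists just for the demonstration.

```shell
# Build a two-record FASTQ file to experiment on (made-up reads).
demo=$(mktemp "${TMPDIR:-/tmp}/demo.XXXXXX")
printf '@read1\nACGT\n+\nIIII\n@read2\nGGN\n+\nIII\n' > "$demo"

# Stage 1: awk keeps line 2 of every 4-line record, i.e. the sequences.
awk 'NR % 4 == 2' "$demo"

# Stage 2: tr deletes the newlines, joining the sequences into one string.
awk 'NR % 4 == 2' "$demo" | tr -d '\n'; echo

# Stage 3: wc -c counts the characters that are left (4 + 3 here);
# the last tr removes any padding wc puts in front of the number.
total=$(awk 'NR % 4 == 2' "$demo" | tr -d '\n' | wc -c | tr -d ' ')
echo "$total"

rm -f "$demo"
```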

Learning the command line is a powerful skill.

ADD REPLY • link 3.1 years ago by Alex Reynolds 36k

1

8.9 years ago

Biomonika (Noolean) 3.2k

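You could also name the specific nucleotides you are interested in directly. Select the sequence lines first, because header and quality lines can contain the same letters:

for file in *.fastq; do echo -n "${file} "; awk 'NR%4==2' "$file" | grep -o '[actgnACTGN]' | wc -l; done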

ADD COMMENT • link 8.9 years ago by Biomonika (Noolean) 3.2k

Ram · Accepted Answer · 2016-01-19

3

8.9 years ago

Pierre Lindenbaum 164k

Using only awk:

for F in *.fastq ; do echo -n "$F :" && awk 'NR%4 == 2 {N+=length($0);} END { printf("%d\n",N);}' "$F" ; done

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by Pierre Lindenbaum 164k

0

Will this code also work for fastq.gz files?

ADD REPLY • link 4.6 years ago by User000 ▴ 710

0

Of course not. Adapting it for gz is easy.

ADD REPLY • link 4.6 years ago by Pierre Lindenbaum 164k

1

for F in *.fastq.gz; do echo -n "$F :" && zcat "$F" | paste - - - - | cut -f 2 | tr -d '\n' | wc -c; done >> res.txt
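
The awk-only counter from the accepted answer can be adapted in the same way. The following is a sketch under the same four-line-record assumption; the function name count_bases_gz is hypothetical, and gzip -dc is used here in place of zcat because some systems ship a zcat that does not read .gz files:

```shell
# Hypothetical helper: print "<file name>: <base count>" for gzipped FASTQ files.
count_bases_gz() {
    for f in "$@"; do
        # gzip -dc writes the decompressed data to stdout and leaves the
        # compressed file untouched; awk sums the lengths of the sequence lines.
        n=$(gzip -dc "$f" | awk 'NR % 4 == 2 { n += length($0) } END { printf("%d\n", n) }')
        printf '%s: %s\n' "$f" "$n"
    done
}

# Usage: count_bases_gz *.fastq.gz > res.txt
```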