What is the most optimal way to count the nucleotide bases of each fastq file in my directory using UNIX commands?
3
0
Entering edit mode
9.0 years ago
Tom ▴ 20

I have a bunch of fastq files, and I need to write a one line UNIX command that will write the word count (wc) of how many nucleotides EACH file contains, not the total. It should look like this:

321903 1.fastq

314156 2.fastq

13515 3.fastq

...

and so on.

So far I have

cat *.fastq | awk 'NR%4 == 2 {print $0}'| tr -d '\n' | wc -c

but that doesn't work. I can't find the answer this specific anywhere.

UNIX fastq bash • 5.9k views
ADD COMMENT
0
Entering edit mode

I guess you just need to run the command on individual files, in a loop, instead of *.fastq, to get the counts per sample. Other than that I don't see anything wrong.

ADD REPLY
0
Entering edit mode

I am using this command to find nucleotide sequences:

*.fastq; do echo -n ${file}; grep -o [actgnACTGN] $file | wc -l; done;

however, I am getting this error:

-bash: syntax error near unexpected token `do'

please guide how to resolve this issue

ADD REPLY
0
Entering edit mode

That is not a valid for loop. You have no "for" in it. Just copy any of the code suggestions here properly.

ADD REPLY
3
Entering edit mode
9.0 years ago

using only awk.

for F in *.fastq ; do echo -n "$F :" && awk 'NR%4 == 2 {N+=length($0);} END { printf("%d\n",N);}' $F ; done
ADD COMMENT
0
Entering edit mode

will this code work also for fastq.gz file?

ADD REPLY
0
Entering edit mode

of course not. Fixing for gz is easy.

ADD REPLY
1
Entering edit mode
for F in *.fastq.gz; do echo -n "$F :" && zcat $F | paste - - - - | cut -f 2 | tr -d '\n'| wc -c; done >> res.txt

Something like this? A mix of your code and this one

ADD REPLY
0
Entering edit mode

could you give me some hints? thanks

ADD REPLY
2
Entering edit mode
9.0 years ago

Another option is to use Unix find:

$ find *.fastq -exec sh -c "awk 'NR%4==2' {} | tr -d '\n' | wc -c | sed -e 's/^ *//' | tr -d '\n'; echo '\t{}';" \;

Sample output:

568832   hla.example.illumina.0.1.fastq
568832   hla.example.illumina.0.2.fastq
3102624  hla.example.iontorrent.0.1.fastq
ADD COMMENT
0
Entering edit mode

can anyone describe the following command in detail:

$ find .fastq -exec sh -c "awk 'NR%4==2' {} | tr -d '\n' | wc -c | sed -e 's/^ //' | tr -d '\n'; echo '\t{}';" \;

ADD REPLY
1
Entering edit mode
  1. find looks for files that end with .fastq.
  2. On each file it finds, it runs a couple commands on that file, which are specified within two quotation marks (").
  3. That awk command takes the second line of every four lines (second line of every FASTQ record in the fastq file specified by {}), and it pipes that line to a series of additional commands:
  4. The tr command strips the newline from the second line.
  5. The wc command returns the number of characters in that line.
  6. The sed command strips space characters from the character count.
  7. The tr command strips the newline from the result from sed.
  8. The echo command reports the number of characters from wc and the filename from find.

Learning the command line is a powerful skill.

ADD REPLY
1
Entering edit mode
9.0 years ago

You could also name specific nucleotides you are interested in directly:

for file in *.fastq; do echo -n ${file}; grep -o [actgnACTGN] $file | wc -l; done;

ADD COMMENT

Login before adding your answer.

Traffic: 1603 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6