Question

Counting Number Of Bases In A Fastq File

3

11.4 years ago

DoubleDecker ▴ 180

I am sure there must be a tool out there that does it and does it fast? Parsing each file with a custom script is an option but I have big files and want something efficient. Too bad FastQC does not seem to provide this option.

fastq ngs • 42k views

updated 2.4 years ago by Der • 0 • written 11.4 years ago by DoubleDecker ▴ 180

1

Is the read length the same for all entries in the fastq file? In that case, you could just count the number of lines with a unix command like 'wc -l' which is probably as fast as it gets. Then divide the result by 4 and multiply by the read length.
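
The arithmetic above can be sketched as a small shell function. This is only an illustration: `count_bases_fixed` is a made-up helper name, and the result is only right if every record uses exactly 4 lines and every read has the length you pass in.

```shell
# count_bases_fixed FILE READ_LENGTH  (hypothetical helper name)
# Assumes a strict 4-line FASTQ in which all reads share READ_LENGTH.
count_bases_fixed() {
    # number of lines / 4 = number of reads; reads * read length = bases
    echo $(( $(wc -l < "$1") / 4 * $2 ))
}
```

For example, `count_bases_fixed reads.fq 100` would print the estimate for a file of 100 bp reads (`reads.fq` is a placeholder name).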

11.4 years ago by KCC ★ 4.1k

Ram · Answer 1 · 2015-01-07

15

10.0 years ago

Lars ★ 1.1k

zcat file.fq.gz | paste - - - - | cut -f2 | wc -c

zcat: print gzipped file (if the file is not zipped, just use cat)

paste - - - -: print four consecutive lines in one row (tab delimited)

cut -f2: print only the second column (after paste this is the second line of the fastq-format, meaning the sequence)

wc -c: count the characters

12

This will also count newline characters; remove those first:

cat test.fastq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c
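
A possible way to wrap this pipeline so that it handles both plain and gzipped input is sketched below. `count_bases` is a hypothetical helper name, the choice of `gzip -dc` over `zcat` is mine (the two are equivalent on most Linux systems), and the sketch still assumes exactly 4 lines per record.

```shell
# count_bases FILE  (hypothetical helper name)
# Prints the total number of bases in a strict 4-line FASTQ, plain or gzipped.
count_bases() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;   # decompress to stdout
        *)    cat "$1" ;;
    esac |
        paste - - - - |          # join each 4-line record into one tab-separated row
        cut -f 2 |               # keep the sequence column
        tr -d '\n' |             # drop newlines so they are not counted
        wc -c |
        tr -d ' '                # some wc implementations pad the number with spaces
}
```

Usage would be `count_bases reads.fq.gz` or `count_bases reads.fq`, where the file names are placeholders.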

updated 2.7 years ago by Ram 44k • written 9.9 years ago by bv213 ▴ 120

Ram · Answer 2 · 2013-08-02

6

11.4 years ago

Gabriel R. ★ 2.9k

awk 'BEGIN{sum=0;}{if(NR%4==2){sum+=length($0);}}END{print sum;}' file.fq

updated 2.7 years ago by Ram 44k • written 11.4 years ago by Gabriel R. ★ 2.9k

0

Thanks, but I am not sure this works 100% properly. As a test, I looked at a raw fastq file that I had run through FastQC. FastQC reports that all the reads are 100 bp, so I calculated the number of bases by simply multiplying the number of reads by 100. The number I get this way is 676,539 bp bigger (994,675,300 versus 993,998,761) than the one from running your awk command on the file. Do you have any idea about the source of this discrepancy?

updated 2.7 years ago by Ram 44k • written 11.4 years ago by DoubleDecker ▴ 180

0

I only tested it on a small fastq. Try:

awk '{ if(NR%4==2){print length($0);} }' file.fq

Pipe that into uniq -c to check whether every sequence is indeed 100 bp long.
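
One way to run that check is sketched below. Because `uniq -c` only merges adjacent identical lines, the sketch sorts the lengths first so that each length appears once in the summary. `read_length_histogram` is a made-up helper name, and the 4-lines-per-record assumption still applies.

```shell
# read_length_histogram FILE  (hypothetical helper name)
# Prints "count length" pairs, one per distinct read length.
read_length_histogram() {
    awk 'NR % 4 == 2 { print length($0) }' "$1" |
        sort -n |    # group equal lengths together
        uniq -c      # count how often each length occurs
}
```

If every read really is 100 bp, the output is a single line whose second column is 100.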

updated 2.7 years ago by Ram 44k • written 11.4 years ago by Gabriel R. ★ 2.9k

0

It is important to mention that this code assumes that each sequence in the fastq file uses EXACTLY 4 lines, otherwise it will not work.

updated 2.7 years ago by Ram 44k • written 11.4 years ago by Eric Normandeau 11k

1

Isn't this the specification for the FASTQ format?

11.4 years ago by Alex Reynolds 36k

0

This may be the reason for my missing bases - I ran the second awk command suggested by Samuel and all my reads are 100 bp long.

11.4 years ago by DoubleDecker ▴ 180

0

Then you should not have any discrepancy. Do you have whitespace or something?
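
A quick way to look for stray characters is sketched here. It counts sequence lines that contain anything other than letters, such as spaces, tabs, or the carriage returns left by Windows line endings. `count_suspect_sequence_lines` is a hypothetical name, and the sketch again assumes 4 lines per record.

```shell
# count_suspect_sequence_lines FILE  (hypothetical helper name)
# Prints how many sequence lines contain a non-letter character.
count_suspect_sequence_lines() {
    awk 'NR % 4 == 2 && /[^A-Za-z]/ { n++ } END { print n + 0 }' "$1"
}
```

A result of 0 suggests the sequence lines are clean; anything else is worth inspecting by hand.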

11.4 years ago by Gabriel R. ★ 2.9k

0

A bit tidier:

awk 'NR%4==2 {sum += length($0)} END {print sum}' file.fq

2.4 years ago by Der • 0

score 3 · Answer 3 · 2013-08-02

3

11.4 years ago

Eric Normandeau 11k

I'd probably go with something simple like:

grep -E '^[ACTGN]+$' file.fq | perl -pe 's/[[:space:]]//g' | wc -c

The assumption here is that you want to count all characters on every line that consists only of the characters A, C, T, G, or N.

You can also use fastx_quality_stats from the fastx toolkit. It reports the total number of bases, among other things.

3

Hi Eric, if you avoid the extended regex, it is twice as fast (on my test file). I also think using the -m option of wc would be better (at least semantically), but it slows down the computation:

grep "^[ACGTN]" test.fastq | tr -d "\n" | wc -m

11.4 years ago by Frédéric Mahé ★ 3.2k

0

Cool, I never thought about the added cost of using extended regular expressions. However, your code is not as safe if you test only for the first character on the line. Fastq files vary and some may even have the name of the sequences starting with one of ACTGN, in which case you would end up counting the characters of unwanted lines. I tend to go with safer over faster. I find it actually saves a LOT of time down the line.

11.4 years ago by Eric Normandeau 11k

0

You're right. The problem here is the lack of normalization for fastq files. Parsing fastq should be much more straightforward.

11.4 years ago by Frédéric Mahé ★ 3.2k

0

Thanks, it works. I'd just like to add that, since fastq files are most often gzipped, zgrep is an alternative in such cases.

8.4 years ago by boczniak767 ▴ 870

0

Thanks, Eric - your snippet does the job well with my raw dataset.

11.4 years ago by DoubleDecker ▴ 180

Ram · Answer 4 · 2015-01-05

1

10.0 years ago

ramarquezo ▴ 10

This assumption is not correct, because many quality lines start with a letter such as C, G, A, T or N, so you add the characters of those quality lines to the count. Remember that the ASCII codes used for quality scores include all the letters of the alphabet! Samuel's awk script works perfectly. The problem is probably that the quality data is missing from your test data or is not on the correct line, because all fastq files have 4 lines for each nucleotide sequence.
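
The effect can be reproduced on a tiny made-up file. In the sketch below, the second record has the quality string `CCCC`, which is a legal quality line but also looks like sequence to a pattern-based count. The file contents and variable names are invented for illustration.

```shell
# Build a two-read FASTQ whose second quality line is "CCCC"
demo_fq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nCCCC\n' > "$demo_fq"

# Pattern-based count: every line made only of A, C, G, T, N is treated as sequence
by_pattern=$(grep -E '^[ACGTN]+$' "$demo_fq" | tr -d '\n' | wc -c | tr -d ' ')

# Position-based count: only line 2 of each 4-line record is treated as sequence
by_position=$(awk 'NR % 4 == 2 { sum += length($0) } END { print sum + 0 }' "$demo_fq")

echo "pattern: $by_pattern  position: $by_position"   # pattern: 12  position: 8
rm -f "$demo_fq"
```

The file holds 8 bases, so the pattern-based count is off by the 4 characters of the misread quality line.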

2

Correct; the only safe approach is to require 4-line properly-formatted fastq. reformat.sh uses this approach, and will additionally verify certain properties like @ and + symbols being in the right places, and the number of bases equaling the number of quality scores, to help ensure the input is valid.

$ reformat.sh in=reads.fq
Input:                          4000000 reads           297852680 bases
Time:                           2.339 seconds.
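
For readers without BBTools installed, the same kinds of checks can be approximated in plain awk. This is a rough sketch and not how reformat.sh is implemented: `validate_fastq` is a hypothetical name, the messages are invented, and the script stops at the first problem it finds.

```shell
# validate_fastq FILE  (hypothetical helper name)
# Checks a strict 4-line FASTQ and prints read and base totals if it looks valid.
# Returns a non-zero exit status on the first structural problem.
validate_fastq() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            printf "line %d: header does not start with @\n", NR; bad = 1; exit
        }
        NR % 4 == 2 { seqlen = length($0); bases += seqlen }
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            printf "line %d: separator does not start with +\n", NR; bad = 1; exit
        }
        NR % 4 == 0 && length($0) != seqlen {
            printf "line %d: quality length differs from sequence length\n", NR; bad = 1; exit
        }
        END {
            if (!bad && NR % 4 != 0) { print "truncated record at end of file"; bad = 1 }
            if (bad) exit 1
            printf "%d reads, %d bases\n", NR / 4, bases
        }
    ' "$1"
}
```

Running `validate_fastq reads.fq` (placeholder file name) either prints the totals or reports the first offending line.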

updated 5.2 years ago by Ram 44k • written 10.0 years ago by Brian Bushnell 20k

score 0 · Answer 5 · 2015-08-30

0

9.3 years ago

billzt ▴ 20

Please try this: https://github.com/billzt/readfq Smart and Fast!!!