Counting Number Of Bases In A Fastq File
5
3
Entering edit mode
11.4 years ago
DoubleDecker ▴ 180

I am sure there must be a tool out there that does it and does it fast? Parsing each file with a custom script is an option but I have big files and want something efficient. Too bad FastQC does not seem to provide this option.

fastq ngs • 42k views
ADD COMMENT
1
Entering edit mode

Is the read length the same for all entries in the fastq file? In that case, you could just count the number of lines with a unix command like 'wc -l' which is probably as fast as it gets. Then divide the result by 4 and multiply by the read length.

ADD REPLY
15
Entering edit mode
10.0 years ago
Lars ★ 1.1k
zcat file.fq.gz | paste - - - - | cut -f2 | wc -c

zcat: print gzipped file (if the file is not zipped, just use cat)

paste - - - -: print four consecutive lines in one row (tab delimited)

cut -f2: print only the second column (after paste this is the second line of the fastq-format, meaning the sequence)

wc -c: count the characters

ADD COMMENT
12
Entering edit mode

This will also count newline characters, remove those first.

cat test.fastq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c
ADD REPLY
6
Entering edit mode
11.4 years ago
Gabriel R. ★ 2.9k
awk 'BEGIN{sum=0;}{if(NR%4==2){sum+=length($0);}}END{print sum;}'  file.fq
ADD COMMENT
0
Entering edit mode

Thanks but I am a little bit unsure if this works 100% properly... As a test, I had a look at a raw fastq file which I run with FastQC. FastQC claims that all the reads are 100 bp, so I calculated the number of bases by simply multiplying the number of reads by 100. The number I get from doing so is 676,539bp (994,675,300 ver 993,998,761) bigger than the one from running your awk command on the file. Do you have any idea about the source of this discrepancy?

ADD REPLY
0
Entering edit mode

I only tested it on a small fastq, try

awk '{ if(NR%4==2){print length($0);} }' file.fq

pipe that into uniq -c to check if indeed every seq is 100bp in length

ADD REPLY
0
Entering edit mode

It is important to mention that this code assumes that each sequence in the fastq file uses EXACTLY 4 lines, otherwise it will not work.

ADD REPLY
1
Entering edit mode

Isn't this the specification for the FASTQ format?

ADD REPLY
0
Entering edit mode

This may be the reason for my missing bases - I ran the second awk command suggested by Samuel and all my reads are 100 bp long.

ADD REPLY
0
Entering edit mode

then you should not have any discrepency. Do you have white spaces or something ?

ADD REPLY
0
Entering edit mode

A bit tidier:

awk 'NR%4==2 {sum += length($0)} END {print sum}' file.fq
ADD REPLY
3
Entering edit mode
11.4 years ago

I'd probably go with something simple like:

grep -E '^[ACTGN]+$' | perl -pe 's/[[:space:]]//g' | wc -c

The assumption here is that you want to count all characters on all lines that contain only one of ACTG or N.

You can also use fastx_quality_stats from the fastx toolkit. It reports the total number of bases, among other things.

ADD COMMENT
3
Entering edit mode

Hi Eric, if you avoid the extended regex, it is twice faster (on my test file). If also think using the -m option of wc would be better (at least semantically), but it slows down the computation: grep "^[ACGTN]" test.fastq | tr -d "\n" | wc -m

ADD REPLY
0
Entering edit mode

Cool, I never thought about the added cost of using extended regular expressions. However, your code is not as safe if you test only for the first character on the line. Fastq files vary and some may even have the name of the sequences starting with one of ACTGN, in which case you would end up counting the characters of unwanted lines. I tend to go with safer over faster. I find it actually saves a LOT of time down the line.

ADD REPLY
0
Entering edit mode

You're right. The problem here is the lack of normalization for fastq files. Parsing fastq should be much more straighforward.

ADD REPLY
0
Entering edit mode

Thanks it works. I'd like just add that - since most frequently fastq files are gzipped - zgrep is alternative in such instances.

ADD REPLY
0
Entering edit mode

Thanks, Eric - your snippet does the job well with my raw dataset .

ADD REPLY
1
Entering edit mode
10.0 years ago
ramarquezo ▴ 10

This assumption is not correct because many quality lines start by letter like CGAT or N, then you are adding to the count the characters from the lines of the quality that start with this values, remember that the ASCII code include all the letters of the alphabet!!!! The wak Samuel's script works perfectly, sure that the problem is that maybe you are not including the quality data in your data test or are not included in the correct line, because all the fastq files have each 4 lines the nucleotide sequence....

ADD COMMENT
2
Entering edit mode

Correct; the only safe approach is to require 4-line properly-formatted fastq. reformat.sh uses this approach, and will additionally verify certain properties like @ and # symbols being in the right places, and the number of bases equaling the number of quality scores, to help ensure the input is valid.

$ reformat.sh in=reads.fq
Input:                          4000000 reads           297852680 bases
Time:                           2.339 seconds.
ADD REPLY
0
Entering edit mode
9.3 years ago
billzt ▴ 20

Please try this: https://github.com/billzt/readfq Smart and Fast!!!

ADD COMMENT

Login before adding your answer.

Traffic: 1948 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6