Question

How to count fastq reads

13

Entering edit mode

9.6 years ago

Chenglin ▴ 260

I have NGS data in .fastq format. Does anyone know how to count reads? Thank you!

sequence next-gen fastq reads • 131k views

ADD COMMENT • link updated 12 months ago by cschu181 ★ 2.8k • written 9.6 years ago by Chenglin ▴ 260

score 40 · Answer 1 · 2018-02-13

40

Entering edit mode

6.8 years ago

sacha ★ 2.4k

'wc' is faster than awk

#yourfile.fastq  
echo $(cat yourfile.fastq|wc -l)/4|bc

#yourfile.fastq.gz
 echo $(zcat yourfile.fastq.gz|wc -l)/4|bc

ADD COMMENT • link 6.8 years ago by sacha ★ 2.4k

0

Entering edit mode

Can you please explain what is the purpose of | bc???

ADD REPLY • link 2.8 years ago by aimanbarki ▴ 20

2

Entering edit mode

BC is a calculator. Here I divide the number on ligne by 4.

Try "echo 4+2|bc" to understand

ADD REPLY • link 2.8 years ago by sacha ★ 2.4k

Ram · Answer 2 · 2015-04-21

8

Entering edit mode

9.6 years ago

Jorge Amigo 14k

for fasta files: grep -c "^>" file.fasta

~~for fastq files: grep -c "^@" file.fastq~~

for fastq files: awk '{s++}END{print s/4}' file.fastq

ADD COMMENT • link 9.6 years ago by Jorge Amigo 14k

0

Entering edit mode

Hmm, that will work for fasta files, not for fastq file!

For fastq files, you can simply count the number of lines and then divide it by 4.

ADD REPLY • link 9.6 years ago by arnstrm ★ 1.9k

0

Entering edit mode

Thank you! Can you explain a little bit, please? why grep -c ">" file.fastq is not working? And why I need to count the number of lines and then divide it by 4?

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Chenglin ▴ 260

0

Entering edit mode

fasta sequences start with ">", and fastq identifiers start with "@". counting "@" starts is much faster.

ADD REPLY • link 9.6 years ago by Jorge Amigo 14k

2

Entering edit mode

Again, there can be a quality score @ that can be starting from the first line, this will throw off your counts if you use grep. Better use the line counts and divide it by 4 (even if it takes some time)

@Chenglin: each fastq read comprises of 4 lines, first line is identifier, second line is the sequence, third line is a blank line (starts with +, may sometime have same description as first line) and the last line is quality for the each base in the second line. So if you count the total number of lines, you get number of reads times 4, so you divide it by 4 and you have the actual number of reads.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by arnstrm ★ 1.9k

0

Entering edit mode

that's right. indeed quality scores can contain "@" characters, which may appear at the beginning of the quality line: http://en.wikipedia.org/wiki/FASTQ_format#Format. I've edited my answer to include the awk solution that counts all lines and divides them by 4 at the end.

ADD REPLY • link 9.6 years ago by Jorge Amigo 14k

0

Entering edit mode

Hi Jorge,

I do see "@" in the quality line. Which command line I can use, please? I am a beginner and I am a little confused.....Sorry.

Best regards,

Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703

ADD REPLY • link 9.6 years ago by Chenglin ▴ 260

0

Entering edit mode

"@" characters may appear in the quality line, so counting "@" starts doesn't work. instead, as previously stated, you may want to count all lines and divide them by 4. this can be accomplished in several ways, and the awk solution on my answer is just one of them.

ADD REPLY • link 9.6 years ago by Jorge Amigo 14k

0

Entering edit mode

Thank you very much! I will try.

Best regards,

Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703

ADD REPLY • link 9.6 years ago by Chenglin ▴ 260

score 5 · Answer 3 · 2015-04-21

5

Entering edit mode

9.6 years ago

arnstrm ★ 1.9k

Here is the fancy script in bash:

#!/bin/bash
if [ $# -lt 1 ] ; then
    echo ""
    echo "usage: count_fastq.sh [fastq_file1] <fastq_file2> ..|| *.fastq"
    echo "counts the number of reads in a fastq file"
    echo ""
    exit 0
fi
 
filear=${@};
for i in ${filear[@]}
do
lines=$(wc -l $i|cut -d " " -f 1)
count=$(($lines / 4))
echo -n -e "\t$i : "
echo "$count"  | \
sed -r '
  :L
  s=([0-9]+)([0-9]{3})=\1,\2=
  t L'
done

Save this as a file (like count_fastq.sh ), then

chmod +x count_fastq.sh

After this you can run it on your file as:

./count_fastq.sh your_file.fastq

ADD COMMENT • link 9.6 years ago by arnstrm ★ 1.9k

0

Entering edit mode

Thank you so much, arnstrm. Thank you for sending me the cool scirpt! I am just a beginner. I will try your script.

Best regards,

Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703

ADD REPLY • link 9.6 years ago by Chenglin ▴ 260

0

Entering edit mode

Wow, I tried this and it works!

Can I have your interpretation on the script you showed above? If you are too occupied to share, can you kindly give some rough guidance on how to interpret this script?

Thanks in advance!

ADD REPLY • link 6.3 years ago by u3005579 • 0

0

Entering edit mode

This script didn't work with me Even after several trials May someone advice how to run it to get the count as I am beginner

ADD REPLY • link 3.0 years ago by Hosny.yasmine123 • 0

GenoMax · Answer 4 · 2018-08-10

3

Entering edit mode

6.3 years ago

johnsonnathant ▴ 120

for i in `ls *.fastq.gz`; do echo $(zcat ${i} | wc -l)/4|bc; done

of note: ` are around ls .fastq.gz

Explanation:

For all gzip compressed fastq files, display the number of reads since 4 lines = 1 reads

*Just a good one-liner to see how many reads obtained from something like demultiplexing went

ADD COMMENT • link updated 6.3 years ago by GenoMax 147k • written 6.3 years ago by johnsonnathant ▴ 120

score 2 · Answer 5 · 2020-03-31

2

Entering edit mode

4.7 years ago

sbwiecko ▴ 20

couldn't we just count the number of separations between the sequence and quality score lines?

grep -c "^+$" ./ref.fastq

ADD COMMENT • link 4.7 years ago by sbwiecko ▴ 20

1

Entering edit mode

the only very slight problem I see is that very rare 1 base sequences (trimmed perhaps) with "+" quality would be counted twice, but I love the simplicity of this answer.

ADD REPLY • link 4.7 years ago by Jorge Amigo 14k

0

Entering edit mode

In some older fastq files, the separator repeats the header line (without the leading @).

ADD REPLY • link 12 months ago by cschu181 ★ 2.8k

score 1 · Answer 6 · 2015-04-21

1

Entering edit mode

9.6 years ago

HG ★ 1.2k

$readnumber= {(wc-l <name of fastq file>) /4 } *2

ADD COMMENT • link 9.6 years ago by HG ★ 1.2k

0

Entering edit mode

Thank you so much, I will try.

Best regards,

Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703

ADD REPLY • link 9.6 years ago by Chenglin ▴ 260

score 1 · Answer 7 · 2017-09-27

LINE_COUNT=$(wc -l filename.fastq)             #Counts the number of lines in fastq file
PART=(${LINE_COUNT//\t/ })                          # Removes the file name from the output and only returns number of lines
READ_COUNT=$((PART[0]/4))                       # Divides the number of lines by 4 which equals the number of reads

Ram · Answer 8 · 2015-04-21

0

Entering edit mode

9.6 years ago

CraigM ▴ 90

A non command line approach for fastq files is to use FastQC, a quality control checking program which is part of many workflows to get a general idea of sequence quality.

Number of reads is listed in the summary statistics at the beginning of the report.

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by CraigM ▴ 90

0

Entering edit mode

Thank you.

Best regards,

Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703

ADD REPLY • link 9.6 years ago by Chenglin ▴ 260

score 0 · Answer 9 · 2015-04-21

0

Entering edit mode

9.6 years ago

Daniel ★ 4.0k

If you're looking at fastq files for the first time, you will probably want to do more than just counting the number of reads, and for that I would recommend downloading FASTQC.

Really good tool for assessing the state of your sequencing data.

ADD COMMENT • link 9.6 years ago by Daniel ★ 4.0k