I have NGS data in .fastq format. Does anyone know how to count reads? Thank you!
I have NGS data in .fastq format. Does anyone know how to count reads? Thank you!
'wc' is faster than awk
#yourfile.fastq
echo $(cat yourfile.fastq|wc -l)/4|bc
#yourfile.fastq.gz
echo $(zcat yourfile.fastq.gz|wc -l)/4|bc
for fasta files: grep -c "^>" file.fasta
for fastq files: grep -c "^@" file.fastq
for fastq files: awk '{s++}END{print s/4}' file.fastq
Again, there can be a quality score @
that can be starting from the first line, this will throw off your counts if you use grep. Better use the line counts and divide it by 4 (even if it takes some time)
@Chenglin: each fastq read comprises of 4 lines, first line is identifier, second line is the sequence, third line is a blank line (starts with +, may sometime have same description as first line) and the last line is quality for the each base in the second line. So if you count the total number of lines, you get number of reads times 4, so you divide it by 4 and you have the actual number of reads.
that's right. indeed quality scores can contain "@" characters, which may appear at the beginning of the quality line: http://en.wikipedia.org/wiki/FASTQ_format#Format. I've edited my answer to include the awk solution that counts all lines and divides them by 4 at the end.
Hi Jorge,
I do see "@" in the quality line. Which command line I can use, please? I am a beginner and I am a little confused.....Sorry.
Best regards,
Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703
Here is the fancy script in bash:
#!/bin/bash if [ $# -lt 1 ] ; then echo "" echo "usage: count_fastq.sh [fastq_file1] <fastq_file2> ..|| *.fastq" echo "counts the number of reads in a fastq file" echo "" exit 0 fi filear=${@}; for i in ${filear[@]} do lines=$(wc -l $i|cut -d " " -f 1) count=$(($lines / 4)) echo -n -e "\t$i : " echo "$count" | \ sed -r ' :L s=([0-9]+)([0-9]{3})=\1,\2= t L' done
Save this as a file (like count_fastq.sh
), then
chmod +x count_fastq.sh
After this you can run it on your file as:
./count_fastq.sh your_file.fastq
Thank you so much, arnstrm. Thank you for sending me the cool scirpt! I am just a beginner. I will try your script.
Best regards,
Chenglin Chai Ph.D., School of Plant, Environmental, and Soil Sciences Louisiana State University Agricultural Center 207C M.B. Sturgis Hall Baton Rouge, LA 70803, USA Tel: (225) 578-9703
for i in `ls *.fastq.gz`; do echo $(zcat ${i} | wc -l)/4|bc; done
Explanation:
For all gzip compressed fastq files, display the number of reads since 4 lines = 1 reads
*Just a good one-liner to see how many reads obtained from something like demultiplexing went
couldn't we just count the number of separations between the sequence and quality score lines?
grep -c "^+$" ./ref.fastq
$readnumber= {(wc-l <name of fastq file>) /4 } *2
LINE_COUNT=$(wc -l filename.fastq) #Counts the number of lines in fastq file
PART=(${LINE_COUNT//\t/ }) # Removes the file name from the output and only returns number of lines
READ_COUNT=$((PART[0]/4)) # Divides the number of lines by 4 which equals the number of reads
A non command line approach for fastq files is to use FastQC, a quality control checking program which is part of many workflows to get a general idea of sequence quality.
Number of reads is listed in the summary statistics at the beginning of the report.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Can you please explain what is the purpose of | bc???
BC is a calculator. Here I divide the number on ligne by 4.
Try "echo 4+2|bc" to understand