Sequence Number Count In Fastq.Gz File

Entering edit mode

13.8 years ago

Bioscientist ★ 1.7k

Do we have any easy, fast way to know how many sequences contained in paired-end fastq.gz file? One simple way I think to calculate is to count the # of lines in fastq file divided by 4. (Four line for one sequence)

But I'm afraid it'll be quite slow for the calculation due to the large size of fastq.gz

What about the sequences in .sai/.sam file? (Because I want to guarantee the quality by comparing sequences in fastq file, and how many sequences been processed in .sai/.sam file)

The only way I can think of, is that we have the error report of .sai/.sam file generation, in which we get the information like "10000000 sequences have been processed". I can print this line by python, but how can I print out this number out of the line?

thx

To Ale: thx, but what's the rationale for your first command? I mean about ^@ Just look at the fastq.gz file below, where is ^@ ? I see many @, but not ^

@IL11_266:1:67:0:838/1
AAAAAAAAAAAAAAAAAAAAANNANNNANANAAAAA
+
@@AAAAAAA@AAAAAABAAAB&&B&%%B%>%=<>>>
@IL11_266:1:67:3:594/1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAC
+
@@AAAAAAA@AAA?AA@A@AB?AB@AAB7>&=><A2

Edit again: Also I try both of the commands, and find the result is 2-fold difference. I compared with .sai file error report, which said 853813 sequences have been processed, then should say 'zgrep' command double counts.

[x@hpcc01 bwa2]$ zcat ERR002356_1.recal.fastq.gz | echo $((`wc -l`/4))
853813
[x@hpcc01 bwa2]$ zgrep -c '^@' ERR002356_1.recal.fastq.gz
1643596

EDIT: THX. But I'm not dealing with one file. Plz look at below:

#!/usr/bin/python

import os,sys,re
import gzip
import commands

path = "/home/xug/nearline"

for file in os.listdir(path):
  if re.match('.*\.recal.fastq.gz', file):
    fullpath = os.path.join(path, file)
    result = commands.getoutput('zcat fullpath |wc -l')
    numseqs = int(result)/4.0
  print numseqs

Problem is, I define 'fullpath' here for all fastq files, but after being put under ' ', seems this fullpath doesn't work...How can I solve this problem? thx

Problem solves....fullpath is variable. ..so 'zcat '+fullpath+' |wc -l'

counts sequence fastq • 73k views

ADD COMMENT • link updated 2.2 years ago by Ram 45k • written 13.8 years ago by Bioscientist ★ 1.7k

Entering edit mode

@bioscientist, this is a very good point - [z]grep cannot be used (like for fasta) because @ and + can also occur in the quality line.

ADD REPLY • link updated 2.2 years ago by Ram 45k • written 13.8 years ago by Aleksandr Levchuk 3.2k

Entering edit mode

^ means --> begins with ...

ADD REPLY • link updated 2.2 years ago by Ram 45k • written 13.8 years ago by Aleksandr Levchuk 3.2k

Entering edit mode

The following should do the trick:

zgrep -c '^@.*/[0-9]$' fastq.gz

ADD REPLY • link updated 2.2 years ago by Ram 45k • written 13.8 years ago by Andreas ★ 2.5k

Entering edit mode

In general, checking for a terminal digit will not work because Sanger-encoded Fastq qualities include characters in the range ASCII 33 to 126, that is to say the qualities may contain digits. Your regex will still potentially match quality lines.

ADD REPLY • link 13.8 years ago by biobot 0.0.77.a.1099 6.2k

Entering edit mode

# to count the number of raw sequences by number of lines divided by 4, in a gzipped fastq file,
echo $(($(zcat $file | wc -l) / 4))

ADD REPLY • link updated 2.2 years ago by Ram 45k • written 2.2 years ago by Sergei • 0

Entering edit mode

Did you even read the post? The first two sentences of the text says:

Do we have any easy, fast way to know how many sequences contained in paired-end fastq.gz file? One simple way I think to calculate is to count the # of lines in fastq file divided by 4. (Four line for one sequence)

So your answer is something that OP knew and discounted right at the beginning. Plus, the top two answers specify this and even parallelize it. Please don't add an answer to a really old post unless you're actually contributing something significant. I'm moving your "answer" to a comment, it'll probably be deleted soon.

ADD REPLY • link 2.2 years ago by Ram 45k

Entering edit mode

13.8 years ago

Aleksandr Levchuk 3.2k

Your method should also work and it should be pretty fast:

zcat my.fastq.gz | echo $((`wc -l`/4))

For SAM it's:

cat my.sam | grep -v '^ *@' | wc -l

I'm not sure how to interact with .sai files.

NOTE1: You can call the above Bash commands from Python:

import commands
result = commands.getoutput('zcat my.fastq.gz | wc -l')
numseqs = int(result) / 4.0

If you are doing Bioinfromatics you want to have the GNU toolset which includes the Bash shell (it's default on Linux and Mac, on Win you can install Cygwin)

NOTE2: I'm assuming that all input files have a newline at the end. It's easy to check: run

zcat my.fastq.gz | tail -n 1 | cat -E

if you see a dollar sign ($) then everything is OK. Conversely, you may have generated the files on Windows or from a newbie/ignorant code that does not place a newline at the end of the file, then you will not see a the dollar sign. In that case sanitize the files like this:

zcat my.fastq.gz  | ruby -pne '$_.chomp!; $_ << "\n"' | gzip > good.fastq.gz

NOTE3: I'm assuming that your FASTQ files don't wrap lines. One line per read. You can test like this:

zcat my.fastq.gz  | ruby -e '
skip_line = false
last_char = "+"

while gets
  if skip_line
    skip_line = false
    next 
  end

  if $_ =~ /^@/ and last_char == "+"
    last_char = "@"

  elsif $_ =~ /^\+/ and last_char == "@"
    last_char = "+"

  else
    STDERR.puts "WARNING: fastq lines wrap at line #{$.}" 
    exit(1)
  end

  skip_line = true
end

STDERR.puts "fastq lines do not wrap"
'

NOTE4: Instead of all of the above you can use a filesystem with online compression (i.e. ZFS) and do all the analysis on fastq files instead of fastq.gz

ADD COMMENT • link updated 2.2 years ago by Ram 45k • written 13.8 years ago by Aleksandr Levchuk 3.2k

Entering edit mode

Uh, you are making the assumption that the Fastq file isn't wrapping the sequences. Although this is uncommon, it's allowed within the standard. Also, SAM files may have a separate header section, you need to make sure only to count the aligned reads.

ADD REPLY • link 13.8 years ago by Ketil 4.2k

Entering edit mode

thx Ale, but this is shell command, right? I want to do this within Python script.

ADD REPLY • link 13.8 years ago by Bioscientist ★ 1.7k

Entering edit mode

yeah the line counting thing to easily flummoxed by missing end-of-file carriage returns (bad but common) and the SAM thing is just outright wrong in headered SAM files.

ADD REPLY • link 13.8 years ago by Jeremy Leipzig 23k

Entering edit mode

@Jeremy, @Ketil - I took into account your comments.

ADD REPLY • link 13.8 years ago by Aleksandr Levchuk 3.2k

Entering edit mode

10.7 years ago

summerela ▴ 190

I'm a bit late to the party, but here's an answer using gnu parallel that should do the trick.. and quickly:

parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: *.gz

This assumes your fastq files are gzipped and in standard fastq format.

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by summerela ▴ 190

Entering edit mode

tks summerela \o/

ADD REPLY • link 9.2 years ago by renatopuga ▴ 10

Entering edit mode

10.7 years ago

Matt Shirley 10k

Here's a method to count the number of reads from the BAM index file:

	from subprocess import Popen PIPE

	def bam_read_count(bamfile):
	""" Return a tuple of the number of mapped and unmapped reads in a bam file """
	p = Popen(['samtools', 'idxstats', bamfile], stdout=PIPE)
	mapped = 0
	unmapped = 0
	for line in p.stdout:
	rname, rlen, nm, nu = line.rstrip().split()
	mapped += int(nm)
	unmapped += int(nu)
	return (mapped, unmapped)

view raw readcount.py hosted with ❤ by GitHub

The speed of this method is independent of the number of reads in your BAM file.

ADD COMMENT • link updated 2.2 years ago by Ram 45k • written 10.7 years ago by Matt Shirley 10k

Entering edit mode

10.7 years ago

Alex Reynolds 36k

I posted a solution here that uses GNU Parallel: Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.7 years ago by Alex Reynolds 36k