Do we have any easy, fast way to know how many sequences contained in paired-end fastq.gz file? One simple way I think to calculate is to count the # of lines in fastq file divided by 4. (Four line for one sequence)
But I'm afraid it'll be quite slow for the calculation due to the large size of fastq.gz
What about the sequences in .sai/.sam file? (Because I want to guarantee the quality by comparing sequences in fastq file, and how many sequences been processed in .sai/.sam file)
The only way I can think of, is that we have the error report of .sai/.sam file generation, in which we get the information like "10000000 sequences have been processed". I can print this line by python, but how can I print out this number out of the line?
thx
To Ale: thx, but what's the rationale for your first command? I mean about ^@ Just look at the fastq.gz file below, where is ^@ ? I see many @, but not ^
@IL11_266:1:67:0:838/1
AAAAAAAAAAAAAAAAAAAAANNANNNANANAAAAA
+
@@AAAAAAA@AAAAAABAAAB&&B&%%B%>%=<>>>
@IL11_266:1:67:3:594/1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAC
+
@@AAAAAAA@AAA?AA@A@AB?AB@AAB7>&=><A2
Edit again: Also I try both of the commands, and find the result is 2-fold difference. I compared with .sai file error report, which said 853813 sequences have been processed, then should say 'zgrep' command double counts.
[x@hpcc01 bwa2]$ zcat ERR002356_1.recal.fastq.gz | echo $((`wc -l`/4))
853813
[x@hpcc01 bwa2]$ zgrep -c '^@' ERR002356_1.recal.fastq.gz
1643596
EDIT: THX. But I'm not dealing with one file. Plz look at below:
#!/usr/bin/python
import os,sys,re
import gzip
import commands
path = "/home/xug/nearline"
for file in os.listdir(path):
if re.match('.*\.recal.fastq.gz', file):
fullpath = os.path.join(path, file)
result = commands.getoutput('zcat fullpath |wc -l')
numseqs = int(result)/4.0
print numseqs
Problem is, I define 'fullpath' here for all fastq files, but after being put under ' ', seems this fullpath doesn't work...How can I solve this problem? thx
Problem solves....fullpath is variable. ..so 'zcat '+fullpath+' |wc -l'
@bioscientist, this is a very good point - [z]grep cannot be used (like for fasta) because
@
and+
can also occur in the quality line.^
means --> begins with ...The following should do the trick:
In general, checking for a terminal digit will not work because Sanger-encoded Fastq qualities include characters in the range ASCII 33 to 126, that is to say the qualities may contain digits. Your regex will still potentially match quality lines.
Did you even read the post? The first two sentences of the text says:
So your answer is something that OP knew and discounted right at the beginning. Plus, the top two answers specify this and even parallelize it. Please don't add an answer to a really old post unless you're actually contributing something significant. I'm moving your "answer" to a comment, it'll probably be deleted soon.