Good morning everybody, I was doing some simple statistics on the 2 fastq file of a sample (forward and reverse). I simply wanted to obtain the number of the total reads for each of the two files, through this command:
zcat $SAMPLE.fwd.fq.gz | wc -l | awk '{print $1/4}'
zcat $SAMPLE.rev.fq.gz | wc -l | awk '{print $1/4}'
the sum of the two resulting numbers is 640000162. I thought that this should be the number of total reads, but when I run samtools flagstat
on the resulting bam ($SAMPLE.final.bam
), here comes the output:
643938688 + 0 in total (QC-passed reads + QC-failed reads)
3938526 + 0 secondary
0 + 0 supplementary
129129835 + 0 duplicates
642067899 + 0 mapped (99.71% : N/A)
640000162 + 0 paired in sequencing
320000081 + 0 read1
320000081 + 0 read2
623066112 + 0 properly paired (97.35% : N/A)
637384172 + 0 with itself and mate mapped
745201 + 0 singletons (0.12% : N/A)
8707820 + 0 with mate mapped to a different chr
4103776 + 0 with mate mapped to a different chr (mapQ>=5)
So, here comes that the reads are actually 643938688 and not 640000162 (paired in sequencing) according to samtools. Now, I am a bit confused about what these values mean, if the total reads are 643938688 or 640000162 (excuse me for my naiveness). Could you give me some hint or some explanation about? Thank you, Giuseppe.
Thanks for your comments Sukjun Kim. How is possible to have more alignments in a BAM file than reads in the FASTQ file?