Hello,
This may be an odd question, but I recently aligned a fastq file (from mouse total-RNA seq) using STAR, and I wanted to get the total read count for further analysis. I tried to get it from the fastq file using
echo $(cat myfile.fastq|wc -l)/4|bc
And the number was 18,526,266.
I then also thought of trying to get it using the BAM file using
samtools flagstat myfile.bam
And the result was -
23712455 + 0 in total (QC-passed reads + QC-failed reads)
6132434 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
23712455 + 0 mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
I don't understand how the fastq file gives me 18,526,266 reads and the BAM file gives 23,712,455, which is much larger than the result from the former. What am I missing?
That was quite helpful. Thanks a lot!
So, for using the total # of reads, would it be better to use 23.7M - 6.1M ~ 17.5M or the ~18.5M that fastq gives. I would assume the former would be better as the latter would include reads that didn't match anything?
Yes, the number of primary alignments is the number of reads that were aligned.