Question

How can the read count from a BAM file be greater than the read count from the fastq file?

3.7 years ago

c_u ▴ 530

Hello,

This may be an odd question, but I recently aligned a fastq file (from mouse total-RNA seq) using STAR, and I wanted to get the total read count for further analysis. I tried to get it from the fastq file using

echo $(cat myfile.fastq|wc -l)/4|bc

And the number was 18,526,266.
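The same count can be wrapped in a small function so it also handles gzipped input. This is only a sketch: count_fastq_reads is a made-up name, and it assumes a standard four-line-per-record FASTQ (no wrapped sequence lines). The temporary file is only there to make the example runnable.

```shell
# Hypothetical helper: number of reads in a FASTQ file, assuming 4 lines per record.
count_fastq_reads() {
  if [ "${1##*.}" = "gz" ]; then
    # Decompress to stdout and count lines
    lines=$(gzip -cd -- "$1" | wc -l)
  else
    lines=$(wc -l < "$1")
  fi
  echo $(( lines / 4 ))
}

# Tiny two-record FASTQ for demonstration
tmp=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$tmp"
count_fastq_reads "$tmp"   # prints 2
rm -f "$tmp"
```

On the real data this would be called as count_fastq_reads myfile.fastq.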

I then also tried getting it from the BAM file using

samtools flagstat myfile.bam

And the result was -

23712455 + 0 in total (QC-passed reads + QC-failed reads)
6132434 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
23712455 + 0 mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I don't understand how the fastq file gives me 18,526,266 reads and the BAM file gives 23,712,455, which is much larger than the result from the former. What am I missing?

rna-seq • 2.5k views


score 5 · Accepted Answer · 2021-07-19

3.7 years ago

i.sudbery 21k

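About 6.1M of the ~23.7M records in your BAM file are secondary alignments (23.7M - 6.1M ≈ 17.6M primary alignments). samtools flagstat counts alignment records, not reads, so a read that aligns in several places is counted once per alignment. The most common reason (but not the only reason) for a read to have a secondary alignment is that it aligns to more than one location in the genome (a multimapped read): one of the alignments is designated "primary", more or less arbitrarily among equally good hits, and all the others are marked "secondary". An alternative reason might be that the first part of the read maps to one genome location and the rest of the read to another (a chimeric read), although many aligners now flag those records as supplementary rather than secondary.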
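One way to check this from the flagstat output itself is to subtract the secondary and supplementary records from the total. A minimal sketch, assuming the flagstat line layout shown in the question; primary_from_flagstat is a hypothetical helper name, not part of samtools.

```shell
# Hypothetical helper: read `samtools flagstat` text on stdin and print
# total - secondary - supplementary, i.e. the number of primary records.
primary_from_flagstat() {
  awk '
    / in total /      { total = $1 }
    / secondary$/     { secondary = $1 }
    / supplementary$/ { supplementary = $1 }
    END { print total - secondary - supplementary }
  '
}

# Demonstration on the three relevant lines posted in the question
primary_from_flagstat <<'EOF'
23712455 + 0 in total (QC-passed reads + QC-failed reads)
6132434 + 0 secondary
0 + 0 supplementary
EOF
# prints 17580021
```

On a real file this would be fed with samtools flagstat myfile.bam | primary_from_flagstat. Running samtools view -c -F 0x900 myfile.bam should give the same figure directly, since -F 0x900 excludes records flagged secondary (0x100) or supplementary (0x800).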


That was quite helpful. Thanks a lot!

3.7 years ago by c_u ▴ 530

So, for the total number of reads, would it be better to use 23.7M - 6.1M ≈ 17.6M, or the ~18.5M that the fastq gives? I would assume the former would be better, as the latter would include reads that didn't align to anything.
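For what it's worth, the gap between those two candidates can be worked out from the numbers posted above. This is only arithmetic on the figures in the question; it assumes single-end reads with no supplementary records (as the flagstat output shows), so each aligned read contributes exactly one primary record. The variable names are made up for illustration.

```shell
fastq_reads=18526266    # read count from the fastq line count in the question
bam_total=23712455      # flagstat "in total"
bam_secondary=6132434   # flagstat "secondary"

# Primary records = reads with at least one alignment (under the assumptions above)
primary=$(( bam_total - bam_secondary ))
# Reads in the fastq that have no record in the BAM
missing=$(( fastq_reads - primary ))

echo "primary alignments: $primary"
echo "fastq reads without a record in the BAM: $missing"
```

STAR leaves unmapped reads out of the BAM by default, which would be consistent with that gap, but it is worth confirming against the Log.final.out file that STAR writes.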