I've aligned an RNA-Seq library using Tophat, and I'm confused about the output as there seems to be some disagreement that I can't figure out. The align_summary.txt that Tophat creates looks like this:
Left reads:
Input : 49801387
Mapped : 46258301 (92.9% of input)
of these: 1703360 ( 3.7%) have multiple alignments (103886 have >20)
Right reads:
Input : 49801387
Mapped : 45542088 (91.4% of input)
of these: 1680180 ( 3.7%) have multiple alignments (103887 have >20)
92.2% overall read mapping rate.
Aligned pairs: 44100265
of these: 1622622 ( 3.7%) have multiple alignments
261400 ( 0.6%) are discordant alignments
88.0% concordant pair alignment rate.
And the output of samtools flagstat on the accepted_hits.bam file produced by Tophat is:
$ samtools flagstat accepted_hits.bam
105064647 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
105064647 + 0 mapped (100.00%:-nan%)
105064647 + 0 paired in sequencing
52934350 + 0 read1
52130297 + 0 read2
89670194 + 0 properly paired (85.35%:-nan%)
101093928 + 0 with itself and mate mapped
3970719 + 0 singletons (3.78%:-nan%)
5036220 + 0 with mate mapped to a different chr
248094 + 0 with mate mapped to a different chr (mapQ>=5)
So at the bottom of align_summary it says that we have an 88% concordant pair alignment rate. Since I input 49,801,387 pairs, then 88% of that is 43,825,220.56. It says we have 44,100,265 aligned pairs. We can subtract the 261,400 that are discordant alignments and that gives us 43,838,865. Close, but not the same. How is that 88% calculated?
Furthermore, if we go to the samtools flagstat output for accepted_hits.bam file, it reports 89670194 reads properly paired, or if we divide by two 44,835,097. So we have three numbers for the number of aligned pairs, all of which are similar but not exactly the same. Does anyone have any insight on this?
A related question: When you do RNA-seq alignment and want to report your alignment rate, how do you get that number?