I have some RNA-Seq data (paired-end reads) that I aligned with TopHat. This is what the align summary looks like:
Left reads:
    Input     : 25671258
    Mapped    : 22149823 (86.3% of input)
    of these  : 2005259 ( 9.1%) have multiple alignments (200624 have >20)
Right reads:
    Input     : 25671258
    Mapped    : 21866868 (85.2% of input)
    of these  : 1977056 ( 9.0%) have multiple alignments (199383 have >20)
85.7% overall read mapping rate.
Aligned pairs: 21161013
    of these  : 1903746 ( 9.0%) have multiple alignments
                801300 ( 3.8%) are discordant alignments
79.3% concordant pair alignment rate.
When I run
samtools flagstat accepted_hits.bam
This is the result:
66612729 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
66612729 + 0 mapped (100.00%:nan%)
66612729 + 0 paired in sequencing
33517595 + 0 read1
33095134 + 0 read2
34458604 + 0 properly paired (51.73%:nan%)
64054464 + 0 with itself and mate mapped
2558265 + 0 singletons (3.84%:nan%)
13192728 + 0 with mate mapped to a different chr
402500 + 0 with mate mapped to a different chr (mapQ>=5)
I don't understand why the concordant pair alignment rate reported by TopHat (79.3%) does not correspond to the percentage of properly paired reads reported by samtools flagstat (51.73%). In addition, I find paired reads in the BAM file whose mates are mapped to different chromosomes. Could you please help me understand this?
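To make the discrepancy concrete, here is a small Python sketch that recomputes both percentages from the numbers pasted above. The formulas are my assumptions: I take TopHat's concordant rate to be (aligned pairs minus discordant pairs) divided by input pairs, and flagstat's percentage to be properly paired records divided by total records in the BAM. They are not taken from either tool's documentation, and the variable names are my own.

```python
# Numbers copied from the TopHat align summary above.
input_pairs = 25671258
left_mapped = 22149823
right_mapped = 21866868
aligned_pairs = 21161013
discordant_pairs = 801300

# Numbers copied from the samtools flagstat output above.
total_records = 66612729
properly_paired_records = 34458604

# Assumed TopHat formula: concordant pairs over input pairs.
concordant_rate = (aligned_pairs - discordant_pairs) / input_pairs

# Assumed flagstat formula: properly paired records over all records.
properly_paired_rate = properly_paired_records / total_records

# Mapped reads according to TopHat (left + right), for comparison
# with the number of records flagstat counts in the BAM file.
mapped_reads = left_mapped + right_mapped
records_per_mapped_read = total_records / mapped_reads

print(f"TopHat concordant pair rate : {concordant_rate:.1%}")
print(f"flagstat properly paired    : {properly_paired_rate:.2%}")
print(f"Mapped reads (TopHat)       : {mapped_reads}")
print(f"Records in BAM (flagstat)   : {total_records}")
print(f"Records per mapped read     : {records_per_mapped_read:.2f}")
```

Both assumed formulas reproduce the reported percentages, so the two tools appear to be using different denominators. The BAM file also holds more records (66612729) than TopHat reports as mapped reads (44016691), which suggests the two are not counting the same thing.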