There is a file called align_summary.txt in the tophat folder (generated by running tophat) which says:
Left reads:
Input: 128979165
Mapped: 98314933 (76.2% of input)
of these: 11898655 (12.1%) have multiple alignments (9004 have >20)
Right reads:
Input: 128979165
Mapped: 95536410 (74.1% of input)
of these: 10769172 (11.3%) have multiple alignments (2289 have >20)
75.1% overall read alignment rate.
Aligned pairs: 92923521
of these: 8913959 ( 9.6%) have multiple alignments
and: 1899417 ( 2.0%) are discordant alignments
70.6% concordant pair alignment rate.
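The two percentages in this summary can be reproduced from the raw counts, which helps show what each one is a fraction of. Below is a minimal sketch in Python. The formula for the concordant rate, (aligned pairs minus discordant pairs) divided by input pairs, is an assumption inferred from the numbers above rather than something taken from the TopHat documentation, and the variable names are made up for illustration.

```python
# Counts copied from the align_summary.txt output above.
left_input = 128979165
left_mapped = 98314933
right_input = 128979165
right_mapped = 95536410
aligned_pairs = 92923521
discordant_pairs = 1899417

# Overall rate: mapped reads (left + right) over all input reads.
overall_rate = 100.0 * (left_mapped + right_mapped) / (left_input + right_input)

# Assumed formula: pairs that aligned and are not discordant, over input pairs.
# Note that multi-mapped pairs are NOT subtracted here.
concordant_pairs = aligned_pairs - discordant_pairs
concordant_rate = 100.0 * concordant_pairs / min(left_input, right_input)

print(f"overall read alignment rate: {overall_rate:.1f}%")
print(f"concordant pair alignment rate: {concordant_rate:.1f}%")
```

Under that assumption both reported figures (75.1% and 70.6%) come out as shown in the file, without removing the 8913959 pairs that have multiple alignments.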
Does the last line, "70.6% concordant pair alignment rate", mean that 70.6% of paired-end reads mapped uniquely (single match) as a pair? And are these 70.6% of paired reads what is included in accepted_hits.bam?
What about splice junctions that mapped uniquely to the transcriptome (rather than the genome): are they included in this 70.6%? In either case, does the junctions.bed file contain splice junctions that mapped uniquely to the transcriptome? Does the 70.6% refer to both reads uniquely mapped to the genome and splice junctions uniquely mapped to the transcriptome?
Would appreciate a clarification.
Thank you,
Ephraim Trakhtenberg
Thank you, this answers my question. To summarize: the last line in the align_summary.txt file is not a summary of all of the above but rather information specifically about concordant alignments, including both uniquely and multiply mapped read pairs. accepted_hits.bam does not contain exclusively uniquely mapped read pairs. It is unclear whether the junctions.bed file contains only splice junctions supported by uniquely mapped alignments or also those supported by multimappers. And it does not appear that align_summary.txt provides any statistics on the mapping of splice junctions. So this leads to other questions, which I have now posted here: Extracting from tophat outputs reads pairs and splice-junctions with a single best match
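As a starting point for that follow-up, here is a rough sketch of how single-match records could be picked out of SAM text (for example the output of samtools view on accepted_hits.bam). It assumes that each alignment record carries an NH:i: tag giving the number of reported alignments for the read, and that NH:i:1 can be taken to mean a single match. The function names are hypothetical, and the filter works per record, so it does not check that both mates of a pair are unique.

```python
def is_unique(sam_line):
    """Return True if a SAM record carries the tag NH:i:1."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start at the 12th column (index 11) of a SAM record.
    return "NH:i:1" in fields[11:]


def unique_records(sam_lines):
    """Yield alignment records with NH:i:1, skipping '@' header lines."""
    for line in sam_lines:
        if line.startswith("@"):
            continue
        if is_unique(line):
            yield line


# Made-up records for illustration only (11 mandatory columns plus tags).
example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t99\tchr1\t100\t50\t4M\t=\t200\t104\tACGT\tIIII\tNM:i:0\tNH:i:1",
    "r2\t99\tchr1\t300\t3\t4M\t=\t400\t104\tACGT\tIIII\tNM:i:0\tNH:i:12",
]
kept = list(unique_records(example))
print(len(kept))
```

Matching the tag as a whole field (rather than as a substring of the line) avoids accidentally keeping records such as NH:i:12.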