Question

tophat output files containing the reads which mapped uniquely as a pair

10.7 years ago

trakhtenberg ▴ 160

There is a file called align_summary.txt in the tophat folder (generated by running tophat) which says:

Left reads:
              Input: 128979165
              Mapped:  98314933 (76.2% of input)
              of these:  11898655 (12.1%) have multiple alignments (9004 have >20)
Right reads:
              Input: 128979165
              Mapped:  95536410 (74.1% of input)
              of these:  10769172 (11.3%) have multiple alignments (2289 have >20)
75.1% overall read alignment rate.
Aligned pairs:  92923521
     of these:   8913959 ( 9.6%) have multiple alignments
          and:   1899417 ( 2.0%) are discordant alignments
70.6% concordant pair alignment rate.

Does the last line, "70.6% concordant pair alignment rate", mean that 70.6% of the paired-end reads mapped uniquely (a single match) as a pair? And are these 70.6% of read pairs what is included in accepted_hits.bam?

What about splice-junction reads that mapped uniquely to the transcriptome (rather than the genome): are they included in this 70.6%? In either case, does the junctions.bed file contain splice junctions that mapped uniquely to the transcriptome? Does 70.6% refer to both reads uniquely mapped to the genome and splice junctions uniquely mapped to the transcriptome?

Would appreciate a clarification.

Thank you,
Ephraim Trakhtenberg

TOPHAT RNA-Seq

Ram · Answer 1 · 2014-08-21

A concordant alignment is defined as a pair on the same chromosome/contig with the proper orientation (typically pointing toward each other) with an appropriate distance between their extrema (due to size selection, though remember that the reasonableness of a distance is dependent on its transcript-space representation). Hopefully that was slightly clearer than mud.
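To make that definition concrete, here is a minimal Python sketch of such a check. It is not TopHat's actual code: the `Mate` class, the `is_concordant` function, and the 500 bp `max_outer_distance` default are all hypothetical names and values chosen for illustration. It also measures distance in plain genomic coordinates, so a pair spanning an intron would need a transcript-space distance instead.

```python
from dataclasses import dataclass


@dataclass
class Mate:
    """One aligned mate (hypothetical structure, for illustration only)."""
    ref: str        # chromosome/contig name
    start: int      # 0-based leftmost aligned position
    end: int        # exclusive rightmost aligned position
    reverse: bool   # True if aligned to the reverse strand


def is_concordant(a: Mate, b: Mate, max_outer_distance: int = 500) -> bool:
    """Return True if the pair looks concordant under the simple rules above.

    max_outer_distance is an assumed threshold, not a TopHat default.
    """
    # Rule 1: both mates on the same chromosome/contig.
    if a.ref != b.ref:
        return False
    # Rule 2: opposite strands, with the forward mate upstream,
    # so that the mates point toward each other.
    if a.reverse == b.reverse:
        return False
    fwd, rev = (b, a) if a.reverse else (a, b)
    if fwd.start > rev.start:
        return False
    # Rule 3: distance between the extrema within the allowed range.
    outer_distance = rev.end - fwd.start
    return outer_distance <= max_outer_distance


good_pair = is_concordant(Mate("chr1", 100, 200, False), Mate("chr1", 300, 400, True))
split_pair = is_concordant(Mate("chr1", 100, 200, False), Mate("chr2", 300, 400, True))
```

Here `good_pair` is `True` (same contig, facing each other, 300 bp between extrema) and `split_pair` is `False` because the mates are on different contigs.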

So, this 70.6% number includes both "unique" mappers and multi-mappers. These pairs are included in the accepted_hits.bam file, but they aren't all of its contents: any alignment produced is placed in accepted_hits.bam, including discordant pairs and reads whose mate didn't align. The 70.6% is unrelated to splice junctions, novel or otherwise. The splice junctions are derived from looking at the alignments, but the junctions themselves wouldn't be stored in a BAM file. My guess is that you're trying to ask whether tophat2 uses multi-mappers in finding splice junctions. I don't actually know the answer to that, though I would suspect not (it'd raise the false-positive rate).
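The percentages in align_summary.txt can be reproduced from the counts in the question. The short Python sketch below does the arithmetic. The formulas are inferred from the fact that they match the reported figures, not taken from TopHat's documentation, and the variable names are made up for illustration.

```python
# Counts copied from the align_summary.txt shown in the question.
left_input = 128979165
right_input = 128979165
left_mapped = 98314933
right_mapped = 95536410
aligned_pairs = 92923521
multi_pairs = 8913959        # aligned pairs with multiple alignments
discordant_pairs = 1899417   # aligned pairs flagged as discordant


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place, as in the summary file."""
    return round(100.0 * numerator / denominator, 1)


# Assumed formula: all mapped reads over all input reads.
overall_rate = pct(left_mapped + right_mapped, left_input + right_input)

# Assumed formula: aligned pairs minus discordant pairs, over input pairs.
# Multi-mapping pairs are NOT subtracted, so they count toward this rate.
concordant_rate = pct(aligned_pairs - discordant_pairs, left_input)

# Pairs without multiple alignments; some of these may still be discordant.
single_alignment_pairs = aligned_pairs - multi_pairs

print(overall_rate)      # 75.1
print(concordant_rate)   # 70.6
print(single_alignment_pairs)
```

Because the multi-mapping pairs are not subtracted in the concordant rate, 70.6% cannot be read as the fraction of uniquely mapped pairs. The summary also does not say how many multi-mapping pairs are discordant, so the exact number of unique concordant pairs cannot be derived from these counts alone.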

Hopefully that clarifies things.