Question

tophat --max-multihits impacts 'mapped in a proper pair'

2

Entering edit mode

8.0 years ago

Carlo Yague 8.9k

Hi everyone,

I have a RNA-seq library (paired-end) with a large amount of multimappers because it comes from a total RNA extract with partial ribodepletion (~50 % of reads are from rRNAs). I tried mapping those reads with tophat changing the --max-multihits parameter (1 and 10), and I was surprised to see that the number of reads mapped in a proper pair is massively - from 38 to 87% - affected by that option.

samtools flagstat accepted_hits_1.bam # tophat --max-multihits 1 

111707163 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
111707163 + 0 mapped (100.00% : N/A)
111707163 + 0 paired in sequencing
54154286 + 0 read1
57552877 + 0 read2
42710174 + 0 properly paired (38.23% : N/A)
50502846 + 0 with itself and mate mapped
61204317 + 0 singletons (54.79% : N/A)
0 + 0 with mate mapped to a different chr

samtools flagstat accepted_hits_2.bam # tophat --max-multihits 10

379040545 + 0 in total (QC-passed reads + QC-failed reads)
267241809 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
379040545 + 0 mapped (100.00% : N/A)
111798736 + 0 paired in sequencing
54212619 + 0 read1
57586117 + 0 read2
96903430 + 0 properly paired (86.68% : N/A)
97210716 + 0 with itself and mate mapped
14588020 + 0 singletons (13.05% : N/A)
0 + 0 with mate mapped to a different chr

I first thought that this change was due the secondary alignment, but if I filter them out (samtools view -F 256), I still get 87% reads mapped in a proper pair.

What exactly happened here ?

My intuition is that reads are mapped independantly from their mate up to 1 or 10 times. After that tophat looks if there is one combination of position from the paired reads that leads to proper pairing and reports that pair of position. Can anyone confirm this or provide additional insight ?

Thank you,

Carlo

RNA-Seq tophat multireads multihits paired-end • 3.1k views

ADD COMMENT • link updated 8.0 years ago by John 13k • written 8.0 years ago by Carlo Yague 8.9k

0

Entering edit mode

8.0 years ago

John 13k

Macspider's answer is fantastic, but probably wrong.

After --max-multihits 10, you have 3.39x as many alignments in your BAM file. Horray.

Of those new alignments, most of them will be from reads that originally were properly mapped in a pair, but now also have roughly 2 more multimap alternatives (that are also properly mapped in a pair). Another way of phrasing it is, if a read didn't map, it isn't going to multi-map either. So your over-representing your well-mapped reads by including multimaps.

I bet if you count the number of reads properly mapped in a pair that aren't secondary, you'll see its still somewhere around 42710174.

If there was a secret switch that improved mapping by roughly 3.39x, everyone would be using it -_-;

(also, if there was a format that was fragment-centric, and you counted the fragments with at least 1 properly mapped pair alignment, there wouldn't be this confusion between reads and alignments. Alas, no format seems to exist.)

ADD COMMENT • link 8.0 years ago by John 13k

1

Entering edit mode

It might be wrong, but I'm not sure! Because I had a similar problem at the beginning of my PhD and I had a similar increase in the proper pair number without losing quality. I think that maybe there is something else that we might not taking into account, because after reading your answer I realized that mine is probably simplicistic.

ADD REPLY • link 8.0 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

No i'd say it's over-complicated. Good job on the PhD though! :) I'm still waiting for mine :(

ADD REPLY • link 8.0 years ago by John 13k

1

Entering edit mode

Hi John,

I understand your concern, I had the same doubt, but when I filtered out secondary alignment, I still got the same number of properly mapped pairs.

samtools view -F 256 accepted_hits_2.bam - | samtools flagstat

111798736 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
111798736 + 0 mapped (100.00% : N/A)
111798736 + 0 paired in sequencing
54212619 + 0 read1
57586117 + 0 read2
96903430 + 0 properly paired (86.68% : N/A)
97210716 + 0 with itself and mate mapped
14588020 + 0 singletons (13.05% : N/A)
0 + 0 with mate mapped to a different chr

ADD REPLY • link 8.0 years ago by Carlo Yague 8.9k

1

Entering edit mode

What. The. **.

OK first off I was wrong and Macspider was closer to the mark, but actually neither of us are right. From what i understand from Macspider's answer, he's saying that with --max-multihits 10, tophat will try 9 other seeds after the first one, find a better mapping, and that better mapping is more likely to generate a properly mapped pair.

But what these numbers are suggesting is that you went from:

50502846 + 0 with itself and mate mapped
61204317 + 0 singletons (54.79% : N/A)

to

97210716 + 0 with itself and mate mapped
14588020 + 0 singletons (13.05% : N/A)

Thus, max-multihits isn't just finding you better alignments, it's finding alignments where previously it hadn't found an alignment at all. Previously one of the pairs aligned, creating a singleton. Now both do, creating itself-and-mate-mapped. That accounts for pretty much all of the difference you are seeing in properly mapped in a pair.

Frankly, this is appalling, and you were absolutely right to be shocked and create this thread. What the hell is going on Tophat?! If you don't find a map from the first seed, shouldn't you retry as many seeds as possible until you do? Perhaps giving up after, I don't know, at least until you're out of the properly-mapped-in-a-pair window up/down stream of the mapped mate? OK, sure, if you found a map and its good, perhaps there's no need to try and find another max 10 multimaps... but wow. I'm really shocked.

ADD REPLY • link 8.0 years ago by John 13k

1

Entering edit mode

Nice catch, this is indeed quite suprising. This could explain why the default setting of tophat is 20 for -max--multihits when most people want to get rid of multi-reads.

ADD REPLY • link 8.0 years ago by Carlo Yague 8.9k

1

Entering edit mode

This happens because TopHat gradually releases the quality parameters to generate secondary alignments. The primary alignment will always be the same for a single end read, but with paired end reads it is better (imho) to take a pair that is within insert size with 3 mismatches in the seeds, rather than one out of the insert size but with just 1 mismatch; and you'll find the one with 3 mismatches only if you allow more multihits. This because variants can happen and your reads might contain one.

ADD REPLY • link 8.0 years ago by Matteo Schiavinato ★ 3.6k

score 4 · Accepted Answer · 2016-11-18

Hi Carlo,

The key part to understand what happens is in the TopHat help:

Instructs TopHat to allow up to this many alignments to the reference for a given read, and choose the alignments based on their alignment scores if there are more than this number.

Alignment works with a seed search and a subsequent seed extention applying boni and mali depending on the sequence similarity given a score matrix defined by the algorithm. This means that, if you specify --max-multihits 2 for example, the program will search the best seed and extend it, then it will search the second best seed and extend it, and it will report as primary alignment the best scoring of the two while down-ranking the second best to secondary alignment.

"I tried mapping those reads with tophat changing the --max-multihits parameter (1 and 10), and I was surprised to see that the number of reads mapped in a proper pair is massively - from 38 to 87%"

Given what I said before, what happens is that your command searches one seed, extends it and says "wow this is a primary alignment for this read!" but doesn't try another one, when you use --max-multihits 1. When you use --max-multihits 10, instead, it tries 10 times before saying which one is the best.

"What exactly happened here ?"

Same as what happens when you explore an energy landscape with a conformation of a protein, you might not end up with the best fold for the protein just as much as you might not end up with the best alignment for the read if you try only once (because local minima).

The proper pair is a pair of reads that map as primary alignments and within insert size. The more you try aligning reads, the more proper pairs you get because the reads that didn't find a good match at the first tryout, eventually find it at the second, or n-th.