Hello! I hope somebody can provide some insight into TopHat and Bowtie2 behavior. I have a data set that I aligned to a reference using Bowtie2 and Tophat, both running default settings. When I used Bowtie, the alignment rates were >90% for all samples. With Tophat, however, the rates were all around 45-55%. Since Tophat uses Bowtie, I'm not sure why the resulting alignment rates are so much lower. My best guess is that when Tophat calls Bowtie, it uses different settings than the default settings for a user using Bowtie directly, but I haven't been able to figure this out from the respective manuals. Could anyone explain why this might be? I'm curious because there's apparently something about my reads that is very sensitive to differences in aligners used, and I want to know what it is.
I'm not sure what additional information is needed to answer this question but I'm ready to provide it. Thank you!
ETA: I'm using Tophat2 and Bowtie2, latest versions of both
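For anyone comparing the two rates, it may help to compute them the same way from the alignment records rather than from each tool's own summary. Below is a minimal sketch, assuming uncompressed SAM text that contains both mapped and unmapped records; `alignment_rate` is a hypothetical helper written for this thread, not part of Bowtie2 or TopHat2. It counts only primary records, so reads with several reported alignments are not counted twice.

```python
def alignment_rate(sam_lines):
    """Return the fraction of primary SAM records that are mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2 of a SAM record is the bitwise FLAG
        # Skip secondary (0x100) and supplementary (0x800) records.
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        # Bit 0x4 set means the record is unmapped.
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0


# Made-up records for illustration only.
example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t42\t50M\t*\t0\t0\t*\t*",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r3\t16\tchr1\t300\t42\t50M\t*\t0\t0\t*\t*",
    "r3\t272\tchr2\t900\t1\t50M\t*\t0\t0\t*\t*",  # secondary alignment of r3
]
print(alignment_rate(example))
```

Here two of the three primary records are mapped, so the function returns 2/3.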
Hello am3930,
Please do not use TopHat
If you want to compare two genome aligners you can take a look at BWA and Bowtie2
I don't want to over-emphasize my point, but I strongly disagree with the conclusion that you shouldn't use TopHat (and that tweet is actually about TopHat1, not TopHat2).
I have some points about that saved here:
http://cdwscience.blogspot.com/2019/01/tophat-really-isnt-that-bad.html
(also, for what it is worth, Lior didn't add a comment to that post, but he did tweet about it, considerably increasing viewership)
I will be brief, since I do not want to argue all day either.
As you pointed out in your blog, in some rare scenario TopHat2 can still be useful.
My point is that most posts on Biostars are about global RNAseq analysis using TopHat, which is not the best tool in terms of memory usage, running time, and accuracy for general RNAseq experiments:
https://www.nature.com/articles/nmeth.2722
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5792058/
Before editing the thread, OP didn't mention whether they were using TopHat2, and still has not mentioned whether they have DNA or RNA reads, or whether they want to discover novel splicing events, do a differential expression analysis...
Lior Pachter's tweet is about TopHat version 1 and in the same vein created TopHat2, HISAT and HISAT2
You can also read on TopHat website
I'm more inclined to believe someone who wrote the tool and knows perfectly how it works than someone who uses TopHat and gets good results in specific conditions (no offense at all)
With genuine thanks to everyone here for taking the time to answer, I'd like to gently suggest that we move away from discussion of the "best" tools in some global sense. I'm not looking for advice on how to do my final analysis. As I explained, I'm motivated by curiosity. My goal in THIS instance is to understand why two tools based on the same underlying alignment method produce such wildly different alignment rates. Thank you!
I think it is important to respect am3's wishes, so I will try to be brief and not have another comment about "best" aligners on this thread:
1) You are right that this is posted on the TopHat2 website. I think it is important that the code for TopHat2 still be available (and you can find useful applications that were not explicitly designed by the developers; plus, I even had a period of time when I recommended people not use COHCAP, so I have disagreed with myself as a developer). However, I should be more careful not to imply that the developers agree with my opinion.
2) I think the scenarios where TopHat2 is not useful are actually limited. For example, I have used TopHat2 for all the labs mentioned in this acknowledgement (which I guess could arguably be like a placeholder until there can be some sort of more formal paper; however, I apologize that you need to scroll down to see the long acknowledgement). If something did seem off about the TopHat2 alignment for an important gene, I always tested a STAR alignment (but I never found that changed the trend of expression for any of the genes that I have checked, at least so far). So, if you have 50 bp SE reads, I think use of TopHat2 is usually quite reasonable.
3) Lior is not an author on TopHat2 or HISAT papers (only on TopHat1, and he isn't even acknowledged in the HISAT paper). However, again, I should be focusing more on my first-hand experience and not implying that other people necessarily agree with me (so, I apologize about that).
4) I'll try to take some time to look into the Engström et al. paper more closely, but I already have a response about the Baruzzo et al. paper: the simulated data T2 and T3 categories are less typical of what you would actually observe in an experiment. In terms of showing the alignment rate is lower in more divergent sequences, it actually complements my argument that TopHat2 can be useful as a more conservative alignment (if you want to avoid alignments from unintended sequence / contamination).
5) (update) I agree that there can be differences where TopHat2 may not be as good of an option with paired-end data. I can also believe that there are examples where TopHat2 is not the best option for splicing analysis. --> However, I most frequently encounter ~50 bp SE reads, so perhaps that explains some differences between my observations and the Engström et al. paper (although I actually think Figure 6 in that paper looks pretty good for Tophat2).
(update) That said, thank you very much for sharing those links to reviews. For example, it may not be as relevant for gene expression analysis (which is what I was emphasizing), but perhaps Figure 2b may be worth considering (although maybe GATK functions like SplitNCigarReads and GATK parameters like -dontUseSoftClippedBases can make TopHat2 and STAR results relatively more similar? Plus, to be fair, I would admittedly probably use a STAR alignment for an "initial" analysis of mutation calling for RNA-Seq data, but with alignment post-processing that I wouldn't use for gene expression analysis).
I also apologize that this wasn't so brief (which I realized after posting).
Likewise, I was also trying to give some indication of when I updated this comment later in the day (to avoid having a whole different comment that may not be as relevant to the question), so I apologize for the messiness.
Thank you, I'm aware of the improved methods that exist, but I'm asking the question to satisfy my curiosity now, to better understand both my data and the methods.
According to the manual, Tophat indeed uses bowtie2: https://ccb.jhu.edu/software/tophat/manual.shtml, so I don't think that's it.
I know the general principle of how Tophat works (using bowtie2 to align first, then finding splice sites in unaligned reads). What I don't understand is how the final rate of aligned reads can be so much LOWER using Tophat, which I naively understand to work like "run bowtie2 and then do some more work on the stuff that doesn't align with bowtie2."
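The mental model described above can be written out as a toy two-stage aligner. This is only a sketch of that model: exact string matching stands in for the Bowtie2 pass, the function names and the `min_anchor` / `min_intron` parameters are made up for illustration, and the real TopHat2 pipeline is considerably more involved.

```python
def align_unspliced(read, ref):
    """Stage 1: exact match of the whole read (toy stand-in for the first pass)."""
    pos = ref.find(read)
    return None if pos < 0 else pos


def align_spliced(read, ref, min_anchor=4, min_intron=5):
    """Stage 2: split the read in two and look for the halves separated by a gap."""
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:cut], read[cut:]
        left_pos = ref.find(left)
        while left_pos >= 0:
            # The right half must start at least min_intron bases after the left half ends.
            right_pos = ref.find(right, left_pos + cut + min_intron)
            if right_pos >= 0:
                return left_pos, cut, right_pos
            left_pos = ref.find(left, left_pos + 1)
    return None


def align(read, ref):
    """Try the unspliced pass first; only reads that fail it go to the spliced pass."""
    pos = align_unspliced(read, ref)
    if pos is not None:
        return ("unspliced", pos)
    hit = align_spliced(read, ref)
    if hit is not None:
        return ("spliced",) + hit
    return ("unaligned",)


# Toy reference: exon1 + intron + exon2 (made-up sequences).
ref = "ACGTTGCA" + "GTAAAAAAAG" + "CCGGATTC"

print(align("CGTTGC", ref))      # ('unspliced', 1)
print(align("TTGCACCGGA", ref))  # ('spliced', 3, 5, 18)
print(align("GGGGGGGGGG", ref))  # ('unaligned',)
```

Under this model, the second stage can only add alignments to what the first stage found, so the total could never be lower than the first pass alone. That suggests the gap in rates comes from the two runs using different settings or filtering, rather than from the two-stage logic itself.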
Tophat uses bowtie. Tophat2 uses Bowtie2.
The biggest difference between what Tophat2 is doing and what Bowtie2 is doing is that Bowtie2 can do either local or "end-to-end"/global alignment.
By default Bowtie2 uses local alignment, which means only part of the read needs to map. However, when TopHat2 uses Bowtie2, it instructs it to run in end-to-end mode. This means that the whole read needs to map from start to finish.
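To make the distinction between the two modes concrete, here is a toy, ungapped scoring sketch. The function names, the match bonus, the mismatch penalty, and both minimum-score formulas are illustrative assumptions for this example; the actual scoring scheme and thresholds should be checked in the Bowtie2 manual. The read has 40 bases that match the reference followed by 10 adapter-like bases that do not.

```python
import math


def end_to_end_score(read, ref_window, mismatch_penalty=6):
    """Every base counts: matches score 0, each mismatch is penalized."""
    return -sum(mismatch_penalty for r, g in zip(read, ref_window) if r != g)


def local_score(read, ref_window, match_bonus=2, mismatch_penalty=6):
    """Score of the best contiguous stretch; bases outside it are treated as clipped."""
    best = current = 0
    for r, g in zip(read, ref_window):
        step = match_bonus if r == g else -mismatch_penalty
        current = max(0, current + step)
        best = max(best, current)
    return best


ref_window = "ACGT" * 12 + "AC"        # 50 reference bases (made up)
read = ref_window[:40] + "CATGCATGCA"  # 40 matching bases + 10 mismatching bases

length = len(read)
min_end_to_end = -0.6 - 0.6 * length   # assumed linear threshold
min_local = 20 + 8 * math.log(length)  # assumed logarithmic threshold

e2e = end_to_end_score(read, ref_window)
loc = local_score(read, ref_window)
print("end-to-end:", e2e, "passes" if e2e >= min_end_to_end else "fails")
print("local:", loc, "passes" if loc >= min_local else "fails")
```

With these assumed values, the same read fails in end-to-end scoring (score -60, threshold about -30.6) but passes in local scoring (score 80, threshold about 51.3), because the mismatching tail is simply left out of the local alignment. Reads with untrimmed adapters or low-quality ends are the kind of data where the two modes can give very different alignment rates.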
Thank you! I do indeed mean Tophat2. The manual just calls it Tophat (whereas the manual for bowtie2 calls it bowtie2 every time), which is why I was imprecise.
I thought the issue might be local vs. end-to-end; however, the bowtie2 manual says it does end-to-end by default: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#end-to-end-alignment-versus-local-alignment