RNA-Seq has replaced microarrays for many applications in biomarker discovery. Prices have fallen substantially in recent years, the sequence data allows extracting more information than gene expression alone, and no reference genome is required. However, the analysis of the resulting data is much more challenging and requires more resources than other approaches.
So the question is: How to choose between the available analysis tools?
Interesting! Is there a reason your comparison is missing Tophat(2) and Hisat?
Also, I believe one of the ten commandments was "thou shalt not chop the y-axis to visually inflate differences which are actually minor" (referring to the bar chart of on-target hits)... ;-)
We actually chopped the y-axis to visualise these minor differences, since people might be interested in exactly these. I think it is pretty clear that the y-axis is chopped and that the differences are minor. But thanks for highlighting it. We will add an extra note to the plot to ensure the graph is not misleading.
Regarding Tophat(2) and Hisat: there are many other tools that are also missing. This is ongoing work. We encourage all readers to add a particular NGS mapper to this comparison. Please feel free to download the test set, do the analysis, and send us the results, and we will add them to the benchmarking. A detailed explanation of how to run the benchmark can be found here (go to 3. Read aligner comparison).
True. When I just scroll through your post I see a large difference between e.g. bwa-mem and STAR, but indeed when I pay more attention it is clear that the y-axis is chopped. This is just my opinion, but I'd say that plots have to be visually clear immediately. And in that context, the on-target hits plot is maybe not even worth showing, given that the values are all that similar.
I mentioned TopHat because it has been extremely widely used, and while superseded by other aligners is still one of the most used tools in 'high-impact publications'. It's more or less the "standard" go-to tool. We have been pushing back on people using TopHat ad infinitum, but it would be interesting to see if that's justified.
I completely see your point, but sometimes it is of high interest to detect the 2% which cannot be found by some mappers. 2% of some billion reads is actually quite a number, and in these cases I would go for a specialised mapping tool. So I think it is important to highlight these minor differences and not hide them by not chopping the axis. But that is just my opinion, and in 99% of cases I'm absolutely with you: please do not chop the axis.
In addition to Wouter's comment, I find these types of reports misrepresentative of the tools that you are using. Only one sample was used in your report, for example, whereas these tools were designed with different types of input data in mind. I could likely do my own report and flip all of your findings the other way. Nor have you shown the exact parameter configurations of the commands used for each tool. Each has defaults that differ.
As a practical example: I recently saw a report on variant callers from the Broad / Google group. They naturally found that GATK and DeepVariant were 'better' than other variant callers. From my experience, for germline single nucleotide variants, samtools/bcftools mpileup has higher sensitivity/specificity to the gold standard than all others. For indels, this is not the case.
The point is that each tool has its own niche area, some overlapping of course.
A disclaimer: I am not currently, nor have I ever been, involved in any way in the development of any of the listed tools.
Thank you for elaborating on my comments, David.
And we already had the discussion about n=1. I see your point. :) But still, all scripts and calls are given. We really encourage all curious researchers to do the benchmarking themselves, since you really start to understand these mapping tools when you do that. In my experience, most non-bioinformaticians just take a tool that was used in a publication and use it with default parameters. That's in many cases just wrong.
Yes, that is correct, and not a good practice of course. I appreciate your efforts with your report, in this regard. I should add that a good report is one that encourages discussion / debate :)
I think there is no perfect benchmark possible. And all tools mentioned are really good. We use all of them during courses and/or in-house.
And, of course, you’re absolutely right: the discussions here are awesome and always very fruitful. But that’s due to the fact that this forum is full of highly intelligent people and real experts in this field. That’s why I spend quite some time here. You always learn something new. Even when you thought you knew everything in a specific field, there are people to show you better ways to do something.
Perhaps I am blind but I am not sure where these are on the blog page?
If that is the case what is the reason for publishing this comparison?
You have to follow the links. :) If you click on the 'here' of 'Please find more information in the benchmark details here.', you will be forwarded to another page, and at the bottom of that page is a link to the proceedings. Everything is there: scripts, calls, datasets, etc.
Sometimes you need a tool that is very sensitive, sometimes you need a tool that is very fast, sometimes you need a tool that is very memory efficient ... there might be some situations where you might actually want to use the results of such a benchmark to make a decision.
At least, it makes some people think about the differences of mappers. I think it is an interesting benchmark. Is it a perfect benchmark? No.
If I follow the links, the page I reach is this. One link I see on there is for a paper from 2014. The other takes me to this page, which does not appear to contain information about the current study published here (I am only seeing STAR, blat and segemehl command lines). Would you mind providing a direct link if I am overlooking it?

A benchmark serves as a standard by which other things are measured. None of the tools here can be considered a benchmark, so I feel that, while very valuable, this is just a comparison of aligners. As you mention above, people will need to make a choice based on their specific use case and requirements.

One point that is missing from your comparison is information on the ability of the tools to run on multiple platforms. BBMap, for example, is able to run on Windows, macOS and unix as long as Java is available. How many of the other tools are able to do that? Also, these tools are written in various languages and use different methodologies, and as a result they are going to have inherent differences in their resource use/execution times.
Disclaimer: I am not a tool developer and am not associated with development of tools mentioned. I am a trainer (just like you) and OS/program agnostic as far as practical. I do use a lot of BBTools.
The point about the different languages and platforms is a good one. I always assume everyone is using Linux, but obviously for quite a number of users a Windows solution is of interest. We should add this information to our next update.
And also the exact calls of the tools we compared. It should be directly on the page. It’s on the todo list. Thanks for the input.
Just another benchmark to add
Baruzzo G, Hayer KE, Kim EJ, Di Camillo B, FitzGerald GA, Grant GR. (2016) Simulation-based comprehensive benchmarking of RNA-seq aligners. Nat Methods.
STAR looks similar and Tophat2 not so good. Now I'm curious about Tophat2, too. :)
The authors of TopHat have begged people to stop using it. It's a very outdated tool.
I have definitely seen this before, and I feel kind of bad always tagging a comment to this (but I think the counter-point is kind of important).
With all due respect, I strongly disagree about the conclusion about not using TopHat. In terms of the general idea, I think you need to plan to have at least some testing for each project (most likely, requiring a substantial amount of time for analysis and discussion - and, hopefully, discovering something new and unexpected). However, in terms of the less controversial topic of how TopHat can be useful:
I know of at least one study (currently unpublished, as far as I know), where I needed to use TopHat to recover a known splicing event
I've had situations where the more conservative alignment and/or unaligned reads from TopHat were useful
For gene expression, if I see something potentially strange with a TopHat alignment, I have yet to find an example where the trend was noticeably different (there probably are some examples for some genes, but I would guess the gene expression quantification with htseq-count is usually similar for TopHat and STAR single-end alignments for most genes)
(please note that I copied this from A: Tophat problem - R1 and R2 reads mapped to different strandeds of viral genome )
You might also want to check out Tophat2/Bowtie failing to align some examples of partial intron retention / exon loss
To be fair, when I was in Michigan, I essentially recommended people use minfi over COHCAP. However, I think I was being unfairly hard on myself (although there are a lot of post-2016 updates, from after I returned to City of Hope). While I am not necessarily saying this is unrepresentative of Lior's current opinion, it is important to keep in mind that opinions can change over time (and it is important to take time to develop your own opinion, especially in the context of your own dataset).
Also, I think an important positive note is that you can still have some amount of support for popular open-source software through discussions like this forum. In other words, even if the software is not officially supported, it can still be very useful (and I would have a problem if this software were no longer made freely available to download, even without direct support from the developer).
Each tool will definitely have situations in which one is a better choice than the others, but I don't think most users make this comparison. At the same time it is pointless to do this comparison over and over again while comprehensive benchmarks have been published.
What we generally try to push back against is new users blindly picking Tophat for their project, only because it has so many citations and is part of so many tutorials.
If people encounter problems, it is very important to know that other methods are available for troubleshooting.
Not spending enough time on a project can be a serious issue, and I believe people need to gradually expand their skill sets (and plan for a substantial amount of time for analysis of high-throughput data).
So, I think performing some of your own benchmarks is important to help gain an intuition about the analysis (and what results are most robust for your particular dataset - or, if there are differences, trying to understand the underlying cause of those differences).
In other words, in the most polite way possible, I disagree with the statement "it is pointless to do this comparison over and over again while comprehensive benchmarks have been published," but I think I otherwise agree with you (in saying you shouldn't immediately use the first result you get in a publication).
I think there is a difference between using TopHat to detect some very special splice sites in very special cases (and probably losing a lot of others) and TopHat still being one of the most cited tools (knowing that all the people using it will most likely lose a lot of splice sites).
I believe this paper (or at least some of its figures) was discussed during an IGC Bioinformatics meeting.
Figure 3 is for the "T3" sequences with more mutations - the T1 and T2 are probably more typical for your experiment.
While I have seen some situations where more divergent sequence aligned with STAR but not TopHat, that could be either a "bug" or a "feature" depending upon your project. For example, if that divergent sequence is from a mouse xenograft that is not 100% pure (or some other unintended sequence), perhaps noticing the different mutation/splicing results could help you see the need to revise your analysis strategy (such as using a joint genome reference, where the alignment at that locus actually was more similar to the TopHat alignment, or to the desired species). On the flip side, if you were using a well annotated genome for analysis in another species, perhaps that would be a disadvantage.
In terms of gene expression (and overall alignment), I think Supplemental Figure 4 T1/T2 matches my experience that the TopHat alignment is usually OK. However, it's also possible that something else about the simulated data may be affecting performance.
Either way, it is extremely important to take your time and review your data carefully before publication (with several iterations of analysis and discussion). I would say this should involve some testing of different methods for each project. However, understanding the concept that the TopHat alignments may be less mutation tolerant (with default parameters) is worth considering.
P.S. If you happened to look at this before my last update, I did change the Supplemental Figure that I link to (Supplemental Figure 11 looks similar, but represents something different).
Sorry, but I don't understand your arguments. The main difference for well established genomes and transcriptomes would be speed, and Tophat2 is so, so slow. As for the other things you mentioned:
1) In STAR, you can provide a GTF file when generating the reference, so if you really want that known splice site to be there, it will be (see the sketch after this list).
2) Having a more conservative (which is not a clearly defined term) alignment is a matter of adjusting options, not changing aligners. And both hisat2 and STAR can save un-aligned reads separately.
3) Seeing something "strange" is really not a good way to assess an alignment program in general. Simulated datasets are probably the most consistent and reproducible way. And htseq-count should also not be used; it doesn't count multimapping reads, and is also very slow. If you insist on simple quantification, featureCounts would give the same result a lot faster.
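For points 1) to 3), a minimal sketch of what this could look like with STAR and featureCounts (genome.fa, genes.gtf and sample.fastq.gz are placeholder file names, not from this thread; exact parameter values depend on your read length and data):

# 1) build the STAR index with annotated splice junctions taken from a GTF
STAR --runMode genomeGenerate --genomeDir star_index \
     --genomeFastaFiles genome.fa --sjdbGTFfile genes.gtf --sjdbOverhang 99

# 2) align, writing unaligned reads to separate FASTQ files
STAR --genomeDir star_index --readFilesIn sample.fastq.gz --readFilesCommand zcat \
     --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx \
     --outFileNamePrefix sample_

# 3) count reads per gene with featureCounts (by default only uniquely mapped reads are assigned)
featureCounts -a genes.gtf -o counts.txt sample_Aligned.sortedByCoord.out.bam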
With 50 bp SE reads, I don't think the speed difference is so bad (at least if limited to 4 cores and 8 GB for each method). However, I think speed is more of an issue with PE reads.
1) In this situation, there was an event with an exogenous sequence, not in the reference genome. It seems like you could potentially still see the difference in frequencies with the difference in annotated frequencies (which is essentially what I did, but I believe there was some sort of issue with that).
2) I think there is value in having multiple open-source aligners (and, if TopHat2 was free and STAR/HISAT cost money, I would say TopHat2 is good enough to start with for many projects - although I am very thankful that is not the case). I suspect there are situations where you would not get completely identical results even with different parameters for different programs, but you have a good point about taking time to become more familiar with the parameters available for the different programs.
3a) I think one issue about the simulations is whether they are realistic for what is typically encountered in actual samples. Also, I suspect no one strategy for simulating data will work equally well for all projects.
3b) I seem to remember there was something a little different about PE versus SE reads (for htseq-count versus featureCounts, although maybe I am thinking about exonic rather than gene quantifications). For SE reads, you can use featureCounts to get unique-read counts faster (however, if you are quantifying samples in parallel, I don't think this is a huge issue).
3c) I would typically choose htseq-count (or the most similar set of parameters for featureCounts) with the intention of focusing on the unique reads. While I do consider transcriptome quantification a troubleshooting option, I think having replicates (which I would always recommend) can be useful if transcripts with less accurate assignments of multi-mapped reads show more variability between replicates. So, if the multi-mapped read assignments are less robust, presumably the p-values will be higher; I would argue that is good for helping filter some false positives at either the gene or transcript level, but it would also be an example where it may be better to use unique reads for a gene quantification.
I think many people (including myself, until I read it more carefully) may be misinterpreting that tweet:
Strictly speaking, Lior is talking about TopHat1 (he is even saying TopHat2 is greatly improved over TopHat1).
While I still think there is value in having the code for previous versions of TopHat (such as TopHat1), the tweet is not exactly part of the TopHat2 versus STAR versus HISAT/HISAT2 debate.
I have tried to collect my thoughts in the matter here:
http://cdwscience.blogspot.com/2019/01/tophat-really-isnt-that-bad.html
Hi All,
I'm not sure if this is an appropriate post on here (I'm new).
I was wondering if anyone has experience using subread for alignments, and if so, would you be able to help me out on its implementation. I keep getting the error 'Unable to open index '/path/to/index/'
But I got no errors when making the index, so I'm unsure.
Thanks! Amy
You have to provide an actual path on your own system where the index is located. /path/to/system is just a placeholder that does not refer to a real PATH.

Hi, thanks for your reply. I meant this is the code I entered (please excuse my messy files):
and this is the error message I got back:
Thanks! Amy
It appears that you are missing a leading / in your index path specification. Should it not be /home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_index ?

ah yes! Such a simple error, thank you!
However, now it says 'Segmentation fault (core dumped)', which I assume is a memory error? But I checked the system monitor and I'm only using 20%. Thanks for your help, I really appreciate it!
subread is an aligner with one of the smallest memory requirements, so that is surprising. Do you have 8 GB (or less) RAM?

Hi, I'm not sure; when I go to settings it says: Memory 5.7 GiB, Disk capacity 256.1 GB. Although, I did dual boot the laptop so it has both Windows and Ubuntu. So that might not be helping? Thanks!
What is the size of the genome/reference you are working with? Dual booting should be ok but if you have only 8GB of RAM (which seems to be the case) then I am afraid you may be out of luck.
It's around 2,453 bp (it's the coding sequence for https://www.ncbi.nlm.nih.gov/nuccore/NM_000359.3).
I think my laptop may only be 8GB of RAM as I think its this one: https://www.laptopsdirect.co.uk/lenovo-v15-ada-amd-ryzen-3-3250u-8gb-256gb-ssd-15.6-inch-fhd-windows-10-lap-82c70004uk/version.asp
Sorry for all this information, I am new to all of this and don't have much help on my end.
Do you think there's an alternative or would I have to use a web based aligner?
Thanks! Amy
That is a tiny reference, so this should work on your laptop. Perhaps you did not make the index successfully. What is the output of
ls -lh /home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_index
?

Hi!
The return for that is:
The process I did to get to this point was:
to make the Index:
Thanks for your help! Amy
As you can see the index file is actually empty i.e. zero bytes. That means your index prep did not work at all.
Looking at the subread manual, you need to create indexes by providing fasta format files to the index command. Is /home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_ref_sequence actually a fasta format sequence file? You may have to explicitly name the file TGM1_ref_sequence.fa.
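A minimal sketch of what that index-building step could look like, assuming the reference really is a FASTA file renamed to TGM1_ref_sequence.fa (the index basename below is just an example):

# build the index; -o sets the index basename, not a folder
subread-buildindex -o /home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_index/TGM1_index \
    /home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_ref_sequence.fa

The same basename given to -o is what the -i option of subread-align should point to.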
Hi, I made that change but it still doesn't seem to work. The command does seem to create 5 file types, but they end up in the fasta_reference_TGM1_folder. If I manually move these files into the TGM1_index folder and then run:
subread-align -T 1 -i /home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_index/TGM1_index -t 1 -r /home/amyhouseman/Desktop/katja_fasta_seq/fasta_trimmed_sequences/fasta_trimmed_1_all/fasta_trimmed_1_final.fasta -o my_results.sam
files called 'my_results.sam' and 'my_results.sam.indel.vcf' are made, but the 'my_results.sam' file shows up as a Lotus AmiPro document?
I'm not entirely sure what all this means, and also the index files still don't go into the correct place in the first place?
Thanks! Amy
If subread did its work properly, then this should be the correct file containing read alignments. Your computer may be mis-identifying it as the above type since the .sam extension is assigned to that program. What does
head -10 my_results.sam
show? Is this RNAseq data? What are you trying to do?

I have some DNA nucleotide sequences from exons in the TGM1 gene of people with disease, so I'm trying to align them to the reference coding sequence of TGM1, and then hopefully find the variants and then use another tool to find the pathogenicity of each variant? If that makes sense.
Do you know of a way to open the my_results.sam file?
my_results.sam should be a text file and can be opened in an appropriate editor.

From this point onwards (assuming your alignments worked well) you will need to use samtools to convert this SAM file to BAM, then sort and index the file. Then you will need to use something like bcftools to call variants (a minimal sketch follows below).

What OS are you doing this on? At some point you will need to do this on Linux, since some of these programs are not available on Windows.
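As a rough sketch of those next steps (assuming my_results.sam and the TGM1_ref_sequence.fa reference from above; filter settings are deliberately left at defaults):

# convert the SAM to a coordinate-sorted BAM and index it
samtools sort -o my_results.sorted.bam my_results.sam
samtools index my_results.sorted.bam

# call variants against the same reference used for alignment
bcftools mpileup -f /home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_ref_sequence.fa \
    my_results.sorted.bam | bcftools call -mv -o my_results_variants.vcf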
If this data was from a whole genome, then aligning to just TGM1 was not appropriate. Using a reduced reference always leads to the possibility of errant alignments.
Since this discussion is now unrelated to original question please post a new question in a new thread if you have any follow-up.