Forum:Best RNA-Seq aligner: A comparison of mapping tools
1
9
Entering edit mode
5.8 years ago

RNA-Seq has replaced microarrays for many applications in the area of biomarker discovery. The prices have been fallen substantially in recent years. The sequence data allows to extract more information than gene expression only. And there is no requirement that a reference genome must exist. However, the analysis of the resulting data is much more challenging and requires more ressources than other approaches.

So the question is: How to choose between the available analysis tools?

read more ...

enter image description here

NGS alignment genome RNA-Seq sequence • 17k views
ADD COMMENT
3
Entering edit mode

Interesting! Is there a reason your comparison is missing Tophat(2) and Hisat?

Also, I believe one of the ten commandments was "thou shalt not chop the y-axis to visually inflate differences which are actually minor" (referencing to bar chart of on-target hits)... ;-)

ADD REPLY
3
Entering edit mode

We actually chopped the y-axis to visualise these minor differences, since people might be interested in exactly these. I think it is pretty clear that the y-axis is chopped and that the differences are minor. But thanks for highlighting it. We will add an extra note to the plot, to assure that the graph is not misleading.

Regarding Tophat(2) and Hisat: There are many other tools that are also missing. This is on-going work. We encourage all readers to add a particular NGS mapper to this comparison. Please feel free to download the test set, do the analysis, send us the results and we will add it to the benchmarking. A detailed explanation on how to run the benchmark can be found here (goto 3. Read aligner comparison)

ADD REPLY
4
Entering edit mode

True. When I just scroll through your post I see a large difference between e.g. bwa-mem and STAR, but indeed when I spend more attention it is clear that the y-axis is chopped. This is just my opinion, but I'd say that plots have to be visually clear immediately. And in that context, the on-target hits is maybe not even worth showing given that they're all that similar.

I mentioned TopHat because it has been extremely widely used, and while superseded by other aligners is still one of the most used tools in 'high-impact publications'. It's more or less the "standard" go-to tool. We have been pushing back on people using TopHat ad infinitum, but it would be interesting to see if that's justified.

ADD REPLY
2
Entering edit mode

I completely see your point, but sometimes it is of high interest to detect the 2% which can not be found by some mappers. 2% of some billion reads is actually quite a number. And in these cases, I would go for a special mapping tool? So, I think it is important to highlight these minor differences and not hide them by not chopping the axis. But that is just my opinion and I'm in 99% of cases absolutely with you.... please do not chop the axis.

ADD REPLY
3
Entering edit mode

In addition to Wouter's comment, I find these types of reports misrepresentative of the tools that you are using. Only 1 sample was used in your report, for example, whereas these tools were designed using different types of input data. I could likely do my own report and flip all of your findings the other way. You have neither shown the exact parameter configurations of your commands used for each tool. Each has defaults that differ.

As a practical example: I recently saw a report on variant callers from the Broad / Google group. They naturally found that GATK and DeepVariant were 'better' than other variant callers. From my experience, for germline single nucleotide variants, samtools/bcftools mpileup has higher sensitivity/specificity to the gold standard than all others. For indels, this is not the case.

The point is that each tool has its own niche area, some overlapping of course.

A disclaimer: I am neither currently nor never have been involved in any way in the development of any of these listed tools

ADD REPLY
2
Entering edit mode
  1. If you follow the links, you find the exact parameters for every single tool.
  2. The question we asked is pretty straight forward: 'If a full sensitive mapper finds a hit for a read, is a mapping tool able to locate a hit at the same position.' Regardless of the design of the tool, it should be able to find it, no?
  3. Feel free to redo the analysis and show with your own dataset that the graph look highly different. I'm curious actually.
ADD REPLY
3
Entering edit mode

Thank you for elaborating on my comments, David.

ADD REPLY
2
Entering edit mode

And the discussion with n=1 we already had. I see your point. :) But still, all scripts and calls are given. We really really encourage all curious researchers to do the benchmarking themselves, since you really start to understand these mapping tools, when you do that. Based on experience, most non-bioinformaticians just take a tool that was used in a publication and use it with default parameters. That's in many cases just wrong.

ADD REPLY
3
Entering edit mode

most non-bioinformaticians just take a tool that was used in a publication and use it with default parameters

Yes, that is correct, and not a good practice of course. I appreciate your efforts with your report, in this regard. I should add that a good report is one that encourages discussion / debate :)

ADD REPLY
2
Entering edit mode

I think there is no perfect benchmark possible. And all tools mentioned are really good. We use all of them during courses and/or in-house.

And, of course, you’re absolutely right: the discussions here are awesome and always very fruitful. But that’s due to the fact that this forum is full of highly intelligent people and real experts in this field. That’s why I spend quite some time here. You always learn something new. Even when you thought you know everything in a specific field, there are people to show you better ways to do something.

ADD REPLY
0
Entering edit mode

If you follow the links, you find the exact parameters for every single tool.

Perhaps I am blind but I am not sure where these are on the blog page?

And all tools mentioned are really good.

If that is the case what is the reason for publishing this comparison?

ADD REPLY
3
Entering edit mode
  1. You have to follow the links. :) If you click on the ‚here‘ of ‘Please find more information in the benchmark details here.’, you will be forwarded to another page and at the bottom of this is a link to the proceedings. There is everything then: scripts, calls, datasets, etc.

  2. Sometimes you need a tool that is very sensitive, sometimes you need a tool that is very fast, sometimes you need a tool that is very memory efficient ... there might be some situations where you might actually want to use the results of such a benchmark to make a decision.

At least, it makes some people think about the differences of mappers. I think it is an interesting benchmark. Is it a perfect benchmark? No.

ADD REPLY
0
Entering edit mode

If I follow the links the page I reach is this. One link I see on there is for a paper from 2014. Other takes me to this page, which does not appear to contain information about the current study published here (I am only seeing STAR, blat and segmehl command lines). Would you mind providing a direct link if I am overlooking it?

Benchmark serves as a standard by which other things are measured. None of the tools here can be considered a benchmark. So I feel that while very valuable this is just a comparison of aligners. As you mention above people will need to make a choice based on their use specific case and requirements.

One point that is missing from your comparison is information on ability of tools to run on multiple platforms. BBMap for example is able to run on Windows, macOS and unix as long as Java is available. How many of the other tools are able to do that? Also these tools are written various languages, use different methodologies and as a result they are going to have inherent differences in their resource use/execution times.

Disclaimer: I am not a tool developer and am not associated with development of tools mentioned. I am a trainer (just like you) and OS/program agnostic as far as practical. I do use a lot of BBTools.

ADD REPLY
2
Entering edit mode

The point with the different languages and platforms is a good one. I always think everyone is using Linux, but obviously for quite a number a Windows solution is of interest. We should add this information to our next update.

And also the exact calls of the tools we compared. It should be directly on the page. It’s on the todo list. Thanks for the input.

ADD REPLY
3
Entering edit mode

Just another benchmark to add

enter image description here

Baruzzo G, Hayer KE, Kim EJ, Di Camillo B, FitzGerald GA, Grant GR. (2016) Simulation-based comprehensive benchmarking of RNA-seq aligners. Nat Methods.

ADD REPLY
0
Entering edit mode

STAR looks similar and Tophat2 not so good. Now I'm curious about Tophat2, too. :)

ADD REPLY
3
Entering edit mode

Authors of Tophat begged to stop using it. It's a very outdated tool.

ADD REPLY
1
Entering edit mode

I have definitely seen this before, and I feel kind of bad always tagging a comment to this (but I think the counter-point is kind of important).

With all due respect, I strongly disagree about the conclusion about not using TopHat. In terms of the general idea, I think you need to plan to have at least some testing for each project (most likely, requiring a substantial amount of time for analysis and discussion - and, hopefully, discovering something new and unexpected). However, in terms of the less controversial topic of how TopHat can be useful:

  1. I know of at least one study (currently unpublished, as far as I know), where I needed to use TopHat to recover a known splicing event

  2. I've had situations where the more conservative alignment and/or unaligned reads from TopHat were useful

  3. For gene expression, if I see something potentially strange with a TopHat alignment, I have yet to find an example where the trend was noticeably different (there probably are some examples for some genes, but I would guess the gene expression quantification with htseq-count are usually similar for TopHat and STAR single-end alignments for most genes)

(please note that I copied this from A: Tophat problem - R1 and R2 reads mapped to different strandeds of viral genome )

You might also want to check out Tophat2/Bowtie failing to align some examples of partial intron retention / exon loss

To be fair, when I was in Michigan, I essentially recommended people to use minfi over COHCAP. However, I think I was being unfairly hard on myself (although there are a lot of post-2016 updates, when I returned to City of Hope). While I am not necessarily saying this is not representative of Lior's current opinion, it is important to keep in mind that opinions can change over time (and it is important to take time to develop your own opinion, especially in the context of your own dataset).

Also, I think an important positive note is that you can still have some amount of support for popular open-source software through discussions like this forum. In other words, even if the software is not officially supported, it can still be very useful (and I would have a problem if this software was not continued to be made freely available to download, even without direct support from the developer).

ADD REPLY
2
Entering edit mode

Each tool will definitely have situations in which one is a better choice than the others, but I don't think most users make this comparison. At the same time it is pointless to do this comparison over and over again while comprehensive benchmarks have been published.

What we generally try to push back against is new users blindly picking Tophat for their project, only because it has so many citations and is part of so many tutorials.

ADD REPLY
0
Entering edit mode

If people encounter problems, it is very important to know that other methods are available for troubleshooting.

Not spending enough time on a project can be a serious issue, and I believe people need to gradually expand their skill sets (and plan for a substantial amount of time for analysis of high-throughput data).

So, I think performing some of your own benchmarks is important to help gain an intuition about the analysis (and what results are most robust for your particular dataset - or, if there are differences, trying to understand the underlying cause of those differences).

In other words, in the most polite way possible, I disagree with the statement "it is pointless to do this comparison over and over again while comprehensive benchmarks have been published," but I think I otherwise agree with you (in saying you shouldn't immediately use the first result you get in a publication).

ADD REPLY
2
Entering edit mode

I think there is a difference between using tophat to detect some very special splice sites in very special cases (and probably loosing a lot of others) and tophat being still one of the most cited tool (with knowing that all these people using it will most likely loose a lot of splice sites).

ADD REPLY
0
Entering edit mode

I believe this paper (or at least some figures) were discussed using an IGC Bioinformatics meeting.

Figure 3 is for the "T3" sequences with more mutations - the T1 and T2 are probably more typical for your experiment.

While I have seen some situations where more divergent sequence aligned with STAR but not TopHat, that could be either a "bug" or a "feature" depending upon your project. For example, if that divergent sequence is from a not 100% pure mouse Xenograft (or some other unintended sequence), perhaps noticing the different mutation / splicing results could help you see the need to revise your analysis strategy (such as having a joint genome reference, where at the locus where the different alignments actually was more similar to the TopHat alignment, or the desired species). On the flip side, if you were using a well annotated genome for analysis in another species, perhaps that would be a disadvantage.

In terms of gene expression (and overall alignment), I think Supplemental Figure 4 T1/T2 matches my experience that the TopHat alignment is usually OK. However, it's also possible that something else about the simulated data may be affecting performance.

Either way, it is extremely important to take your time and review your data carefully before publication (with several iterations of analysis and discussion). I would say this should involve some testing of different methods for each project. However, understanding the concept that the TopHat alignments may be less mutation tolerant (with default parameters) is worth considering.

P.S. If you happened to look at this before my last update, I did change the Supplemental Figure that I link to (Supplemental Figure 11 looks similar, but represents something different).

ADD REPLY
1
Entering edit mode

Sorry, but I don't understand your arguments.. The main difference for well established genomes and transcriptomes would be speed. And Tophat2 is so, so slow. As far as other things you mentioned,

1) In STAR, you can provide GTF file when generating reference, so when if you really want that known splice site to be there, it will be.

2) Having a more conservative (which is not a clearly defined term) alignment is a matter of adjusting options, not changing aligners. And both hisat2 and STAR can save un-aligned reads separately.

3) Seeing something "strange" is really not a good way to assess an alignment program in general. Simulated datasets are probably the most consistent and reproducible way. And htseq-count should also not be used, it doesn't count multimapping reads, and is also very slow. If you insist on simple quantification, featureCounts would give the same result a lot faster.

ADD REPLY
0
Entering edit mode

With 50 bp SE reads, I don't think the speed difference isn't so bad (at least if limited to 4 cores and 8 GB for each method). However, I think speed is more of an issue with PE reads.

1) In this situation, there was an event with an exogenous sequence, not in the reference genome. It seems like you could potentially still see the difference in frequencies with the difference in annotated frequencies (which is essentially what I did, but I believe there was some sort of issue with that).

2) I think there is value in having multiple open-source aligners (and, if TopHat2 was free, and STAR/HISAT cost money, I would say TopHat2 is good enough to start with for many projects - although I am very thankful that is not the case). I suspect that there situations where you would not get completely identical results that can be achieved with different parameters for different programs, but you have a good point about taking time to become more familiar with becoming more familiar with the parameters available for the different programs.

3a) I think one issue about the simulations is whether they are realistic for what is typically encountered in actual samples. Also, I suspect no one strategy for simulating data will work equally well for all projects.

3b) I seem to remember there was something a little different about PE versus SE in reads (for htseq-count versus featureCounts, although maybe I am thinking about exonic rather than gene quantifications). For SE reads, you can use featureCounts to get faster unique reads (however, if you are quantifying samples in parallel, I don't think this is a huge issue).

3c) I would typically be choosing to use htseq-count (or the most similar set of parameters for featureCounts) with the intention of focusing on the unique reads. While I do consider use of a transcriptome quantification as a troubleshooting option, I think having replicates (which I would always recommend) can be useful if transcripts with less accurate assignments of multi-mapped reads show more variability between replicates (so, if the multi-mapped read assignments are less robust, presumably the p-values will be higher; I would argue that is good for helping filter some false positives at either the gene or transcript level, but that would also be an example where it may be better to use unique reads for a gene quantification).

ADD REPLY
1
Entering edit mode

I think many people (including myself, until I read it more carefully) may be misinterpreting that tweet:

Strictly speaking, Lior is talking about TopHat1 (he is even saying TopHat2 is greatly improved over TopHat1).

While I still think there is value in having the code for previous versions of TopHat (such as TopHat1), the tweet is not exactly part of the TopHat2 versus STAR versus HISAT/HISAT2 debate.

I have tried to collect my thoughts in the matter here:

http://cdwscience.blogspot.com/2019/01/tophat-really-isnt-that-bad.html

ADD REPLY
0
Entering edit mode

Hi All,

I'm not sure if this is an appropriate post on here (I'm new).

I was wondering if anyone has experience using subread for alignments, and if so, would you be able to help me out on its implementation. I keep getting the error 'Unable to open index '/path/to/index/'

But I got no errors when making the index, so I'm unsure.

Thanks! Amy

ADD REPLY
0
Entering edit mode

I keep getting the error 'Unable to open index '/path/to/index/'

You have to provide an actual path on your own system where the index is located. /path/to/system is just a place holder that does not refer to a real PATH.

ADD REPLY
0
Entering edit mode

Hi, thanks for your reply. I meant this is the code I entered (please excuse my messy files):

subread-align -i home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_index -r /home/amyhouseman/Desktop/katja_fasta_seq/fasta_trimmed_sequences/fasta_trimmed_1_all/fasta_trimmed_1_final -t 1 -o /home/amyhouseman/Desktop/subread/subreadresults.bam

and this is the error message I got back:

Unable top open index 'home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_index'. Please make sure that the correct prefix is specified and you have the permission to read these files. For example, if there are files '/opt/my_index.reads', '/opt/my_index.files' and etc, the index prefix should be specified as '/opt/my_index' without any suffix.

Thanks! Amy

ADD REPLY
0
Entering edit mode

It appears that you are missing a leading / in your index path specification. Should it not be /home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_index?

ADD REPLY
0
Entering edit mode

ah yes! Such a simple error, thank you!

However, now it says 'Segmentation fault (core dumped)' - which I assume is a memory error? But I checked the system monitor and I'm only using 20% - Thanks for your help, I really appreciate it

ADD REPLY
0
Entering edit mode

subread is an aligner with one of smallest memory requirements so that is surprising. Do you have 8 GB (or less) RAM?

ADD REPLY
0
Entering edit mode

Hi, I'm not sure, when I go to settings it says: Memory 5.7GiB.. Disk capacity 256.1GB. Although, I did dual boot the laptop so it has both windows and Ubuntu. So that might not be helping? Thanks!

ADD REPLY
0
Entering edit mode

What is the size of the genome/reference you are working with? Dual booting should be ok but if you have only 8GB of RAM (which seems to be the case) then I am afraid you may be out of luck.

ADD REPLY
0
Entering edit mode

Its around 2,453bp (its the coding seq for https://www.ncbi.nlm.nih.gov/nuccore/NM_000359.3).

I think my laptop may only be 8GB of RAM as I think its this one: https://www.laptopsdirect.co.uk/lenovo-v15-ada-amd-ryzen-3-3250u-8gb-256gb-ssd-15.6-inch-fhd-windows-10-lap-82c70004uk/version.asp

Sorry for all this information, I am new to all of this and don't have much help on my end.

Do you think there's an alternative or would I have to use a web based aligner?

Thanks! Amy

ADD REPLY
0
Entering edit mode

That is a tiny reference so this should work on your laptop. Perhaps you did not make the index successfully. What is the output of ls -lh /home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_index?

ADD REPLY
0
Entering edit mode

Hi!

The return for that is:

-rw-rw-r-- 1 amyhouseman amyhouseman 0 Feb  9 12:51 /home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_index

The process I did to get to this point was:

  1. Installed subread from the website: http://bioinf.wehi.edu.au/subread-package/
  2. sudo apt install subread
  3. Used tar zxvf subread-2.0.1-source.tar.gz
  4. Moved into the src file.
  5. make -f Makefile.Linux

to make the Index:

subread-buildindex -o /home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_index /home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_ref_sequence

Thanks for your help! Amy

ADD REPLY
0
Entering edit mode

As you can see the index file is actually empty i.e. zero bytes. That means your index prep did not work at all.

Looking at the subread manual you need to create indexes by providing fasta format files to the index command

subread-buildindex -o my_index chr1.fa chr2.fa

Is /home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_ref_sequence actually a fasta format sequence file? You may have to explicitly name the file TGM1_ref_sequence.fa.

ADD REPLY
0
Entering edit mode

Hi, I made that change but it still doesn't seem to work, the command does seem to create 5 file types but they end up in the fasta_reference_TGM1_folder.

If I manually move these files into the TGM1_index file and then run:

subread-align -T 1 -i /home/amyhouseman/Desktop/fasta_reference_TGM1_folder/TGM1_index/TGM1_index -t 1 -r /home/amyhouseman/Desktop/katja_fasta_seq/fasta_trimmed_sequences/fasta_trimmed_1_all/fasta_trimmed_1_final.fasta -o my_results.sam

a file called 'my_results.sam' and 'my_results.sam.indel.vcf' is made, but the 'my_results.sam' file is a Lotus AmiPro document?

I'm not entirely sure what all this means, and also the index still doesn't go into the correct place in the first place?

Thanks! Amy

ADD REPLY
0
Entering edit mode

my_results.sam' file is a Lotus AmiPro document?

If subread did its work properly then this should be the correct file containing read alignments. Your computer may be mis-identifying it as above type since .sam extension is assigned to that type of program.

What does head -10 my_results.sam show? Is this RNAseq data? What are you trying to do?

ADD REPLY
0
Entering edit mode

I have some DNA nucleotide sequences from exons in TGM1 gene of people with disease, so I'm trying to align them to the reference coding sequence of TGM1, and then hopefully find the variants and then use another tool after to find the pathogenicity of each variant? If that makes sense.

Do you know of a way to open the my_results.sam file?

ADD REPLY
0
Entering edit mode

my_results.sam should be a text file and can be opened in an appropriate editor.

This point onwards (assuming your alignments worked well) you will need to use samtools to convert this SAM file to BAM, sort and index the file. Then you will need to use something like bcftools to call variants.

What OS are you doing this on? At some point in time you will need to do this on linux since some of programs are not available on Windows.

If this was data from whole genome then aligning to just TGM1 was not appropriate. Using a reduced reference always leads to possibility of errant alignments.

Since this discussion is now unrelated to original question please post a new question in a new thread if you have any follow-up.

ADD REPLY
3
Entering edit mode
5.8 years ago
Gordon Smyth ★ 7.6k

Subread is another key tool that should at least be included in the comparison, and might well win it. See our recent paper:

Liao et al (2019). The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research https://doi.org/10.1093/nar/gkz114

The comparisons in the article are with TopHat2 and STAR, but our unpublished comparisons show that subread outperforms HISAT2 as well on the same tests.

The article focuses on the R package implementation of subread, but the same performance metrics would hold for the unix command line version.

ADD COMMENT
1
Entering edit mode

I don't think it is quite right to think of there being an overall "winner" - I think each user needs to be able to develop their own opinion (in the context of their own experiment). So, I believe comparisons like this can be helpful, but I would still say there needs to be several rounds of analysis / discussion for each project to critically assess the results (which should probably involve at least some methods testing for every project).

However, thank you for sharing the link, and I certainly think it is helpful to have Rsubread as an option!

ADD REPLY

Login before adding your answer.

Traffic: 2234 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6