Current state-of-the-art for RNA-Seq gene / transcript expression quantification OR just the tools of preference
1
37
Entering edit mode
9.3 years ago
IV ★ 1.3k

As always there are currently more pipelines and tools available for quantification of gene/transcript expression than ever before.

What has changed is that there are many different types of algorithms and strategies that affect how data are aligned (alignment / pseudoalignment / lightweight alignment, etc) and analyzed (results compatible or not with tools for general use).

For instance, we have numerous crossroads, with significant impact:

A. Gene expression: Gene counts vs Transcript-level quantification (and summation for each gene)

B. Transcripts: De novo vs reference transcriptome..

C. Alignment vs the new and super-fast kmer approaches (Sailfish, Kallisto, RNA-Skim, Salmon)

and so on..

The available tools and modes for each tool are many:

  1. Count-based methods (e.g. HT-Seq): counts over gene region / counts spanning only exons in gene region / counts over meta-transcripts, etc...
  2. Transcript Quantification methods: bitseq, cufflinks, express, isoem, rsem, etc.
  3. Kmer based approaches (such as those mentioned in (C) of the previous list)

........

N. Hybrid approaches (e.g. use of salmon with aligned reads)

Of course, then there is the issue of how are the results for DE analysis used (downstream): incorporation in a "standard" pipeline (e.g. DESeq, edgeR, limma) or in the accompanying tool (if available): e.g sleuth for kallisto or bitseq's own DE pipeline and so on.

Since I'm currently using (mostly for keeping up to date) many of these tools and combinations, I was wondering what are the weapons of choice for fellow biostars now?

What is currently your favorite combination for fundamental tasks, including:

  • Gene expression quantification for the identification of DE genes
  • Transcript quantification for DE transcripts
  • Source of Reference Transcriptome (which of course affects everything) //

Of course, (as I do), many people have more than one combination to suit different needs (highest accuracy, speed, poorly annotated species, etc).

As a reference, for interested colleagues there are great relevant biostar discussions, such as:

and publications, including the recent:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4511015/

And of course the most updated comparisons are in the most recent publications including:

kallisto salmon bitseq RNA-Seq cufflinks • 22k views
ADD COMMENT
18
Entering edit mode
9.3 years ago
  • We generally don't care about transcript-level changes, since they're rarely biologically interpretable (IME, most groups don't want to do the experiments to make them so...so there's no point bothering with the analysis). So it's featureCounts all the way (htseq-count is too slow).
  • For aligners we're mostly using HISAT internally. I had previously used STAR, but HISAT requires a lot less memory and its indices are compatible with bowtie2 (and way smaller, which matters when you make/host the indices used by a whole institute). This isn't really ideal if you want to find novel splice junctions, since it doesn't do 2 passes (like tophat2 or STAR if you tell it to). I hear there's a tophat3 that's supposed to come out at some point which will presumably be the HISAT version of tophat2. I suspect we'll move to that for standard RNAseq alignment then.
  • We have a few projects currently using Salmon and/or Kallisto (this is recent). These are somewhat non-traditional uses of these tools (in one case we're really pushing the bounds of what's reasonable due to having >100 effective isoforms of one gene), but they're working very well so far.
  • We occasionally need to find novel isoforms/genes. We're currently using stringtie for that. It's apparently working well enough (I'm not the one using it).
  • For reference sources, we use Ensembl/Gencode.
  • In a few years I suspect we'll have switched to something like Salmon exclusively, but I'm not comfortable with such a general switch yet.
  • You didn't explicitly ask, but we use DESeq2 in our standard pipeline. The projects using Salmon/Kallisto aren't standard RNAseq, so we end up not having a nice pre-existing R package (though everything is doable in R).
ADD COMMENT
0
Entering edit mode

Really consise response! It's close to what we use. It provides me the opportunity to share some of the topics that consern me the most these days:

Gene Expression: I've started substituting our gene read counting pipeline (HTSeq or featureCounts) with tools that focus on transcripts (isoEM,rsem), since in various publications/simulations we see that the results are better even for simple gene expression estimation.

Aligners: I'm also really interested in HISAT (HISAT2 that is / which one do you currently use?) but the manuscript selected reads with very few mismatches (I think max 3 in each read / which can be a bit low for reads 100nts and up). So I thought to wait for a few more simulations or tophat3.

Kallisto/Salmon: How was your experience with those until now?

DiffExp: There was this big discussion whether use of neg.binomial diffexp algorithms is suitable for tools such as kallisto/salmon and since there is sleuth available, at least for kallisto you're covered..

ADD REPLY
0
Entering edit mode

We'll probably start moving over to HISAT2 next week. I've been dealing with a rather major issue and haven't had time to install it for everyone yet.

Regarding Salmon/Kallisto, we're pretty happy with them so far, but again we're not using these for standard DE stuff, so I can't comment on that.

Yeah, we're normally happy with DESeq2, but Sleuth seems to be the only game in town for normal Kallisto/Salmon style DE analyses.

ADD REPLY
1
Entering edit mode

We have tried HISAT too but we were surprised to find out that for some reason it maps more reads in pseudogene than on non-pseudogene. As far as I understand we were not the only one who have noticed this, see: https://www.twitter.com/notSoJunkDNA/status/577813021176791041

Therefore we are using STAR!

Actually, most of the research groups which I know are very interested in isoforms and they ask for this kind of analysis. For example, as far as I have seen lately all research groups working on prostate cancer ask always what is the expression of AR-V7 (which is an isoform of gene AR) in a given treatment! AR-V7 is even used in clinical trials as a biomarker for predicting the prostate tumor resistance to treatment! Therefore the isoforms/transcripts are now more important than they have ever been!

ADD REPLY
0
Entering edit mode

Does this trend remain with HISAT2? I know that HISAT2 is supposed to contain a number of improvements & bug fixes with respect to the original, and it also has some very interesting new features (e.g. allowing to map directly to the transcriptome annotation using their graph-based alignment framework). I wonder if this trend is a byproduct of the actual approach taken by HISAT, or the result of some bias that may have been fixed in newer versions.

ADD REPLY
1
Entering edit mode

Yes, the same is observed for HISAT2!

ADD REPLY
0
Entering edit mode

Hi Devon & Co. - I just stumbled over the manual for DESeq2 v1.11.23, which mentions the option to use transcript-level abundance information from e.g. kallisto, sailfish or salmon. Curious if anyone has had a chance to try that out and can comment on the performance...? This would be a really nice option (I'm a big "fan" of DESeq2 and its support). Sleuth still seems to be working out some kinks at the moment as well...

ADD REPLY
1
Entering edit mode

Hi jhb80,

Yes, DESeq2 can now use tximport to process transcript-level abundance estimates. This paper demonstrates the benefits of using txp-level estimates to obtain gene-level DE calls using tx-import and Salmon. I've used tx-import and DESeq2 very successfully to this end.

ADD REPLY
0
Entering edit mode

I don't have any personal experience with it, but given the track record of the tximport authors I expect the performance is good.

ADD REPLY
0
Entering edit mode

Hi Devon Could I play Kallisto/Sleuth for gene? I found it is for mapping and quantifying abundances of transcripts. I got the DE transcripts table with Q<0.05 with associated gene name. But as you know, one gene could have multiple transcripts.

the output is sth like

target_id        Q                      ext_gene
ENST00000257570  0                      OASL
ENST00000339275  1.08823325997074e-138  OASL
ENST00000543677  4.50865532164411e-20   OASL

Could I use it for finding DE genes like edgeR/DESeq? how?

ADD REPLY
0
Entering edit mode

There might be a way to use sleuth for gene-level tests, but I'm not familiar enough with it. You might want to have a look at the tximport package (http://f1000research.com/articles/4-1521/v1), which will make your life easier.

ADD REPLY
0
Entering edit mode

Sleuth performs best when used with bootstraps from kallisto. We find significant performance improvements when doing that as opposed to feeding estimated transcript counts into tools such as DESeq2 or edgeR.

ADD REPLY
0
Entering edit mode

Hi Devon I remember I read lots of post talking about we could run Tophat+Cuffdiff (without Cufflinks+Cuffmerge) if we don't want find novel gene (just use reference gonome's GTF file) in analysis. I don't want look at novel gene from merged.gtf, but I want to know if there is big difference between "Tophat+Cufflinks+Cuffmerge+Cuffdiff" and "Tophat+Cuffdiff" . Is that the full steps one is less bias, or actually it doesn't matter between "Tophat+Cufflinks+Cuffmerge+Cuffdiff" and "Tophat+Cuffdiff" ? Thank you!

ADD REPLY
0
Entering edit mode

If not big difference between "Tophat+Cufflinks+Cuffmerge+Cuffdiff" and "Tophat+Cuffdiff" , I 'd like to ignore Cufflinks cause it is more time -consuming. :)

ADD REPLY
0
Entering edit mode

You don't need cufflinks unless you care about novel isoforms.

ADD REPLY
0
Entering edit mode

sleuth does have a way to work at the gene level (https://rawgit.com/pachterlab/sleuth/master/inst/doc/intro.html) furthermore, I used it recently for a few projects filtering only transcript_biotype as protein_coding from biomaRt and it seem to perform really well

ADD REPLY
0
Entering edit mode

Methods like Salmon seem powerful and they overall do a good job, however recently I stumbled upon the (at least Salmon's) behavior on "multi" vs "uniquely" mapping reads (in featureCounts/htseq-count terms, Salmon uses SMEMs/k-mers).

Eventhough there are some parameters to influence salmon like --maxOcc, it is hard to control the exact behavior for multimappings/SMEMs. Especially I was concerned about transcripts/genes that get 0 counts with featureCounts, but high counts/TPM with salmon. As I had further data on histone marks from the same samples, I checked some of these and they showed all signs of silencing marks rather that active marks.

So, to simply treat Salmons estimates as true/absolute expression can be dangerous, most likely due to multi-mappings (of SMEMs), and it is also hard to control that further in Salmon.

This is my experience so far...I have not checked if Kallisto behaves similar and if one can use the bootstrapped abundance estimates to identify transcripts/genes based "only" on multimappings...

I can make another question with some plots if this is of interest...

ADD REPLY
0
Entering edit mode

I would argue that what you're seeing is likely not due to Salmon's mapping-based mode, but rather the result of what you get when you handle multi-mapping reads in a probabilistic inference framework vs. the hard counting that is used by e.g. featureCounts / htseq-counts. Actually, you could check this fairly easily by providing Salmon with a bam file that uses a traditional aligner to map reads to the transcriptome (e.g. Bowtie2 against the transcriptome or STAR against the genome with the appropriate flags to output alignments in transcriptomic coordinates). Though differences are certainly possible, we generally observe only small differences between "alignment-based" inference and inference based on fast mapping techniques.

It is, of course, more difficult to assess the abundances of transcripts that are covered entirely (or almost entirely) by only multi-mapping reads. It's also important, I think, to note that there are likely many cases in the "other direction" (i.e. where a gene or transcript is truly active at non-trivial expression levels, but where counting-based methods miss this since it is driven primarily by multi-mapping reads.

All that being said, I'd be happy to take a look at some of the examples you've come up with, to see if Salmon (and/or other transcript-level abundance estimation methods) are producing estimates different from one would expect given the models.

ADD REPLY
0
Entering edit mode

Hi, It's an old thread but I found a paper talking about the quantification of RNA-seq data: Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data. In its conclution part, it said "Importantly, our analysis indicates that the explicit quantification of transcript isoforms leads to more accurate estimates of gene expression levels compared to the ‘count-based’ methods that are broadly used currently". But they evaluated the count-based method with "bedtools intersect", not htseq-count or featureCounts, so I am a bit skeptical of this conclusion. What do you think about this?

Thanks

ADD REPLY

Login before adding your answer.

Traffic: 2166 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6