There are currently more pipelines and tools available for quantifying gene/transcript expression than ever before.
What has changed is the variety of algorithms and strategies, which affect both how data are aligned (alignment, pseudoalignment, lightweight alignment, etc.) and how the results are analyzed (compatible or not with general-purpose downstream tools).
For instance, there are numerous crossroads, each with significant impact:
A. Gene expression: Gene counts vs Transcript-level quantification (and summation for each gene)
B. Transcripts: de novo assembly vs a reference transcriptome
C. Alignment vs the new and super-fast k-mer approaches (Sailfish, Kallisto, RNA-Skim, Salmon)
and so on.
The available tools and modes for each tool are many:
- Count-based methods (e.g. HTSeq): counts over the gene region / counts spanning only exons in the gene region / counts over meta-transcripts, etc.
- Transcript quantification methods: BitSeq, Cufflinks, eXpress, IsoEM, RSEM, etc.
- Kmer based approaches (such as those mentioned in (C) of the previous list)
- ...
- Hybrid approaches (e.g. using Salmon with already-aligned reads)
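To make the count-based distinctions above concrete, here is a toy Python sketch (hypothetical coordinates and reads, not a real annotation) contrasting counting over the whole gene region with counting restricted to the exon union:

```python
# Toy sketch of two count-based modes:
# (a) count every read overlapping the gene span, vs
# (b) count only reads overlapping the exon union.
# All coordinates and reads are invented for illustration.

gene_span = (100, 1000)
exons = [(100, 250), (600, 800)]          # exon union of the gene

# reads given as (start, end) alignment intervals
reads = [(120, 170), (300, 350), (650, 700), (950, 1000)]

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

span_count = sum(overlaps(r, gene_span) for r in reads)
exon_count = sum(any(overlaps(r, e) for e in exons) for r in reads)

print(span_count)  # 4: all reads fall inside the gene span
print(exon_count)  # 2: the intronic reads at 300-350 and 950-1000 are dropped
```

The two numbers can diverge substantially for genes with long introns or overlapping features, which is one reason the choice of counting mode matters downstream.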
Of course, there is then the issue of how the results are used downstream for DE analysis: incorporated into a "standard" pipeline (e.g. DESeq, edgeR, limma) or into the accompanying tool, if one exists (e.g. sleuth for kallisto, or BitSeq's own DE pipeline), and so on.
Since I'm currently using many of these tools and combinations (mostly to keep up to date), I was wondering: what are the current weapons of choice for fellow Biostars users?
What is currently your favorite combination for fundamental tasks, including:
- Gene expression quantification for the identification of DE genes
- Transcript quantification for DE transcripts
- Source of the reference transcriptome (which of course affects everything)
Of course, many people (myself included) keep more than one combination to suit different needs (highest accuracy, speed, poorly annotated species, etc.).
As a reference for interested colleagues, there are great relevant Biostars discussions, such as:
- Transcript to gene level count for DEseq(2) use- Salmon/Sailfish/Kallisto etc. or
- Can Kallisto be followed by DESeq, EdgeR or Cuffdiff? on whether you can/should use standard DE tools following a kmer expression estimation approach or
- Using rna-seq, why map reads to the genome but calculate gene expression by aligning to the transcriptome? on what is lost when aligning against the transcriptome
and publications, including the recent:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4511015/
And of course, the most up-to-date comparisons are found in the most recent publications, including:
- Sailfish: http://www.nature.com/nbt/journal/v32/n5/full/nbt.2862.html
- RNA-Skim: http://bioinformatics.oxfordjournals.org/content/30/12/i283.full
- Kallisto: http://arxiv.org/abs/1505.02710v2 [v2. pre-submission]
- Salmon: http://biorxiv.org/content/early/2015/06/27/021592 [pre-submission]
- BitSeq: I'll throw in the comparisons performed in the upcoming BitSeq paper (RSEM, BitSeq, eXpress, kallisto, casper, Sailfish, Cufflinks and TIGAR2) http://arxiv.org/pdf/1412.5995v3.pdf [pre-submission]
Really concise response! It's close to what we use. It gives me the opportunity to share some of the topics that concern me the most these days:
Gene expression: I've started substituting our gene read-counting pipeline (HTSeq or featureCounts) with tools that focus on transcripts (IsoEM, RSEM), since in various publications/simulations we see that the results are better even for simple gene expression estimation.
Aligners: I'm also really interested in HISAT (HISAT2, that is; which one do you currently use?), but in the manuscript the evaluation kept only reads with very few mismatches (max 3 per read, I think, which can be a bit low for reads of 100 nt and up). So I thought I'd wait for a few more simulations, or for TopHat3.
Kallisto/Salmon: How was your experience with those until now?
DiffExp: There was this big discussion on whether negative-binomial DE algorithms are suitable for the output of tools such as kallisto/salmon; since sleuth is available, at least for kallisto you're covered.
We'll probably start moving over to HISAT2 next week. I've been dealing with a rather major issue and haven't had time to install it for everyone yet.
Regarding Salmon/Kallisto, we're pretty happy with them so far, but again we're not using these for standard DE stuff, so I can't comment on that.
Yeah, we're normally happy with DESeq2, but Sleuth seems to be the only game in town for normal Kallisto/Salmon style DE analyses.
We have tried HISAT too, but we were surprised to find that for some reason it maps more reads to pseudogenes than to non-pseudogenes. As far as I understand, we were not the only ones to have noticed this; see: https://www.twitter.com/notSoJunkDNA/status/577813021176791041
Therefore we are using STAR!
Actually, most of the research groups I know are very interested in isoforms and ask for this kind of analysis. For example, as far as I have seen lately, all research groups working on prostate cancer always ask what the expression of AR-V7 (an isoform of the AR gene) is under a given treatment. AR-V7 is even used in clinical trials as a biomarker for predicting prostate tumor resistance to treatment! So isoforms/transcripts are now more important than they have ever been.
Does this trend remain with HISAT2? I know that HISAT2 is supposed to contain a number of improvements and bug fixes with respect to the original, and it also has some very interesting new features (e.g. allowing reads to be mapped directly against the transcriptome annotation using its graph-based alignment framework). I wonder if this trend is a byproduct of the actual approach taken by HISAT, or the result of some bias that may have been fixed in newer versions.
Yes, the same is observed for HISAT2!
Hi Devon & Co. - I just stumbled over the manual for DESeq2 v1.11.23, which mentions the option to use transcript-level abundance information from e.g. kallisto, sailfish or salmon. Curious if anyone has had a chance to try that out and can comment on the performance...? This would be a really nice option (I'm a big "fan" of DESeq2 and its support). Sleuth still seems to be working out some kinks at the moment as well...
Hi jhb80,
Yes, DESeq2 can now use tximport to process transcript-level abundance estimates. This paper demonstrates the benefits of using transcript-level estimates to obtain gene-level DE calls with tximport and Salmon. I've used tximport and DESeq2 very successfully to this end.
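As a rough illustration of what such gene-level summarization involves, here is a toy Python sketch (invented numbers and tx2gene mapping; tximport itself does this in R, with more care and options): per-gene counts are summed over transcripts, and a per-gene effective length is taken as the abundance-weighted average of the transcript effective lengths.

```python
# Toy sketch of transcript -> gene summarization in the spirit of
# tximport. All estimates and the tx2gene mapping are invented.

# per-transcript estimates: (estimated counts, TPM, effective length)
tx_est = {
    "tx1": (100.0, 10.0, 1000.0),
    "tx2": (300.0, 60.0, 500.0),
    "tx3": (50.0,  5.0,  800.0),
}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts, gene_len = {}, {}
for tx, (counts, tpm, eff_len) in tx_est.items():
    g = tx2gene[tx]
    gene_counts[g] = gene_counts.get(g, 0.0) + counts
    # accumulate numerator/denominator of the abundance-weighted length
    num, den = gene_len.get(g, (0.0, 0.0))
    gene_len[g] = (num + tpm * eff_len, den + tpm)

gene_eff_len = {g: num / den for g, (num, den) in gene_len.items()}

print(gene_counts)   # {'geneA': 400.0, 'geneB': 50.0}
print(gene_eff_len)  # geneA: (10*1000 + 60*500)/70 ≈ 571.4
```

The weighted gene length is what lets a downstream model account for isoform switching: if expression shifts toward a shorter isoform, the gene's effective length shrinks even when total counts stay flat.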
I don't have any personal experience with it, but given the track record of the tximport authors I expect the performance is good.
Hi Devon, could I use Kallisto/Sleuth for genes? I found that it is for mapping and quantifying abundances of transcripts. I got the DE transcript table (Q < 0.05) with associated gene names, but as you know, one gene can have multiple transcripts.
The output is something like ...
Could I use it for finding DE genes, like edgeR/DESeq? How?
There might be a way to use sleuth for gene-level tests, but I'm not familiar enough with it. You might want to have a look at the tximport package (http://f1000research.com/articles/4-1521/v1), which will make your life easier.
Sleuth performs best when used with bootstraps from kallisto. We find significant performance improvements when doing that as opposed to feeding estimated transcript counts into tools such as DESeq2 or edgeR.
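To illustrate why those bootstraps help, here is a toy Python sketch (invented numbers) of the quantity they expose: the spread of repeated abundance estimates gives a per-transcript "inferential" variance that a single point estimate would hide, and this is what sleuth models on top of biological variance.

```python
# Toy sketch: bootstrap replicates of the abundance estimate for two
# transcripts. All numbers are invented for illustration.
import statistics

bootstraps = {
    "uniquely_covered_tx": [100, 101, 99, 100, 102, 98],
    "ambiguous_tx":        [40, 90, 10, 70, 120, 30],  # multi-mapping heavy
}

# mean and sample variance across the bootstrap replicates
summary = {tx: (statistics.mean(reps), statistics.variance(reps))
           for tx, reps in bootstraps.items()}

for tx, (m, v) in summary.items():
    print(f"{tx}: mean={m:.1f}, inferential variance={v:.1f}")
# both transcripts could report a similar point estimate in a single run,
# but the second one is far less certain, and only the replicates show it
```

Feeding only the point estimates into a count-based DE tool discards exactly this uncertainty, which is the performance gap being described above.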
Hi Devon, I remember reading lots of posts saying that we can run Tophat+Cuffdiff (without Cufflinks+Cuffmerge) if we don't want to find novel genes (i.e. just use the reference genome's GTF file). I don't want to look for novel genes in merged.gtf, but I'd like to know whether there is a big difference between "Tophat+Cufflinks+Cuffmerge+Cuffdiff" and "Tophat+Cuffdiff". Is the full pipeline less biased, or does it actually not matter? Thank you!
If there is no big difference between the two, I'd like to skip Cufflinks, since it is the more time-consuming step. :)
You don't need cufflinks unless you care about novel isoforms.
sleuth does have a way to work at the gene level (https://rawgit.com/pachterlab/sleuth/master/inst/doc/intro.html). Furthermore, I used it recently for a few projects, filtering only transcript_biotype as protein_coding from biomaRt, and it seems to perform really well.
Methods like Salmon seem powerful and overall do a good job; however, I recently stumbled upon Salmon's (at least) behavior on "multi" vs "uniquely" mapping reads (in featureCounts/htseq-count terms; Salmon uses SMEMs/k-mers).
Even though there are some parameters to influence Salmon, like --maxOcc, it is hard to control the exact behavior for multi-mappings/SMEMs. I was especially concerned about transcripts/genes that get 0 counts with featureCounts but high counts/TPM with Salmon. As I had additional histone mark data from the same samples, I checked some of these, and they all showed signs of silencing marks rather than active marks.
So, simply treating Salmon's estimates as true/absolute expression can be dangerous, most likely due to multi-mappings (of SMEMs), and it is also hard to control this further in Salmon.
This is my experience so far... I have not checked whether Kallisto behaves similarly, or whether one can use the bootstrapped abundance estimates to identify transcripts/genes supported "only" by multi-mappings...
I can ask another question with some plots if this is of interest...
I would argue that what you're seeing is likely not due to Salmon's mapping-based mode, but rather the result of handling multi-mapping reads in a probabilistic inference framework vs. the hard counting used by e.g. featureCounts / htseq-count. Actually, you could check this fairly easily by providing Salmon with a BAM file produced by a traditional aligner mapping reads to the transcriptome (e.g. Bowtie2 against the transcriptome, or STAR against the genome with the appropriate flags to output alignments in transcriptomic coordinates). Though differences are certainly possible, we generally observe only small differences between "alignment-based" inference and inference based on fast mapping techniques.
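For intuition, here is a minimal, heavily simplified sketch of the probabilistic assignment idea: a plain EM over made-up read-to-transcript compatibilities. This is not Salmon's actual algorithm (no effective lengths, no fragment model), just the core contrast with hard counting.

```python
# Minimal EM sketch of probabilistic multi-mapping read assignment.
# Data are invented: each read lists the transcripts it is compatible with.

reads = [["t1"], ["t1"], ["t1", "t2"], ["t1", "t2"], ["t2"]]
txs = ["t1", "t2"]

theta = {t: 1.0 / len(txs) for t in txs}   # start from uniform abundances
for _ in range(50):
    counts = {t: 0.0 for t in txs}
    for r in reads:
        z = sum(theta[t] for t in r)
        for t in r:                         # E-step: fractional assignment
            counts[t] += theta[t] / z
    total = sum(counts.values())
    theta = {t: c / total for t, c in counts.items()}  # M-step

print({t: round(v, 2) for t, v in theta.items()})
# {'t1': 0.67, 't2': 0.33}
# A hard counter keeps only the unique reads (t1: 2, t2: 1) and discards
# the two shared ones; EM instead splits them in proportion to the
# inferred abundances, so ambiguous transcripts can receive mass that
# unique-counting would assign zero.
```

This is exactly why a transcript with no uniquely mapping reads can get 0 counts from featureCounts but a non-zero estimate from a probabilistic method.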
It is, of course, more difficult to assess the abundances of transcripts that are covered entirely (or almost entirely) by only multi-mapping reads. It's also important, I think, to note that there are likely many cases in the "other direction" (i.e. where a gene or transcript is truly active at non-trivial expression levels, but where counting-based methods miss this since it is driven primarily by multi-mapping reads).
All that being said, I'd be happy to take a look at some of the examples you've come up with, to see whether Salmon (and/or other transcript-level abundance estimation methods) are producing estimates different from what one would expect given the models.
Hi, it's an old thread, but I found a paper on the quantification of RNA-seq data: Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data. In its conclusion, it says: "Importantly, our analysis indicates that the explicit quantification of transcript isoforms leads to more accurate estimates of gene expression levels compared to the 'count-based' methods that are broadly used currently". But they evaluated the count-based approach with "bedtools intersect", not htseq-count or featureCounts, so I am a bit skeptical of this conclusion. What do you think about this?
Thanks