(cross-posted on seqanswers a few days ago -- http://seqanswers.com/forums/showthread.php?t=49062, please let me know if posting here now is a violation of etiquette!)
Hi all,
I'd like to let you know about Salmon, a new tool we've been developing for isoform-level quantification from RNA-seq data. It is, conceptually, the successor to our Sailfish software, but it includes a significant number of improvements and relies on a very different methodology. It maintains or improves upon the main strengths of Sailfish (i.e. it is as fast or faster in most situations, and requires substantially less memory, especially for large transcriptomes) while providing a number of additional benefits. For example, it eliminates the need to build a parameter-dependent (k-mer size) index, and makes much better use of paired-end data and longer reads. In the testing we and others have done so far, it appears to be very accurate even in complex situations. It also provides both alignment-based and alignment-free quantification modes, to suit users with either need.
Salmon is fully open source, and is currently being developed on the develop branch of the Sailfish GitHub repository. The documentation is available via ReadTheDocs, the latest binaries are available here, and we welcome questions and discussion on the Sailfish Google Group. The manuscript is in preparation, but we already have a number of people testing and using the software, and we'd like to get input and feedback from the community as we finish the manuscript. So please, give Salmon a try. It's tasty ;).
--Rob
Hi Karl,
Salmon (and Sailfish) estimate read counts at the transcript level by "soft-assigning" multi-mapping reads to their different possible transcripts of origin, based on a probabilistic model. You can find a detailed description of how "estimated" read counts are derived in the Salmon pre-print.
The key here is that multi-mapping reads are never "double-counted": the estimated read counts sum to the total number of aligned reads, and each transcript's estimated count is an estimate of the actual number of reads originating from that transcript, with multi-mapping accounted for. There are also other, newer tools that are designed to deal with such estimated read counts directly (and which can be used with Kallisto, Salmon and Sailfish).
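To make the "no double-counting" point concrete, here is a toy sketch of soft assignment using a plain EM loop. It is not Salmon's actual model or code: it ignores transcript lengths, fragment length distributions and alignment scores, and the function name `soft_assign` and the read/transcript names are made up for illustration.

```python
def soft_assign(read_hits, n_iter=100):
    """Toy EM: split each read fractionally among the transcripts it maps to.

    read_hits: one list of transcript names per read (hypothetical input format).
    Returns a dict of estimated (fractional) read counts per transcript.
    """
    transcripts = sorted({t for hits in read_hits for t in hits})
    # Start from uniform relative abundances.
    abundance = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = {t: 0.0 for t in transcripts}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        # E-step: each read contributes a total weight of exactly 1,
        # split in proportion to the current abundance estimates.
        for hits in read_hits:
            total = sum(abundance[t] for t in hits)
            for t in hits:
                counts[t] += abundance[t] / total
        # M-step: new abundances are the normalized soft counts.
        abundance = {t: counts[t] / len(read_hits) for t in transcripts}
    return counts


# Two reads unique to tx_A, one unique to tx_B, one mapping to both.
toy_reads = [["tx_A"], ["tx_A"], ["tx_A", "tx_B"], ["tx_B"]]
estimated = soft_assign(toy_reads)
print(estimated, sum(estimated.values()))
```

Because every read hands out a total weight of 1, the estimated counts add up to the number of reads however the shared read ends up being split, and the shared read leans toward the transcript with more unique support.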
Thanks, Nico! As Colin Dewey mentions in this thread on the RSEM user group, TPM is a relative measure of abundance, and can be used to assess relative abundances across samples. However, as he suggests, this is rarely what one has in mind when one wishes to do an across-sample comparison. For that, you'll need an extra level of normalization, and people have considered many approaches. This recent bioRxiv paper, and the Dillies et al. paper referenced therein, find that the best-performing normalization methods seem to be TMM and the DESeq normalization.
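For anyone unfamiliar with what this kind of between-sample scaling does, below is a minimal sketch of the median-of-ratios idea behind the DESeq normalization. It is a simplified illustration in plain Python, not the DESeq package's implementation, and the function name `size_factors` and the toy values are hypothetical. It also does not settle whether the input should be estimated counts or TPM.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios size factors (simplified illustration).

    counts: dict mapping sample name -> list of per-gene values,
            with genes in the same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_ratios = {s: [] for s in samples}
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if min(values) <= 0:
            continue  # genes with a zero in any sample have no usable geometric mean
        # Log of the gene's geometric mean across samples (the pseudo-reference).
        log_geo_mean = sum(math.log(v) for v in values) / len(values)
        for s in samples:
            log_ratios[s].append(math.log(counts[s][g]) - log_geo_mean)
    # A sample's size factor is the median ratio to the pseudo-reference.
    return {s: math.exp(statistics.median(log_ratios[s])) for s in samples}


# sample_2 is sample_1 at twice the depth, plus one gene that is zero in sample_1.
toy = {"sample_1": [10, 20, 30, 0], "sample_2": [20, 40, 60, 5]}
factors = size_factors(toy)
normalized = {s: [v / factors[s] for v in toy[s]] for s in toy}
print(factors)
print(normalized)
```

Dividing each sample by its size factor puts the samples on a common scale, so a gene that differs only because of sequencing depth ends up with the same normalized value in both samples.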
So to summarize, you'd say that applying a scale normalization such as TMM, directly on the TPM values produced by Salmon, is the way to go prior to differential gene expression analysis?