I am working on analysis for an RNA-Seq experiment with the end goal of using DESeq2 to generate a list of differentially expressed genes. We have 4 biological replicates for 4 conditions (differing genotypes in mice). We have 50 bp single-end Illumina read sets on total RNA (the samples were low input and required an UltraLow kit, hence total RNA rather than, say, polyA-selected/RiboZero).
I have run Salmon on the fastq files, using a mouse GRCm38 (v3) cDNA-all FASTA file to build the Salmon index, and have transcript-level read counts for every dataset. At first glance we have compared this to a dataset from a similar prior experiment (I don't know much about the analysis for that prior experiment), and the top expressed genes by count show the anticipated overlap (literally just looking by eye), so things superficially seem to be heading in the right direction.
Here is the meat of the question: the manual for DESeq(2) is quite clear that it requires raw counts as input, described as the number of reads mapped to gene i in sample j.
Salmon (and other similar tools such as Sailfish and Kallisto) counts at the transcript level. It would seem trivial to sum to the gene level, and I am having trouble convincing myself that this would be an error. E.g., presumably any read with mapping quality sufficient to be assigned to, for instance, transcript 1 of gene X will show up as a count there, whereas another read mapping to transcript 2 of gene X will show up under that transcript, but never be double counted. Thus, simply summing transcript counts to the gene level should be straightforward and should not introduce error.
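To make the idea concrete, here is a minimal sketch (Python/pandas) of the summation I have in mind. The file names (a per-sample quant.sf and a two-column tx2gene.tsv) and the column layout are my assumptions for illustration, not something I am claiming any tool prescribes.

```python
# Minimal sketch of summing Salmon transcript-level estimated counts to genes.
# Assumes each sample directory holds a tab-separated quant.sf with "Name" and
# "NumReads" columns, plus a hypothetical tx2gene.tsv mapping transcript -> gene.
import pandas as pd

tx2gene = pd.read_csv("tx2gene.tsv", sep="\t", header=None,
                      names=["transcript", "gene"])

quant = pd.read_csv("sample1/quant.sf", sep="\t")  # one Salmon output dir per sample

# Attach a gene ID to each transcript, then sum the estimated reads per gene.
merged = quant.merge(tx2gene, left_on="Name", right_on="transcript", how="left")
gene_counts = merged.groupby("gene")["NumReads"].sum()

print(gene_counts.sort_values(ascending=False).head())
```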
However, I wonder if I am glossing over something (or am simply naive about this) that would call into question the validity of this approach. The Salmon developers feel this is a valid, even superior, approach and state exactly this, but I find other discussions that seem to question the general practice of summing transcript-level counts to the gene level. I note that Kallisto and Salmon appear to operate using approximately similar approaches, so discussions relevant to one are probably relevant to the other.
Thoughts and discussion would be very helpful to me. I am happy to start over using a different mapping approach, and have used Bowtie, TopHat, Cufflinks, etc., but I very much like the simplicity of mapping only to transcripts and the speed with which this can be accomplished.
Thanks in advance for illumination and insight. I may (almost certainly) have glossed over or misrepresented a few things along the way in forming this question, as this is a relatively new field to me.
I will look into limma/voom but had not/have not used these tools.
Both the Kallisto and Salmon manuals state that they provide estimated read counts, and in the case of Salmon it is directly stated that you can sum these to the gene level for use in DESeq analysis. I fully believe them, but as I mentioned, I currently lack the technical sophistication to understand any differences, or potential pitfalls, of this approach versus prior workflows.
It is because of the Salmon developers' assertion that this summation is appropriate that I chose to implement Salmon in our workflow; it seems wonderfully suited to our analysis.
I was rather hoping someone would pop up who could provide a bit of an endorsement of the practice, as it's a newer field for me and the number of "competing" packages is quite bewildering. Further, since both Salmon and Kallisto seem to generate "equivalent" output, I figured any discussion here would help others starting out (like me) with these tools to understand exactly what they offer.
To me, they look like they have a great deal to offer: incredible speed, good accuracy, and direct output of read-level counts. I guess I started this post to gain a better understanding of these tools and their use in DESeq analyses, as they seem at first glance to be ideal for it.
Estimated counts might work with DESeq, but not DESeq2. You can round them, but then you're losing information. You could still use edgeR or limma/voom, since those won't typically complain about estimated counts.
Anyway, at least in theory what you're proposing should work. There's still a fair bit of validation to be done with some of these tools, but I think the general consensus is that Kallisto in particular looks quite promising.
"Kallisto in particular looks quite promising." --- Ouch ;P. While I agree that Kalliso is a very nice tool, hopefully, we'll be able to convince you, in the upcoming Salmon manuscript, that Salmon is even better. Specifically, it is able to account for quite a number of known (and often significant) experimental biases & factors that Kallisto's simplified model does not allow. Anyway, this isn't a swipe at Kallisto, I think they're both great tools :).
Hi Rob, is there a Salmon preprint out anywhere or any other sort of method explanation? I'd like to give a little journal club talk at the end of August on Kallisto/Sailfish and it'd be great to add some good information on Salmon.
Hi Devon,
We're finishing up the pre-print right now. It will be up on bioRxiv soon (I'll ping you once it appears).
Please let us know- here or wherever. Thanks!
We submitted the pre-print to BioRxiv today, so it should show up soon. Might make a post or whatnot when it drops ;P. Thanks for your interest!
It's now available at http://biorxiv.org/content/early/2015/06/27/021592.
Awesome, thanks!
A bit late, but I currently want to do the same thing (and I tried Salmon and DESeq2 on another dataset several months ago). From what I know, there is an option in Salmon that will generate gene-level read counts automatically if you provide a gene map, a file with two columns for gene ID and transcript ID. In that case, I think you don't need to sum manually from transcript-level to gene-level read counts. For input to DESeq2, there is no choice but to round to integers, and that is what I did. In general, I think it is a simple approach and, for me, it more or less shows the DE genes. When I compare against a visualization of the aligned reads, the DE genes from the Salmon-DESeq2 workflow look more or less correct.
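In case it is useful, here is a rough sketch (Python/pandas) of assembling an integer count matrix for DESeq2 from the gene-level output. I am assuming each sample directory contains a quant.genes.sf file with "Name" and "NumReads" columns (as produced when a gene map is supplied to Salmon), and the sample names below are just placeholders.

```python
# Sketch: build a gene-by-sample integer count matrix from Salmon gene-level output.
# Assumes one quant.genes.sf per sample directory with "Name" and "NumReads" columns;
# the sample directory names are hypothetical placeholders.
import pandas as pd

samples = ["geno1_rep1", "geno1_rep2", "geno2_rep1", "geno2_rep2"]

counts = {}
for s in samples:
    q = pd.read_csv(f"{s}/quant.genes.sf", sep="\t", index_col="Name")
    counts[s] = q["NumReads"]

# Round the estimated gene-level counts to integers so DESeq2 will accept them.
count_matrix = pd.DataFrame(counts).round().astype(int)
count_matrix.to_csv("gene_counts_for_deseq2.tsv", sep="\t")
```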
Yes, exactly, that's what I do.
Don't kallisto/sailfish/salmon produce estimated counts/TPM/etc. rather than integer counts? Even summing those to gene-level values won't produce DESeq2 compatible input. Have you considered limma/voom?
Clarification-
My reading of the DESeq2 manual seems at odds with the idea that summing "estimated" read counts would be inappropriate. The manual states that you cannot use "normalized" read counts, which I do not think describes the output of Kallisto or Salmon (for Salmon I am certain it is not normalized). The word "estimate" for read counts, which I have adopted from the developers, seems intended to reflect the fact that these tools (Salmon for sure, and I think Kallisto) allow multi-mapping reads, so a transcript can be assigned some fraction of a read and you get a floating-point value; but the data are not normalized, they are still direct counts. My understanding of "normalized" implies a quite different treatment of the data. Thus I don't understand why you suggest we should not use DESeq2 here; can you expand/clarify?
Although I agree you're introducing "error" by rounding to an integer, our read counts for highly expressed genes are in the tens of thousands, so I do not see how integer rounding is anything to worry about. Ideally one would sum as floating point and round to an integer after the summation, so at most you're introducing a 0.5-count error and usually less; this seems trivial. Sure, for genes that are essentially not expressed one might introduce a more significant error, but I thought DESeq2 was designed to eliminate the problems with prior approaches where small counts led to frequent differential calls. So I was rather hoping that DESeq2 would actually be better suited to handle this, as its model assumes, or perhaps I should say accounts for, the noise inherent in low-count samples; although we would be adding to that noise, it should not be significant except for genes with near-zero counts.
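As a toy illustration of why I think the error is bounded (the numbers are made up purely for illustration): summing the fractional transcript counts first and rounding once keeps the per-gene error under 0.5 reads, whereas rounding each transcript first can accumulate more.

```python
# Toy example with made-up fractional transcript counts for one gene.
tx_counts = [10.4, 7.4, 3.4]  # true gene total = 21.2

sum_then_round = round(sum(tx_counts))              # 21 -> error of 0.2 reads
round_then_sum = sum(round(c) for c in tx_counts)   # 10 + 7 + 3 = 20 -> error of 1.2 reads

print(sum_then_round, round_then_sum)
```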
In any case, I sincerely hope we will not end up having to worry about pursuing genes with extremely small read counts that were called as differentially expressed, as that result would tend to suggest the biology is telling us our groupings are not so different and the experiment was somewhat useless in its outcome.
This isn't covered in the manual. There is no way to get DESeq2 to accept estimated counts without rounding them to integers. This is purely an implementational constraint, though getting around it would prove to be a royal pain. You can try this, but you'll just get errors.
I apologize; I think we're both saying the same thing.
By summing estimated transcript counts to the gene level and then rounding to an integer, I can generate data that DESeq2 should be fine with, or at least will accept as properly formatted input. Are you suggesting a deeper problem with the proposed summation, or were you thinking I was going to try to feed DESeq2 a bunch of floating-point numbers, or am I just entirely in need of a brain boost and missing the point here?
Can one just round the transcript counts and then use DESeq2 with the transcripts instead of the genes? I think gene-level expression differences can mask transcript-level expression differences, and this might be important to consider. Am I wrong?