Transcript to gene level count for DEseq(2) use- Salmon/Sailfish/Kallisto etc.
2
56
Entering edit mode
9.5 years ago
mhockin ▴ 610

I am working on analysis for an RNASeq experiment with the end goal of using DESeq2 to generate a list of differentially expressed genes. We have 4 biological replicates for 4 conditions (differing genotypes in mice). We have 50 bp single ended Illumina read sets, on total RNA (they had low input and needed to go with an UltraLow kit thus the total RNA rather than say pA selected/RiboZero).

I have run Salmon on the fastq files using a mouse GRCm38 v3 cDNA all fasta file as the source to build a Salmon index file, and have read counts at transcript level for every data set. We have compared this at first glance to a prior dataset from a similar prior experiment (I don't know much about the analysis for this prior exit) and the top expressed genes by count show anticipated overlap- literally just looking by eye) so things superficially seem to be heading in the right direction.

Here is the meat of the question, the manual for DESeq(2) is quite clear that they require raw counts as input, and they use a description of read counts mapped to gene i in sample j.

Salmon (and other similar tools, Sailfish, Kallisto) work at a transcript level counting mechanism- it would seem to be trivial to sum to the gene level, and I am having trouble convincing myself that this would be an error. E.G.- presumably any read with mapping quality sufficient to be called as mapping to, for instance, transcript 1 of gene X, will show up as a count there, whereas another read mapping to transcript 2 of gene X will show up under that transcript, but never be double counted. Thus, simply summing transcript counts to gene level should be straight forward and not introduce error.

However, I wonder if I am glossing over something (or simply naive about this) that would bring to question the validity of this approach. The Salmon developers feel this is a valid even superior approach and state exactly this, but I find other discussions that seem to question this general approach, e.g. summing transcript to gene level counts - I note Kallisto and Salmon appear to operate using approximately similar approaches and thus discussions relevant to one are probably relevant to the other.

Thoughts and discussion would be very helpful to me. I am happy to start over using a different mapping approach and have used Bowtie, tophat, cufflinks etc, but am very much liking the simplicity of approach in mapping only to transcripts and the speed with which this can be accomplished.

Thanks in advance for illumination and insight. I may (almost certainly) have glossed over or misrepresented a few things along the way in forming this question as this is a relatively new field to me.

DESeq Kallisto RNASeq Salmon DESeq2 • 38k views
ADD COMMENT
2
Entering edit mode

I will look into limma/voom but had not/have not used these tools.

Both the Kallisto and Salmon manuals state that they provide estimated read counts, and in the case of Salmon it is directly stated that you can sum these to gene level for use in DESeq analysis. I fully believe them, but as I mentioned, I currently lack the technical sophistication to understand any differences, or potential pitfalls, to this approach vs prior workflows.

It is because of the Salmon developers assertion that this summation is appropriate that I choose to implement Salmon in our workflow- it seems wonderfully suited for our analysis-

I was rather hoping someone would pop up who could provide a bit of an endorsement to the practice as its a newer field for me and the number of "competing" packages is quite bewildering. Further, since both Salmon and Kallisto seem to generate "equivalent" output I figured any discussion herein would help others starting (like me) with these tools to understand exactly what they offer.

To me, they look like they have very much on offer, incredible speed, good accuracy, and direct output at read level counting. I guess I started this post to gain a better understanding of these tools and their use in DESeq analyses- as they seem at first glance to be ideal for it-

ADD REPLY
1
Entering edit mode

Estimated counts might work with DESeq, but not DESeq2. You can round them, but then you're losing information. You could still use edgeR or limma/voom, since those won't typically complain about estimated counts.

Anyway, at least in theory what you're proposing should work. There's still a fair bit of validation to be done with some of these tools, but I think the general consensus is that Kallisto in particular looks quite promising.

ADD REPLY
1
Entering edit mode

"Kallisto in particular looks quite promising." --- Ouch ;P. While I agree that Kalliso is a very nice tool, hopefully, we'll be able to convince you, in the upcoming Salmon manuscript, that Salmon is even better. Specifically, it is able to account for quite a number of known (and often significant) experimental biases & factors that Kallisto's simplified model does not allow. Anyway, this isn't a swipe at Kallisto, I think they're both great tools :).

ADD REPLY
0
Entering edit mode

Hi Rob, is there a Salmon preprint out anywhere or any other sort of method explanation? I'd like to give a little journal club talk at the end of August on Kallisto/Sailfish and it'd be great to add some good information on Salmon.

ADD REPLY
1
Entering edit mode

Hi Devon,

We're finishing up the pre-print right now. It will be up on bioRxiv soon (I'll ping you once it appears).

ADD REPLY
0
Entering edit mode

Please let us know- here or wherever. Thanks!

ADD REPLY
1
Entering edit mode

We submitted the pre-print to BioRxiv today, so it should show up soon. Might make a post or whatnot when it drops ;P. Thanks for your interest!

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

Awesome, thanks!

ADD REPLY
2
Entering edit mode

A bit late but currently I want to do the same thing (although I have tried Salmon and DESeq2 for naother data several months ago). From what I know, there is an option in Salmon that will generate gene level read counts automatically by providing gene map, a file with 2 columns for gene id and transcript id. In this case, I think you needn't to sum it manually from transcript read count to gene level read count. For input to DESeq2, there is no other choices but to round it to integer and I did that. In general, I think it is a simple way and for me, more or less can show the DE genes. If I compare with the visualization of the aligned reads, I think the DE genes from Salmon-DESeq2 workflow more or less correct.

ADD REPLY
0
Entering edit mode

Yes exactly thats what I do.

ADD REPLY
0
Entering edit mode

Don't kallisto/sailfish/salmon produce estimated counts/TPM/etc. rather than integer counts? Even summing those to gene-level values won't produce DESeq2 compatible input. Have you considered limma/voom?

ADD REPLY
0
Entering edit mode

Clarification-

My reading of the manual for DESeq2 seems at odds with the idea that summation of "estimated" read counts would not be appropriate. They state that you cannot use "normalized" read counts, which is not -I think- the output of Kallisto nor Salmon (for which I am certain its not normalized). The word "estimate" for read counts that I have adopted from the developers seems to be intended to address the fact that they (Salmon for sure, Kallisto I think) allow multi mapping reads, and thus a transcript can be assigned some fraction of a read, so you get a floating point- but the data is not normalized- its still a direct count. My understanding of the word normalized implies a quite different treatment of the data. Thus I don't understand why you suggest we should not use DESeq2 here... can you expound/clarify?

Although I agree your introducing "error" in truncating to an integer, our read counts for highly expressed genes are tens of thousands, I do not see how an integer truncation is anything to worry about? Ideally one would sum as floating point and truncate to integer after the summation, so at most, your introducing a 0.5 count error and usually less... this seems trivial. Sure, for those genes that are basically not expressed one might induce a more significant error but I thought DESeq2 was designed to eliminate the problems with prior approaches where small count sizes lead to frequent diff calls- so I was rather hoping that DESeq2 would actually be better suited to handle this as its model assumes or maybe I should say accounts for the noise inherent with low count samples, and although we would be adding to this noise it should not be significant but for those genes with near zero counts.

In any case, I sincerely hope we will not end up having to worry about pursuing genes with extremely small read counts that ended up being called as differentially expressed- as that result would tend to suggest biology is telling us that our groupings are not so different and the experiment was somewhat useless in its result.

ADD REPLY
0
Entering edit mode

This isn't covered in the manual. There is no way to get DESeq2 to accept estimated counts without rounding them to integers. This is purely an implementational constraint, though getting around it would prove to be a royal pain. You can try this, but you'll just get errors.

ADD REPLY
0
Entering edit mode

I apologize I think were both saying the same thing.

By summing estimated transcript counts to the gene level and then truncating to an integer I can generate data that DESeq2 should be fine with- or at least will accept as properly formatted input- are you suggesting a deeper problem with the proposed summation or were you thinking I was going to try to feed DESeq a bunch of floating point numbers or am I just entirely needing a brain boost and not seeing the point here.

ADD REPLY
0
Entering edit mode

Can one just round the transcripts counts and then use DESeq2 with the transcripts instead of the genes? I think gene expression differences mask transcript expression differences and this might be important to consider. Am I wrong?

ADD REPLY
18
Entering edit mode
9.5 years ago
Michael Love ★ 2.6k

We use the requirement for discrete counts as input so that users don't try to input normalized counts (for example dividing out library size), which would be inappropriate and my main concern here. If you can input a matrix which is close to a count of fragments which uniquely align to each gene (so no double counting of fragments from the different transcripts of a gene, and no proportioning of fragments which can align to different genes), then the methods should apply, but I haven't had time to test this yet so it's not included in any workflows.

Update [12/7/2015]: We now have a suggested workflow for importing transcript estimates from Sailfish, Salmon, or kallisto and summarizing to the gene-level for use with count-based tools like edgeR, DESeq2, limma-voom. The package is on Bioconductor and has a vignette with example code:

http://bioconductor.org/packages/tximport

ADD COMMENT
2
Entering edit mode

Thanks for this answer, this is more or less what I had presumed. Even better, DESeq2 appears to work well with Salmon-

I wrote a short scrip to sum to gene level from the Salmon output and generated a bunch of files appropriate for R-import and managed to get everything into DESeq2 for analysis of the 4 data sets (4 samples each). Indeed DESeq2 appears to have done perfectly well with this input. I re-created pretty much every figure from the manual so as to ensure that my data was indeed looking like a "normal" DESeq2 data set and all was well. Salmon appears to produce very usable data for this application and I have plans to incorporate this into another analysis that's coming in the future.

ADD REPLY
4
Entering edit mode

Would you be willing to share this script with programming novices such as myself? Sounds exactly like the thing I need, wanting to do an RNAseq analysis with Salmon and DEseq2. I'm okay using R, just everything that only runs on Linux still freaks me out a bit, so if this script exists already... would be greatly appreciated!

ADD REPLY
1
Entering edit mode

+1 on sharing the script, @mhockin

ADD REPLY
0
Entering edit mode

Sorry I could not find the script link anymore @mhockin

ADD REPLY
14
Entering edit mode
9.2 years ago
Lior Pachter ▴ 700

There is some confusion in the answers to this question that hopefully I can clarify with the three comments below:

  1. kallisto produces estimates of transcript level counts, and therefore to obtain an estimate of the number of reads from a gene the correct thing to do is to sum the estimated counts from the constituent transcripts of that gene. Of note in the language above is the word "estimate", which is necessary because in many cases reads cannot be mapped uniquely to genes. However insofar as obtaining a good estimate, the approach of kallisto (and before it Cufflinks, RSEM, eXpress and other "transcript level quantification tools") is superior to naïve "counting" approaches for estimating the number of reads originating from a gene. This point has been argued in many papers; among my own papers it is most clearly explained and demonstrated in Trapnell et al. 2013.

  2. Although estimated counts for a gene can be obtained by summing the estimated counts of the constituent transcripts from tools such as kallisto, and the resulting numbers can be rounded to produce integers that are of the correct format for tools such as DESeq, the numbers produced by such an approach do not satisfy the distributional assumptions made in DESeq and related tools. For example, in DESeq2, counts are modeled "as following a negative binomial distribution". This assumption is not valid when summing estimated counts of transcripts to obtain gene level counts, hence the justified concern of Michael Love that plugging in sums of estimated transcript counts could be problematic for DESeq2. In fact, even the estimated transcript counts themselves are not negative binomial distributed, and therefore also those are not appropriate for plugging into DESeq2. His concern is equally valid with many other "count based" differential expression tools.

  3. Fortunately there is a solution for performing valid statistical testing of differential abundance of individual transcripts, namely the method implemented in sleuth. The approach is described here. To test for differential abundance of genes, one must first address the question of what that means. E.g. is a gene differential if at least one isoform is? or if all the isoforms are? The tests of sleuth are performed at the granularity of transcripts, allowing for downstream analysis that can capture the varied questions that might make biological sense in specific contexts.

In summary, please do not plug in rounded estimates of gene counts from kallisto into DESeq2 and other tools. While it is technically possible, it is not statistically advisable. Instead, you should use tools that make valid distributional assumptions about the estimates.

ADD COMMENT
10
Entering edit mode

I think there is still confusion. Here are my (counter?) views:

  1. I think we all agree; proper portioning of reads to transcripts (assuming the annotation is reasonably accurate) is better than raw gene-level union counting. The claim of "superior" seems to be mostly in hypothetical/simulation situations, although I've now experienced one convincing example (unpublished data) where it makes a difference. Across a small number of datasets, we found very little difference between proper and raw counting (see Figure 4 of http://goo.gl/ySQTlu), but I'd be glad if someone can point me to good examples where it makes a difference.

  2. Transcript-level counts are still count-like and from what we can tell, applying edgeR/DESeq2 to transcript-level counts (or transcript-level estimates properly aggregrated to the gene-level) is ok. There may be room for improvement, but the mean-variance or dispersion-mean relationships are basically the same, P-value distributions look healthy and I don't see much evidence for the "assumption is not valid" claim above. Would be happy to see this evidence if there is some. I see it as a "all models are wrong, but some are useful" situation and vanilla NB is still useful here. More discussion @F1000: http://f1000research.com/articles/4-1521/v1

  3. I agree that sleuth is a valid solution. The main potential disadvantage that I can see is that it models the log of the expression (this introduces a different mean-variance relationship that needs to be modeled) and since expression is count-like, I would prefer to keep it in a count model. We and others have observed that this log-transform approach (e.g., limma-voom) loses a bit of sensitivity (e.g., http://nar.oxfordjournals.org/content/42/11/e91/F5.expansion.html); certainly not a crisis, but I think you can do better. As Lior mentions, it's important to define the question. In general, we favour differential expression analyses at the gene-level and observe that DE "at granularity of transcripts" often gets cast back to the gene-level, so you might as well operate at the gene-level from the outset. Of course, some situations do beg for transcript-level analyses. More discussion in our paper: http://f1000research.com/articles/4-1521/v1

In summary, feel free to plug in proper transcript- or gene-level count-like estimates into edgeR/DESeq2. I suspect there is more to come.

ADD REPLY
2
Entering edit mode

Who's right? Mark says it's ok but Lior says it's not????

ADD REPLY
2
Entering edit mode

It is a bit more complex than who is right or wrong buddy.

ADD REPLY
0
Entering edit mode

buddy

buddy

ADD REPLY

Login before adding your answer.

Traffic: 985 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6