My question is simple. Why is DESeq analysis for RNA-Seq reads considered to be a more reliable method for identifying upregulated/downregulated genes?
I don't know if I would call it more reliable, but it does do additional calculations that other methods don't. For one, DESeq2 "shrinks" the fold changes of genes that have low read counts. I don't pretend to understand the math behind it, but in general it reduces the fold change of any gene that has low read counts in one or both conditions. Genes with low read counts can have exaggerated fold changes. For example, imagine you have two conditions, each with 3 replicates. For gene A, the control read counts are 1, 2 and 2 (average 1.67) and the experimental read counts are 4, 3 and 4 (average 3.67). For gene B, the control values are 100, 200 and 200 and the experimental values are 400, 300 and 400. The calculated fold change for both genes is 2.2, and both may also be significant according to the adjusted p-value (I've checked, it happens). However, an average difference of 2 reads is not a lot, and I would not call that differentially expressed unless it is reproduced across a lot of replicates. DESeq2 therefore shrinks the fold change of such genes accordingly.
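To make the arithmetic in that example concrete, here is a small sketch in Python. It computes the raw fold change for both genes, then a damped fold change using a simple pseudocount. The pseudocount is only a crude stand-in to show the direction of the effect; it is not the estimator DESeq2 uses, and the value of 5 is an arbitrary assumption.

```python
# Read counts from the example above: three replicates per condition.
genes = {
    "A": {"control": [1, 2, 2], "experiment": [4, 3, 4]},
    "B": {"control": [100, 200, 200], "experiment": [400, 300, 400]},
}

# Assumption: an arbitrary pseudocount, chosen only for illustration.
PSEUDOCOUNT = 5.0


def mean(values):
    """Arithmetic mean of a list of counts."""
    return sum(values) / len(values)


def fold_change(control, experiment, pseudocount=0.0):
    """Ratio of mean counts, optionally damped by adding a pseudocount to both means."""
    return (mean(experiment) + pseudocount) / (mean(control) + pseudocount)


results = {}
for name, counts in genes.items():
    raw = fold_change(counts["control"], counts["experiment"])
    damped = fold_change(counts["control"], counts["experiment"], PSEUDOCOUNT)
    difference = mean(counts["experiment"]) - mean(counts["control"])
    results[name] = {"raw": raw, "damped": damped, "difference": difference}
    print(f"gene {name}: raw FC = {raw:.2f}, damped FC = {damped:.2f}, "
          f"mean difference = {difference:.0f} reads")
```

Both genes have the same raw fold change, but the damped value for gene A drops well below 2 while gene B barely moves, which is the qualitative behaviour described above.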
I've compared DESeq2 to edgeR, and while I like both methods, edgeR does return many significant genes that have exaggerated fold changes due to low (or zero) read counts, whereas DESeq2 shrinks those fold changes to the point where they are generally below my cutoff for differential expression. When I filter for differential expression, I generally use both the padj value and the fold change value. Unless you have a lot of replicates, low fold changes may not be completely accurate. Thus, I use DESeq2 specifically because it adjusts the fold changes of genes with low read counts.
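As a rough illustration of filtering on both padj and fold change, here is a sketch in Python. The rows are made-up toy values, and the cutoffs (padj below 0.05, absolute log2 fold change of at least 1) are assumptions, not recommendations. The column names follow the `log2FoldChange` and `padj` columns of a DESeq2 results table; a missing padj (NA in R) is represented as `None`.

```python
# Hypothetical results rows; the numbers are invented for illustration only.
results = [
    {"gene": "g1", "log2FoldChange": 2.5, "padj": 0.001},
    {"gene": "g2", "log2FoldChange": 0.4, "padj": 0.0005},
    {"gene": "g3", "log2FoldChange": -1.8, "padj": 0.03},
    {"gene": "g4", "log2FoldChange": 3.1, "padj": 0.40},
    {"gene": "g5", "log2FoldChange": -2.2, "padj": None},
]


def is_differentially_expressed(row, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Keep a gene only if it passes both the padj and the fold change cutoff."""
    if row["padj"] is None:  # no adjusted p-value available, so do not call it
        return False
    return row["padj"] < padj_cutoff and abs(row["log2FoldChange"]) >= lfc_cutoff


selected = [row["gene"] for row in results if is_differentially_expressed(row)]
print(selected)
```

Here g2 is dropped for its small fold change despite a low padj, and g4 is dropped for its high padj despite a large fold change.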
It's not likely that any one method is better for all experiments, and methods can be evaluated across many metrics (accuracy in estimating effect size, control of FDR, sensitivity, robustness, etc.). Just a few important ways in which even your standard, bulk RNA-seq experiment can differ:
We like to remind users that, with very many replicates and exchangeable samples, rank tests or permutation tests are great because you don't have to make distributional assumptions. It's just that investigators often don't want to spend money on extra experiments when e.g. 3 or 5 replicates per group will suffice in finding the large effects, and allow them to examine more conditions.
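To illustrate why permutation tests need many replicates, here is a sketch of an exact two-sided permutation test on the difference in group means, in Python. The function name is hypothetical, and the counts are borrowed from the gene B example earlier in the thread.

```python
from itertools import combinations
from statistics import fmean


def exact_permutation_pvalue(group_a, group_b):
    """Exact two-sided permutation p-value for the difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(fmean(group_a) - fmean(group_b))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    # Enumerate every way of relabelling n_a of the pooled samples as group A.
    for chosen in combinations(indices, n_a):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point noise in the comparison.
        if abs(fmean(a) - fmean(b)) >= observed - 1e-9:
            extreme += 1
    return extreme / total


# Three replicates per group, perfectly separated.
p_three = exact_permutation_pvalue([100, 200, 200], [400, 300, 400])
print(p_three)  # 0.1
```

With 3 versus 3 samples there are only 20 possible relabellings, so the smallest achievable two-sided p-value is 2/20 = 0.1, even when the groups are perfectly separated. That is why these tests only become useful with many replicates, and why parametric methods are used for small designs.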
With these differences in mind, I'd recommend looking for evaluations by 3rd parties.
Because it has been shown that when you let an octopus decide which genes are significantly deregulated, the results cannot be reproduced as well as with DESeq [citation needed]
Kudos for Octopus Joke :)
Glad that others gave answers to his question, I was also asking these kinda questions when I was new in bioinformatics
I think what Ido is trying to say is, your question is lacking a second item to compare to, 'better' with respect to what?
btw.: I like the octopus predictor: there was once an octopus with very good results in prediction https://en.wikipedia.org/wiki/Paul_the_Octopus
Fair point. Say you're comparing two different samples, and you're trying to screen for highly upregulated and highly downregulated genes. Which would be the more reliable method, and why: taking the ratio of the RPKM of Gene A in sample A to the RPKM of Gene A in sample B, or running a DESeq analysis and picking the genes with the lowest padj values?
Thanks!
If your question ends up becoming, "why should I use a complicated method like DESeq2 or edgeR rather than just doing a t-test on RPKMs?", then have a read through those papers and also the paper on limma. The rationale is described in them.
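For reference, the RPKM ratio asked about above can be sketched in a few lines of Python. The gene length, read counts and library sizes are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical values for one gene in two unreplicated samples.
GENE_LENGTH_BP = 2000
rpkm_a = rpkm(read_count=500, gene_length_bp=GENE_LENGTH_BP, total_mapped_reads=20_000_000)
rpkm_b = rpkm(read_count=1500, gene_length_bp=GENE_LENGTH_BP, total_mapped_reads=30_000_000)

ratio = rpkm_b / rpkm_a
print(rpkm_a, rpkm_b, ratio)
```

The ratio is a single number with no estimate of variability attached, so on its own it cannot tell you whether the difference is larger than what you would see between two replicates of the same condition.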
If there are no replicates, no method is better. The padj values do not make sense without replicates in your data. I think that GFold works quite well when there are no replicates.
# Sorry for Spam #
Yes, I remember that Thomas Muller said somewhere that he wants to eat that Paul the Octopus :)
I'm so glad I don't have to hear about "Orakel Krake Paul" every night on the news any more :)