In my RNA-seq data, I have 6 groups (A, B, C, D, E, F). I'd like to compare A and D. I did DE analysis with limma. I found that including all groups in the model versus including only the 2 comparison groups gave me very different numbers of DEGs (at P < 0.05). For example, the first method below detected 6000 DEGs while the 2nd method detected only 900. Why is there such a huge difference between the two methods, and which one should I use?
Here are 2 ways to do the DE analysis:
1. Include all the groups and comparisons in the design matrix and contrast matrix:
design <- model.matrix(~ 0 + group, data)
contrast <- makeContrasts(
  A.vs.B = A-B, A.vs.C = A-C, A.vs.D = A-D, A.vs.E = A-E, A.vs.F = A-F,
  B.vs.C = B-C, B.vs.D = B-D, B.vs.E = B-E, B.vs.F = B-F,
  C.vs.D = C-D, C.vs.E = C-E, C.vs.F = C-F,
  D.vs.E = D-E, D.vs.F = D-F,
  E.vs.F = E-F,
  levels = design  # makeContrasts needs the design (or its colnames) via 'levels'
)
fit1 <- lmFit(object = data, design = design)
fit2 <- contrasts.fit(fit1, contrast)
fit2 <- eBayes(fit2)
topTable(fit2, coef = "A.vs.D")  # coef should be the contrast name as a string
2. Subset group A and D, and focus only on these 2 groups:
design <- model.matrix(~ 0 + group_sub, data_sub)
contrast <- makeContrasts(A.vs.D = A-D, levels = design)
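For completeness, a minimal sketch of how the rest of the subset analysis presumably looks (the post omits these steps; `data_sub` and `group_sub` are assumed to be the A/D-only expression object and group factor):

```r
library(limma)

# Hypothetical completion of method 2, mirroring method 1's steps
design <- model.matrix(~ 0 + group_sub, data_sub)
colnames(design) <- levels(group_sub)            # e.g. "A", "D"
contrast <- makeContrasts(A.vs.D = A - D, levels = design)

fit1 <- lmFit(object = data_sub, design = design)
fit2 <- contrasts.fit(fit1, contrast)
fit2 <- eBayes(fit2)   # variance moderation here borrows information only from the A and D samples
topTable(fit2, coef = "A.vs.D")
```

Note that in this version `lmFit` has fewer residual degrees of freedom and `eBayes` shrinks variances using only the A/D samples, which is one reason the two methods can return different moderated statistics.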
I wrote some code specifically to compare the above 2 methods. Here are the code and data.
It's hard to believe that subsetting the data by columns to just the samples you are testing would change the DEG count from 6000 to 900, and that this is OK with folks here. My guess is that 6000 DEGs is the correct result and you messed up somewhere in the second limma run.
Davide, we are working on minimal information from the user.
Thank you, Davide and Kevin. I included my script and data in this post. Please take a look at them. I really appreciate it if you can help me.
Thanks, but there is a lot of code missing from your second 'subgroup' analysis. Also, it would help to show the output results tables so that we could see that information. Anything else that you could output would help, too; for example, even something simple like
str(fit2)
from both #1 and #2. In general, though, your approach #1 should be used, unless your sub-groups are so completely different in their transcriptomic profiles that analysing them in the same dataset introduces bias (e.g., analysing neurons and melanocytes together). Approach #1 also estimates the residual variance and the empirical Bayes priors from all samples, which changes the moderated statistics relative to a two-group subset.
Hi Kevin, I literally included all my data and code in the post. Please take a look at the last sentence in the above post. There is a link to my data and code. Sorry for not making it clear.
Thanks, here is your full script:
Now I know more about your data and workflow.
Why do you feel that you need to adjust for sex, too, and RIN? You should also be filtering by adjusted p-values, not nominal p-values (Edit: per the current default, not specifying an adjustment method means that BH is chosen). Your threshold for log [base 2] fold change is also very low.
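For reference, a minimal sketch of filtering on adjusted p-values directly in `topTable` (the `lfc` value shown, log2(1.2), is an assumption matching the FC > 1.2 threshold mentioned elsewhere in this thread):

```r
# topTable's p.value argument filters on the *adjusted* p-value
# (BH/FDR by default via adjust.method = "BH"); lfc filters on |log2 fold-change|
deg <- topTable(fit2, coef = "A.vs.D",
                number = Inf,
                adjust.method = "BH",   # Benjamini-Hochberg (the default)
                p.value = 0.05,         # keep genes with adjusted p < 0.05
                lfc = log2(1.2))        # |log2 FC| > ~0.263
nrow(deg)                               # number of DEGs at these thresholds
```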
Irrespective of everything, 3 people in this thread have given the general consensus that approach #1 is better, except for this situation:
grayapply2009, if you want a definitive answer, I encourage you to post this question on Bioconductor Support (where I am also Moderator). There, the EdgeR developer is more likely to respond, and it feels like only hearing from him will close this issue off for you. Please link back to this thread when you post there. Thanks!
Thank you, Kevin. I just don't understand why subsetting the data could cause such a huge difference. I think I'll just have to read more about their algorithm.
By the way, p.value = 0.05 in topTable represents adjusted p < 0.05. FC > 1.2 is the conventional threshold our lab has been using. I didn't adjust for sex because I only compared within females. I adjusted for RIN because RIN is significantly correlated with most of the variables.
Why don't you create a single R script that you feed one file containing the complete set and then another file containing the subset, and compare the results?
You can use the same code eg:
design <- model.matrix(~ 0 + group, data)
contrast <- makeContrasts(A.vs.D = A-D, levels = design)
That would give you some confidence that the code is identical while the only variable changing is your input file, and then you can determine what the DEGs are for each input.
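That suggestion could be sketched as a single wrapper function fed either input (all file names here are hypothetical; the expression and phenotype file formats are assumptions):

```r
library(limma)

# Hypothetical wrapper: identical code path for both runs; only the
# input files (full data vs A/D subset) change
run_AvsD <- function(expr_file, pheno_file) {
  data  <- read.delim(expr_file, row.names = 1)  # genes x samples matrix
  pheno <- read.delim(pheno_file)                # one row per sample, 'group' column
  group <- factor(pheno$group)

  design <- model.matrix(~ 0 + group)
  colnames(design) <- levels(group)
  contrast <- makeContrasts(A.vs.D = A - D, levels = design)

  fit <- eBayes(contrasts.fit(lmFit(data, design), contrast))
  topTable(fit, coef = "A.vs.D", number = Inf, p.value = 0.05)
}

full_deg   <- run_AvsD("expr_all.txt", "pheno_all.txt")
subset_deg <- run_AvsD("expr_AD.txt",  "pheno_AD.txt")
c(full = nrow(full_deg), subset = nrow(subset_deg))
```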
That IS what I did. Please take a look at my data and the single script I provided in this post. The link is in the last sentence of the post.
I agree with the points Kevin is making. What are the DEGs after using adjusted p-values for both sets?