Hi!
Recently I'm working on an R pipeline for RNA-seq analysis. My goal is to create a script that compares my samples to each other and compute the DE.
So far I created my script and its working well. But when I implemented loops which would make my life easier, I noticed that the result are different. To be precise: First I did the comparisons sample pair by pair, than with the loops I loaded in all my data and made the comparison sample pair by pair.
What I noticed is everything goes as it should be, than when I reach the part where I call the
fit <- glmQLFit(dge,dge$design,robust = TRUE)
function, the a little difference shows up:
The first pic shows when ALL the samples loaded in, the second is when only 2 group (6 samples) is loaded in. After that when I want to make a
res <- glmQLFTest(fit,contrast = c(-1, 0,1,0))
it obviously shows different result. Edger uses different calculations based on the loaded samples? Also it makes the DE totally different when I try to visualize it. For the 2 example:
One plot shows a lot of DE genes (this is where only 2 groups (6 samples) were compared), and the other only shows 2 (this is where all the samples processed together).
My code looks the same even if I use 2 group samples or multiple group samples.
What could be the problem? Thank you for your time and help!
If I understand correctly, your question is basically that you get a slightly different result if you calculate DE using all your data versus subsets of your data. The reason the results will not be identical is because the variance estimate for a gene will be much more informed in a larger data set, across more conditions or samples, than in a smaller data set. It's not so much a problem as a feature. Better estimates of variance allow better evaluation of significance.
My code :