Hello community,
I'm currently performing DEG in a datasets of single cell using the "pseudo-bulk" approach, between two groups of individuals:
I tried to use both DESeq2 and Edge R (also MAST with random effects for individual, but I was not able to run it for some cell-types).
I've processes the data the same for both methods (by using the sum of reads across cell-types), and ran both methods according to the manual.
However, when I compared the results, DESeq2 returned a dozen genes as DEG, while edgeR returned ~ 12000 (across all cell-types).
My question is why there is this big difference in results.
Here is a snipet of the code (I want DEG in group, controling for AGE, DISEASE and SEX):
clusters <- unique(metadt$Manuscript_Identity)
#For each celltype
#Format data
cluster_metadata <- metadt[which(metadt$Manuscript_Identity == clusters[x]), ]
rownames(cluster_metadata) <- cluster_metadata$Subject_Identity
cluster_metadata$AgeScaled = scale(as.numeric(cluster_metadata$Age),center=0)[,1]
cluster_counts <- pb[[clusters[x]]]
#Run DESeq2
dds <- DESeqDataSetFromMatrix(cluster_counts, colData = cluster_metadata, design = ~ Sex + AgeScaled + Disease + group)
dds_lrt <- DESeq(dds, test="LRT", reduced = ~ Sex + AgeScaled + Disease)
res_LRT <- results(dds_lrt, contrast = c("group", "Y", "N"), alpha = 0.05)
sigLRT_genes <- res_LRT_tb %>%
filter(padj < 0.05) %>% arrange(padj)
#Run EdgeR
cluster_metadata$group <- relevel(cluster_metadata$group, ref = "Y")
design = model.matrix(~ Sex + AgeScaled + Disease + group, data = droplevels(cluster_metadata))
y = DGEList(counts = cluster_counts, group = cluster_metadata$group)
params_method = 'TMM'
y = calcNormFactors(y, method = params_method)
y = estimateGLMRobustDisp(y, design)
fit = glmFit(y, design = design)
test = glmLRT(fit)
res_LRT_tb = topTags(test, n = Inf) %>% as.data.frame() %>% rownames_to_column('gene') %>% arrange(FDR)
sig_edgerLRT_genes <- res_LRT_tb %>% filter(FDR < 0.05) %>% arrange(FDR)
I think you forgot to specify the contrast/coef in the line
As a default, EdgeR tests the latest coefficient. Since "group" was specified last in the formula, the following are equivalent:
Despite this, I did run some correlation analysis on the log2Foldchanges and on the pvalues from both methods, and the pearson correlation was 0.40 and 0.36 respectively. So I'm thinking It might be related to something in my code
Oh you are right, it is probably not the issue here. Just to make sure, can you paste the
design
matrix ?Here is the first couple rows,
Hm ok. Sorry, I don't see what is wrong with the code... If you figure it out, please let us know.