Hello I have two questions about DESeq2. One is about the interpretation of the output when more covariates are included in the model. Second is about the ranking metric for GSEA.
I used the normal DESeq workflow:
# Build dds
dds <- DESeqDataSetFromTximport(txi = tx.salmon, colData = met, design = ~ SEX + RECUR + threshold)
# Pre-filter and remove genes that have no or little information on gene expression
keep <- rowSums( counts(dds) >= 10 ) >= 15 #Filtering for threshold
dds <- dds[keep,]
# Run DeSeq2
dds <- DESeq(dds, test = 'Wald')
#Wald test
res = results(dds, contrast = c("threshold", "TRUE", "FALSE"), alpha = 0.05)
## Save the unshrunken results
res_noshrink <- res
# # Apply fold change shrinkage
res_lfcShrink_apeglm <- lfcShrink(dds, coef= gsub(' ', '_', gsub('^.+statistic: ', '', res@elementMetadata$description[4])), type="apeglm")
##Save it as a tibble for further GSEA analysis
res_tb <- res_lfcShrink_apeglm %>%
data.frame() %>%
rownames_to_column(var="gene") %>%
as_tibble()
## Remove any NA values (reduces the data by quite a bit) because KEGG pathway is better with EntreZid
res_tb <- res_tb %>%
arrange(padj)
res_entrez <- dplyr::filter(res_tb, entrezid != "NA")
## Remove any Entrez duplicates
res_entrez <- res_entrez[which(duplicated(res_entrez$entrezid) == F), ]
## Extract the foldchanges
foldchanges <- res_entrez$log2FoldChange
## Name each fold change with the corresponding Entrez ID
names(foldchanges) <- res_entrez$entrezid
## Sort fold changes in decreasing order
foldchanges <- sort(foldchanges, decreasing = TRUE)
foldchanges
First Questions: The logFoldChanges I get in the res_tb dataframe are the LogFoldChanges for my variable threshold and they are adjusted for SEX and RECUR in this case? Just to be sure.
Second Question: which metrics to use exactly to rank my genes for further GSEA? I read different literature and ask some colleagues that are bioinformaticians, but I am confused now. I used the shrunken log2FoldChanges. But then I got the feedback that it does not account for the significant p values. One colleague proposed to use the stat column instead. But I think that this is not right because it does not take the shrunken logFoldChanges into account (as long as I understood).
Input would be highly appreciated!
Thanks!