Dear Biostars community:
I ran into a question when I was doing RNA-seq analysis using DeSEQ2. I have 4 groups, 3 samples per group, and set my groups using the following code:
(condition <- factor(c(rep("ctl", 3), rep("A", 3), rep("B", 3), rep("C", 3))))
Can anyone tell me how the comparison in DeSEQ2 is done? I am not sure what the p value and log2FC mean in the default output. Supposedly it should be outputting ctl vs A in the default output. But the data is different when I have only ctl and A as input (2 groups). Also, is there a way to specify how the program performs the comparison? I read some previous threads but the answers were not very clear. Any help will be highly appreciated!
Best,
Wenhan
Below is the code I used:
countdata <- read.table("B6_B6MHV_B6MHVWY.txt", header=TRUE, row.names=1)
countdata <- countdata[ ,6:ncol(countdata)]
colnames(countdata) <- gsub("\\.[sb]am$", "", colnames(countdata))
countdata <- as.matrix(countdata)
head(countdata)
(condition <- factor(c(rep("ctl", 3), rep("A", 3), rep("B", 3), rep("C", 3))))
library(DESeq2)
(coldata <- data.frame(row.names=colnames(countdata), condition))
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~condition)
dds
dds <- DESeq(dds)
# Plot
dispersions
png("qc-dispersions.png", 2000, 2000, pointsize=20)
plotDispEsts(dds, main="Dispersion plot")
dev.off()
# Regularized log transformation for clustering/heatmaps, etc
rld <- rlogTransformation(dds)
head(assay(rld))
hist(assay(rld))
# Colors for plots below
## Ugly:
## (mycols <- 1:length(unique(condition)))
## Use RColorBrewer, better
library(RColorBrewer)
(mycols <- brewer.pal(8, "Dark2")[1:length(unique(condition))])
# Sample distance heatmap
sampleDists <- as.matrix(dist(t(assay(rld))))
library(gplots)
png("qc-heatmap-samples.png", w=1500, h=2500, pointsize=1500)
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
col=colorpanel(100, "black", "white"),
ColSideColors=mycols[condition], RowSideColors=mycols[condition],
margin=c(10, 10), main="Sample Distance Matrix")
dev.off()
# Principal components analysis
## Could do with built-in DESeq2 function:
## DESeq2::plotPCA(rld, intgroup="condition")
## I like mine better:
rld_pca <- function (rld, intgroup = "condition", ntop = 500, colors=NULL, legendpos="bottomleft", main="PCA Biplot", textcx=1, ...) {
require(genefilter)
require(calibrate)
require(RColorBrewer)
rv = rowVars(assay(rld))
select = order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
pca = prcomp(t(assay(rld)[select, ]))
fac = factor(apply(as.data.frame(colData(rld)[, intgroup, drop = FALSE]), 1, paste, collapse = " : "))
if (is.null(colors)) {
if (nlevels(fac) >= 3) {
colors = brewer.pal(nlevels(fac), "Paired")
} else {
colors = c("black", "red")
}
}
pc1var <- round(summary(pca)$importance[2,1]*100, digits=1)
pc2var <- round(summary(pca)$importance[2,2]*100, digits=1)
pc1lab <- paste0("PC1 (",as.character(pc1var),"%)")
pc2lab <- paste0("PC1 (",as.character(pc2var),"%)")
plot(PC2~PC1, data=as.data.frame(pca$x), bg=colors[fac], pch=21, xlab=pc1lab, ylab=pc2lab, main=main, ...)
with(as.data.frame(pca$x), textxy(PC1, PC2, labs=rownames(as.data.frame(pca$x)), cex=textcx))
legend(legendpos, legend=levels(fac), col=colors, pch=20)
# rldyplot(PC2 ~ PC1, groups = fac, data = as.data.frame(pca$rld),
# pch = 16, cerld = 2, aspect = "iso", col = colours, main = draw.key(key = list(rect = list(col = colours),
# terldt = list(levels(fac)), rep = FALSE)))
}
png("qc-pca.png", 1500, 1500, pointsize=25)
rld_pca(rld, colors=mycols, intgroup="condition", xlim=c(-75, 35))
dev.off()
# Get differential expression results
res <- results(dds)
table(res$padj<0.05)
## Order by adjusted p-value
res <- res[order(res$padj), ]
## Merge with normalized count data
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
head(resdata)
## Write results
write.csv(resdata, file="diffexpr-results.csv")
Hello Kevin, I am using your example it is working well. I have few question from my data: 1. How can I remove batch effect from my data with DESEQ2. 2. I know the VST function available in DESEQ2 for remove batch effect, but after this step how can we do DE analysis from the adjusted data. Hope you can understand my problem.
Hey, there is a function called removeBatchEffects (limma), which may be of use to you. There is a practical example given here with DESeq2: https://support.bioconductor.org/p/76099/#93176
It is neither VST nor rlog that are removing batch, technically speaking.
Sorry Kevin I am confused with data I have, can you please have a look
I have gene expression raw counts like this
For each patients I also have some characteristics
I want differentially expressed genes, but I can not figure out how to let all of these meta data in my DEseq2 design For example differential expression which takes all of these variables into calculation
Can you help me please?
Thank you
The DESeq2 design formula behaves much the same as standard regression formulae. So, if you want to include these, you just need to add them like this:
That assumes that the effects of these are 'additive'. It is possible to have interactions and multiplicative effects, too, but these are more complex. If you are unsure, then it is better to keep the design formula very simple.
You may want to consider, for example, the fact that some of these columns are describing virtually the same trait / characteristic, in which case their usage in a model must be considered.
Thank you
I done so
Sorry, please take some time to debug this on your own. I have no time to take you through your analysis step-by-step. If you have a supervisor, please contact her / him / it
Thank you, Kevin, but i noticed that
If compare "A" vs "C", lfcShrink( ) with type = "apeglm" will next be impossible to do due to the lack of corresponding coef parameter. Does this mean I need to regenerate a new dds with "C" as a reference?
Hey, in that case, you can go back and re-level the factors such that you can obtain the required coefficient, as to which I allude here: A: How to change conditions in DESeq2
It is also now mentioned briefly in the vignette:
I have same problem. i have 3 sample , one with WT and 2 others are treatment1 and treatment2. It is confusing for me when I try either approach and e.g. for WT vs treatment1 contrast i am getting deseq2 result diiferent ,
-dds2<- DESeq(dds, betaPrior = FALSE) -resultsNames(dds2) -res <- results(dds2, name="condition_treatment1_vs_WT", alpha = 0.05) -summary(res)
-out of 26151 with nonzero total read count adjusted p-value < 0.05 LFC > 0 (up) : 2463, 9.4% LFC < 0 (down) : 2982, 11% outliers [1] : 2, 0.0076% low counts [2] : 5070, 19% (mean count < 5) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results
-compared with running it alone through pipeline
-des <- DESeq(dds) -res <- results(des, alpha = 0.05) -summary(res)
-out of 24475 with nonzero total read count adjusted p-value < 0.05 LFC > 0 (up) : 2657, 11% LFC < 0 (down) : 3140, 13% outliers [1] : 0, 0% low counts [2] : 3322, 14% (mean count < 5) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results
Can anyone explain why and which approach is right i.e. DESeq 1 vs 1 comparison or multiple group comparison?