DESeq2 multiple group comparison
1
5
Entering edit mode
5.9 years ago
dazhudou1122 ▴ 140

Dear Biostars community:

I ran into a question when I was doing RNA-seq analysis using DeSEQ2. I have 4 groups, 3 samples per group, and set my groups using the following code:

(condition <- factor(c(rep("ctl", 3), rep("A", 3), rep("B", 3), rep("C", 3))))

Can anyone tell me how the comparison in DeSEQ2 is done? I am not sure what the p value and log2FC mean in the default output. Supposedly it should be outputting ctl vs A in the default output. But the data is different when I have only ctl and A as input (2 groups). Also, is there a way to specify how the program performs the comparison? I read some previous threads but the answers were not very clear. Any help will be highly appreciated!

Best,

Wenhan

Below is the code I used:

countdata <- read.table("B6_B6MHV_B6MHVWY.txt", header=TRUE, row.names=1)
countdata <- countdata[ ,6:ncol(countdata)]
colnames(countdata) <- gsub("\\.[sb]am$", "", colnames(countdata))
countdata <- as.matrix(countdata)
head(countdata)
(condition <- factor(c(rep("ctl", 3), rep("A", 3), rep("B", 3), rep("C", 3))))
library(DESeq2)
(coldata <- data.frame(row.names=colnames(countdata), condition))
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~condition)
dds
dds <- DESeq(dds)
# Plot
dispersions
png("qc-dispersions.png", 2000, 2000, pointsize=20)
plotDispEsts(dds, main="Dispersion plot")
dev.off()

# Regularized log transformation for clustering/heatmaps, etc
rld <- rlogTransformation(dds)
head(assay(rld))
hist(assay(rld))

# Colors for plots below
## Ugly:
## (mycols <- 1:length(unique(condition)))
## Use RColorBrewer, better
library(RColorBrewer)
(mycols <- brewer.pal(8, "Dark2")[1:length(unique(condition))])

# Sample distance heatmap
sampleDists <- as.matrix(dist(t(assay(rld))))
library(gplots)
png("qc-heatmap-samples.png", w=1500, h=2500, pointsize=1500)
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
          col=colorpanel(100, "black", "white"),
          ColSideColors=mycols[condition], RowSideColors=mycols[condition],
          margin=c(10, 10), main="Sample Distance Matrix")
dev.off()

# Principal components analysis
## Could do with built-in DESeq2 function:
## DESeq2::plotPCA(rld, intgroup="condition")
## I like mine better:
rld_pca <- function (rld, intgroup = "condition", ntop = 500, colors=NULL, legendpos="bottomleft", main="PCA Biplot", textcx=1, ...) {
  require(genefilter)
  require(calibrate)
  require(RColorBrewer)
  rv = rowVars(assay(rld))
  select = order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
  pca = prcomp(t(assay(rld)[select, ]))
  fac = factor(apply(as.data.frame(colData(rld)[, intgroup, drop = FALSE]), 1, paste, collapse = " : "))
  if (is.null(colors)) {
    if (nlevels(fac) >= 3) {
      colors = brewer.pal(nlevels(fac), "Paired")
    }   else {
      colors = c("black", "red")
    }
  }
  pc1var <- round(summary(pca)$importance[2,1]*100, digits=1)
  pc2var <- round(summary(pca)$importance[2,2]*100, digits=1)
  pc1lab <- paste0("PC1 (",as.character(pc1var),"%)")
  pc2lab <- paste0("PC1 (",as.character(pc2var),"%)")
  plot(PC2~PC1, data=as.data.frame(pca$x), bg=colors[fac], pch=21, xlab=pc1lab, ylab=pc2lab, main=main, ...)
  with(as.data.frame(pca$x), textxy(PC1, PC2, labs=rownames(as.data.frame(pca$x)), cex=textcx))
  legend(legendpos, legend=levels(fac), col=colors, pch=20)
  #     rldyplot(PC2 ~ PC1, groups = fac, data = as.data.frame(pca$rld),
  #            pch = 16, cerld = 2, aspect = "iso", col = colours, main = draw.key(key = list(rect = list(col = colours),
  #                                                                                         terldt = list(levels(fac)), rep = FALSE)))
}
png("qc-pca.png", 1500, 1500, pointsize=25)
rld_pca(rld, colors=mycols, intgroup="condition", xlim=c(-75, 35))
dev.off()

# Get differential expression results
res <- results(dds)
table(res$padj<0.05)
## Order by adjusted p-value
res <- res[order(res$padj), ]
## Merge with normalized count data
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
head(resdata)
## Write results
write.csv(resdata, file="diffexpr-results.csv")
RNA-Seq Deseq2 • 32k views
ADD COMMENT
22
Entering edit mode
5.9 years ago

Your problem may be in setting the reference level for your condition factor.

Here is a reproducible example for anyone to follow:

# create random data of 12 samples and 4800 genes
x <- round(matrix(rexp(480 * 10, rate=.1), ncol=12), 0)
rownames(x) <- paste("gene", 1:nrow(x))
colnames(x) <- paste("sample", 1:ncol(x))

# fake metadata
coldata <- data.frame(
  condition = factor(c(
    rep("ctl", 3),
    rep("A", 3),
    rep("B", 3),
    rep("C", 3))))

# set `ctl` as reference level
coldata$condition <- relevel(coldata$condition, ref = "ctl")

# run DESeq2
library(DESeq2)
dds <- DESeqDataSetFromMatrix(
  countData = x,
  colData = coldata,
  design= ~ condition)
dds <- DESeq(dds, betaPrior = FALSE)

resultsNames(dds)
[1] "Intercept"          "condition_A_vs_ctl" "condition_B_vs_ctl"
[4] "condition_C_vs_ctl"

# generate results table for A vs ctl
res <- results(dds, name="condition_A_vs_ctl")

# same as above but with lfc shrinkage
res <- lfcShrink(dds, coef = 'condition_A_vs_ctl', type = 'apeglm', res = res)
res
log2 fold change (MAP): condition A vs ctl 
Wald test p-value: condition A vs ctl 
DataFrame with 400 rows and 5 columns
                 baseMean      log2FoldChange             lfcSE
                <numeric>           <numeric>         <numeric>
gene 1   6.52931993577912   0.072476893785378 0.357142842617753
gene 2   5.76235176928007 -0.0933872559160501 0.363373514145294
gene 3   6.50856065196759 -0.0204496319490765 0.345984423938138
gene 4   8.22529367562065  -0.214564268854677 0.443132280960639
gene 5   5.79513765732215  -0.124262442880464 0.378075429134928
...                   ...                 ...               ...
gene 396 6.31562199001634   0.216432088296947 0.433285297905755
gene 397 7.06387542770804 -0.0514375974511503 0.347860712050382
gene 398 11.7113061577201 0.00284345055606203 0.345680871683025
gene 399 6.49584500337424 -0.0476168389832278 0.345546422494701
gene 400  9.9964224972449  -0.163803486191863 0.398777125778805
                     pvalue              padj
                  <numeric>         <numeric>
gene 1    0.483733657727805 0.868376605216856
gene 2     0.37470748747808 0.833726100572576
gene 3    0.849966094025773 0.980516244268461
gene 4   0.0516046628774167 0.562928133989655
gene 5    0.228734151969893 0.786123281264348
...                     ...               ...
gene 396 0.0892978383641832 0.562928133989655
gene 397  0.649120107927275 0.943542041143732
gene 398  0.983171679048765 0.997357437589209
gene 399  0.684227860534656 0.953646890096339
gene 400  0.139850562852436 0.647215395526392

You can also compare other levels, e.g., A vs C

results(dds, c("condition", "A", "C"))

See the DESeq2 vignette for all other possible functionality.

Kevin

ADD COMMENT
0
Entering edit mode

Hello Kevin, I am using your example it is working well. I have few question from my data: 1. How can I remove batch effect from my data with DESEQ2. 2. I know the VST function available in DESEQ2 for remove batch effect, but after this step how can we do DE analysis from the adjusted data. Hope you can understand my problem.

ADD REPLY
0
Entering edit mode

Hey, there is a function called removeBatchEffects (limma), which may be of use to you. There is a practical example given here with DESeq2: https://support.bioconductor.org/p/76099/#93176

It is neither VST nor rlog that are removing batch, technically speaking.

ADD REPLY
0
Entering edit mode

Sorry Kevin I am confused with data I have, can you please have a look

I have gene expression raw counts like this

          A1 B1   C1     D1  E1  F1
A2M     511 1623 665    208 553 469
AADAT   34  137 372     7   52  124
ABCB1   119 114 123    22   25  89
ABCB11  27  186 200    27   30  49
ABCC2   1   10  21     1    8   3

For each patients I also have some characteristics

> head(meta)
    a b c d d
A1  3     1  3    0          2
B1  2     1  0    0          1
C1  2     1  0    0          2
D1  0     0  0    0          2
E1  1     0  0    1          3
F1  0     0  0    0          1
>

I want differentially expressed genes, but I can not figure out how to let all of these meta data in my DEseq2 design For example differential expression which takes all of these variables into calculation

Can you help me please?

Thank you

ADD REPLY
2
Entering edit mode

The DESeq2 design formula behaves much the same as standard regression formulae. So, if you want to include these, you just need to add them like this:

~ IC + PD.L1 + TC + TILs + TILs.group

That assumes that the effects of these are 'additive'. It is possible to have interactions and multiplicative effects, too, but these are more complex. If you are unsure, then it is better to keep the design formula very simple.

You may want to consider, for example, the fact that some of these columns are describing virtually the same trait / characteristic, in which case their usage in a model must be considered.

ADD REPLY
0
Entering edit mode

Thank you

I done so

> head(meta)
   IC PD.L1 TC TILs TILs.group
A1  3     1  3    0          2
B1  2     1  0    0          1
C1  2     1  0    0          2
D1  0     0  0    0          2
E1  1     0  0    1          3
F1  0     0  0    0          1
> 

> IC=as.factor(c(rep ("0",3), rep("1", 1),rep("2", 2),rep("3", 2)))
> PD.L1=as.factor(c(rep ("High",4), rep("Low", 4)))
> TC=as.factor(c(rep ("0",6), rep("1", 1),rep("3", 1)))
> TILs=as.factor(c(rep ("0",6), rep("1", 2)))
> TILs.group=as.factor(c(rep ("1",3), rep("2", 3),rep("3", 2)))
> dds=DESeqDataSetFromMatrix(countData = data, colData = meta, design        =       ~ IC + PD.L1 + TC + TILs + TILs.group)
  the design formula contains a numeric variable with integer values,
  specifying a model with increasing fold change for higher values.
  did you mean for this to be a factor? if so, first convert
  this variable to a factor using the factor() function
> 
> dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
-- note: fitType='parametric', but the dispersion trend was not well captured by the
   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.
final dispersion estimates
fitting model and testing
> dds$IC
[1] 3 2 2 0 1 0 3 0
> res <-results (dds, contrast=c("IC", "0", "1","2","3"))
Error in checkContrast(contrast, resNames) : 
  'contrast', as a character vector of length 3, should have the form:
contrast = c('factorName','numeratorLevel','denominatorLevel'),
see the manual page of ?results for more information
> 
> res <-results (dds, contrast=c("IC", "0", "1"))
Error in if (!expanded & (hasIntercept | noInterceptPullCoef)) { : 
  argument is of length zero
> 
>
ADD REPLY
2
Entering edit mode

Sorry, please take some time to debug this on your own. I have no time to take you through your analysis step-by-step. If you have a supervisor, please contact her / him / it

ADD REPLY
0
Entering edit mode

Thank you, Kevin, but i noticed that

You can also compare other levels, e.g., A vs C

results(dds, c("condition", "A", "C"))

If compare "A" vs "C", lfcShrink( ) with type = "apeglm" will next be impossible to do due to the lack of corresponding coef parameter. Does this mean I need to regenerate a new dds with "C" as a reference?

ADD REPLY
1
Entering edit mode

Hey, in that case, you can go back and re-level the factors such that you can obtain the required coefficient, as to which I allude here: A: How to change conditions in DESeq2

It is also now mentioned briefly in the vignette:

Although apeglm cannot be used with contrast, we note that many designs can be easily rearranged such that what was a contrast becomes its own coefficient

ADD REPLY
0
Entering edit mode

I have same problem. i have 3 sample , one with WT and 2 others are treatment1 and treatment2. It is confusing for me when I try either approach and e.g. for WT vs treatment1 contrast i am getting deseq2 result diiferent ,

-dds2<- DESeq(dds, betaPrior = FALSE) -resultsNames(dds2) -res <- results(dds2, name="condition_treatment1_vs_WT", alpha = 0.05) -summary(res)

-out of 26151 with nonzero total read count adjusted p-value < 0.05 LFC > 0 (up) : 2463, 9.4% LFC < 0 (down) : 2982, 11% outliers [1] : 2, 0.0076% low counts [2] : 5070, 19% (mean count < 5) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results

-compared with running it alone through pipeline

-des <- DESeq(dds) -res <- results(des, alpha = 0.05) -summary(res)

-out of 24475 with nonzero total read count adjusted p-value < 0.05 LFC > 0 (up) : 2657, 11% LFC < 0 (down) : 3140, 13% outliers [1] : 0, 0% low counts [2] : 3322, 14% (mean count < 5) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results

Can anyone explain why and which approach is right i.e. DESeq 1 vs 1 comparison or multiple group comparison?

ADD REPLY

Login before adding your answer.

Traffic: 1489 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6