Question

Why does removing samples from one group in a DESeqDataSet change the DGE results for the other two groups?

4.0 years ago

prabin.dm ▴ 260

I have RNA-seq data from 4 wt, 10 het and 2 ko samples. I am trying to see whether having more het replicates leads to more DEGs in het vs ko than in wt vs ko. To test this, I take only 4 random het replicates from the DESeqDataSet and run the analysis multiple times.

In each run, the only difference is which hets are in the DESeqDataSet. All runs have the same wt and ko samples. Even so, my DEGs for wt vs ko differ between runs. How is that?

library(DESeq2)
library(dplyr)
library(purrr)

# sampleTable and getSig() are not shown in the post; both are assumed to be defined elsewhere.

# Drop 6 randomly chosen hets (keeping 4), rerun DESeq2 and count the DEGs per contrast.
random_het <- function(dds_lcr){
  # Embryo IDs of the 6 hets to remove, collapsed into a single regex
  minusHet <- sampleTable %>% filter(genotype == "KO.WT") %>% distinct(embryo) %>% slice_sample(n = 6) %>% pull(embryo) %>% paste0(collapse = "|")
  dds_4het <- dds_lcr[, -grep(minusHet, colnames(dds_lcr))]
  dde_4het <- DESeq(dds_4het)
  WtVsKo4het <- results(dde_4het, name = "genotype_WT_vs_KO", lfcThreshold = 0, alpha = 0.05)
  HetVsKo4het <- results(dde_4het, name = "genotype_KO.WT_vs_KO", lfcThreshold = 0, alpha = 0.05)
  # Number of significant genes for each contrast
  sum.data <- map(list(WtVsKo4het, HetVsKo4het), getSig, padj = 0.05) %>% map(nrow) %>% as.data.frame %>% set_names(c("WtVsKo4het", "HetVsKo4het"))
  list(sum.data, embryos = colData(dds_4het))
}

# Repeat the random subsampling n times and plot the DEG counts of the two contrasts.
randomise_plot <- function(n){
  # purrr::rerun() is deprecated in recent purrr releases, so map() over the run index instead
  randomiseHet <- map(seq_len(n), ~ random_het(dds_lcr))
  df_dge <- bind_rows(map(randomiseHet, 1))   # one row of DEG counts per run
  plot(df_dge)
  p1 <- recordPlot()
  dds_names <- map(randomiseHet, 2)           # colData of the samples kept in each run
  list(p1, dds_names)
}

randomise_plot(n = 10)

The plot shows that in two runs the number of DEGs is different for both comparisons. [plot]

RNA-Seq DESeq2 • 1.1k views

Updated 4.0 years ago by Carlo Yague • written 4.0 years ago by prabin.dm

score 4 · Accepted Answer · 2020-12-11

Even if you compare only WT and KO for differential expression, DESeq2 internally uses all samples in the dataset (including the hets, in your case) to estimate each gene's dispersion. Because dispersion is a measure of within-group variability, changing which samples are included in a group (here, the hets) changes the dispersion estimate, which in turn changes the confidence with which a gene is called differentially expressed.

To give you an example, consider the two following situations:

# expression for gene x (CASE A)
wt_1     100
wt_2     110
KO_1     200
KO_2     205
het_1    100
het_2    1000

# expression for gene x (CASE B)
wt_1     100
wt_2     110
KO_1     200
KO_2     205
het_3    200
het_4    200

In the first case, the dispersion estimate will be very high because the within-group variability is huge in the het group. Therefore, the gene will not be called significantly differentially expressed, even between wt and ko (within-group variability >> between-group variability). In the second case, it most likely will be differentially expressed, because the within-group variability and the dispersion estimate are much smaller.
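To make the two cases concrete, here is a rough base-R sketch of the same reasoning. It is not DESeq2's actual estimator (DESeq2 fits the dispersion by likelihood and shrinks it toward a fitted trend); it assumes equal size factors and uses a crude method-of-moments dispersion shared across the three groups. The function names `pooled_dispersion` and `wald_z_wt_vs_ko` are made up for this illustration.

```r
# Toy counts for gene x, copied from CASE A and CASE B above.
groups <- c("wt", "wt", "ko", "ko", "het", "het")
case_a <- data.frame(group = groups, count = c(100, 110, 200, 205, 100, 1000))
case_b <- data.frame(group = groups, count = c(100, 110, 200, 205, 200, 200))

# Crude method-of-moments dispersion (an assumption, not DESeq2's method):
# under a negative binomial model var = mu + alpha * mu^2, so within each
# group alpha = (var - mu) / mu^2. The per-group values are averaged into
# one gene-wide estimate and floored at 0.
pooled_dispersion <- function(d) {
  by_group <- split(d$count, d$group)
  mu <- sapply(by_group, mean)
  v  <- sapply(by_group, var)
  max(mean((v - mu) / mu^2), 0)
}

# Approximate Wald z statistic for the wt vs ko log fold change.
# The hets enter only through the shared dispersion estimate.
wald_z_wt_vs_ko <- function(d) {
  alpha <- pooled_dispersion(d)
  by_group <- split(d$count, d$group)
  mu <- sapply(by_group, mean)
  n  <- sapply(by_group, length)
  lfc <- log(mu[["wt"]] / mu[["ko"]])
  se  <- sqrt((1 / mu[["wt"]] + alpha) / n[["wt"]] +
              (1 / mu[["ko"]] + alpha) / n[["ko"]])
  lfc / se
}

disp_a <- pooled_dispersion(case_a)
disp_b <- pooled_dispersion(case_b)
z_a <- wald_z_wt_vs_ko(case_a)
z_b <- wald_z_wt_vs_ko(case_b)

print(c(disp_a = disp_a, disp_b = disp_b))
print(c(z_a = z_a, z_b = z_b))
```

The wt and ko counts are identical in both cases, so the log fold change is the same; only the het replicates differ. Under this simplified model the noisy hets in case A inflate the shared dispersion enough that the wt vs ko statistic falls below the usual 1.96 cutoff, while in case B it stays far above it.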

For those reasons, it is best to include all samples in the differential expression analysis (unless there is a clear outlier/artifact). This usually provides the best dispersion estimate.