Why does removing samples from one group in a DESeqDataSet change the DGE of the other two groups?
4.0 years ago
prabin.dm ▴ 260

I have RNA-seq data from 4 wt, 10 het and 2 ko samples. I want to see whether having more replicates in the het group leads to more DEGs in het vs ko than in wt vs ko. So I take only 4 random het replicates from the DESeqDataSet and run the analysis multiple times.

In each run, the only difference is which hets are in the DESeqDataSet. All runs have the same wt and ko samples. Even so, my DEGs for wt vs ko differ between runs. How is that?

# Drop 6 randomly chosen hets (leaving 4), rerun DESeq2 and count DEGs per comparison.
# sampleTable and getSig() are defined elsewhere in my script.
random_het <- function(dds_lcr) {
  # Regex matching the 6 het embryos to remove
  minusHet <- sampleTable %>%
    filter(genotype %in% "KO.WT") %>%
    distinct(embryo) %>%
    slice_sample(n = 6) %>%
    pull(embryo) %>%
    paste0(collapse = "|")
  dds_4het <- dds_lcr[, -grep(minusHet, colnames(dds_lcr))]
  dde_4het <- DESeq(dds_4het)
  WtVsKo4het <- results(dde_4het, name = "genotype_WT_vs_KO", lfcThreshold = 0, alpha = 0.05)
  HetVsKo4het <- results(dde_4het, name = "genotype_KO.WT_vs_KO", lfcThreshold = 0, alpha = 0.05)
  # Number of significant genes in each comparison
  sum.data <- map(list(WtVsKo4het, HetVsKo4het), getSig, padj = 0.05) %>%
    map(nrow) %>%
    as.data.frame() %>%
    set_names(c("WtVsKo4het", "HetVsKo4het"))
  list(sum.data, embryos = colData(dds_4het))
}

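randomise_plot <- function(n) {
  # purrr::rerun() is deprecated since purrr 1.0.0; replicate() is a base R equivalent
  randomiseHet <- replicate(n, random_het(dds_lcr), simplify = FALSE)
  # One row per run, one column per comparison
  df_dge <- do.call(rbind, lapply(randomiseHet, "[[", 1))
  plot(df_dge)
  p1 <- recordPlot()
  # Sample tables showing which embryos were kept in each run
  dds_names <- lapply(randomiseHet, "[[", 2)
  list(p1, dds_names)
}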

randomise_plot(n=10)

The plot shows that the number of DEGs differs between runs for both comparisons.

RNA-Seq DESeq2 • 1.1k views
4.0 years ago

Even if you compare only WT and KO for differential expression, DESeq2 uses all samples in the DESeqDataSet (including the hets, in your case) to estimate dispersion. It fits a single dispersion per gene across all groups. As dispersion is a measure of within-group variability, changing which samples are included in one group (the hets) changes the dispersion estimate, which in turn changes the confidence with which a gene is called differentially expressed.

To give you an example, consider the two following situations:

# expression for gene x (CASE A)
wt_1     100
wt_2     110
KO_1     200
KO_2     205
het_1    100
het_2    1000

# expression for gene x (CASE B)
wt_1     100
wt_2     110
KO_1     200
KO_2     205
het_3    200
het_4    200

In the first case, the dispersion estimate will be very high because the within-group variability is huge in the het group. Therefore, the gene will not be significantly differentially expressed, even between the wt and ko conditions (within-group variability >>> between-group variability). In the second case, it will most likely be differentially expressed, because the within-group variability and the dispersion estimate are much smaller.
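To put rough numbers on the two cases, here is a minimal base R sketch. It uses a simple method-of-moments estimate (variance = mean + dispersion × mean²) averaged over the groups. This is an assumption made for illustration only, not DESeq2's actual estimator, which uses a likelihood-based fit with shrinkage towards a trend. The helper name `mom_dispersion` is hypothetical.

```r
# Rough per-gene dispersion, pooled across groups (illustration only,
# NOT what DESeq2 computes internally).
# Negative binomial: var = mu + alpha * mu^2  =>  alpha = (var - mu) / mu^2
mom_dispersion <- function(counts, group) {
  per_group <- tapply(counts, group, function(x) {
    mu <- mean(x)
    max((var(x) - mu) / mu^2, 0)  # clamp negative estimates to 0
  })
  mean(per_group)  # one dispersion value for the gene, shared by all groups
}

group  <- factor(rep(c("wt", "ko", "het"), each = 2))
case_a <- c(100, 110, 200, 205, 100, 1000)  # hets disagree strongly
case_b <- c(100, 110, 200, 205, 200, 200)   # hets agree

disp_a <- mom_dispersion(case_a, group)
disp_b <- mom_dispersion(case_b, group)
print(c(case_A = disp_a, case_B = disp_b))
```

The wt and ko counts are identical in both cases, yet the pooled value for case A is much larger than for case B, purely because of the hets. Since the wt vs ko test uses that shared per-gene value, its outcome changes too.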

For those reasons, it is best to include all samples in the differential expression analysis (unless there is a clear outlier/artifact). This usually provides the best dispersion estimate.


Thank you. That makes perfect sense.
