Question

RNA-Seq analysis with DESeq2. Confusing results.

0

Entering edit mode

10.4 years ago

Armin • 0

Hi,

I analyze RNA-Seq data from three sample groups (non-pathology (5 samples), mild pathology (7 samples), extreme pathology (5 samples)). My goal is to find differentially expressed genes between non and mild, non and extreme, and mild and extreme. Read counts were prepared using HTSeq. Then I do exactly the same steps as described in the DESeq2 manual.

> library("DESeq2")
> directory <- "my_path"
> sampleFiles <- grep("pathol",list.files(directory),value=TRUE)
> sampleCondition <- sub("(.*pathol).*","\\1",sampleFiles)
> sampleTable <- data.frame(sampleName = sampleFiles,fileName = sampleFiles,condition = sampleCondition)
> ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,directory = directory,design= ~ condition)
> ddsHTSeq$condition <- factor(ddsHTSeq$condition,levels=c("nonpathol","extpathol"))
> ddsHTSeq$condition <- relevel(ddsHTSeq$condition, "nonpathol")
> ddsHTSeq<- DESeq(ddsHTSeq)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting generalized linear model
> res <- results(ddsHTSeq)

I perform this procedure for all three comparisons and in each separate comparison get seemingly biologically meaningful results. However, the problem is following:

Non vs Mild gives me 15 d.e. genes (padj <0.05)

Non vs Extreme - 1,249 d.e. genes

but

Mild vs. Extreme - only 16 d.e. genes!

Basic logic tells me that IF Non group differs from Extreme group by many hundreds genes AND the same Non group almost doesn't differ from Mild group THEN Mild group also has to differ from Extreme by many genes.

Is such logic wrong? If not then I don't get where could I make any mistake in the analysis.

RNA-Seq differential-expression deseq2 • 5.8k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.4 years ago by Armin • 0

Ram · Answer 1 · 2014-07-11

3

Entering edit mode

10.4 years ago

Devon Ryan 104k

Your "logic" is completely incorrect, in fact it reminds me of one of Zeno's paradoxes.

An R example (values generated with rnorm):

a <- c(0.9964361, 1.2685663, 0.5132198, -1.6368091, 3.0478775)
b <- c(3.7843679, 3.4882500, 0.1275235, 3.2042165, 1.9144505)
c <- c(4.205894, 3.270352, 4.922832, 4.107070, 3.656648)

If one performs the pair-wise T-tests, one finds that a vs. b and b vs. c are insignificant, while a vs. c is significant. By your reasoning, that should be an impossibility, but it's certainly not.

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.4 years ago by Devon Ryan 104k

0

Entering edit mode

I already asked this question on SeqAnswers but can I ask you here too. In which function can I use the argument minReplicatesForReplace? Both DESeq() and results() produce the following error message:

Error in results(ddsHTSeq, minReplicatesForReplace = 5) : 
  unused argument (minReplicatesForReplace = 5)

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Armin • 0

0

Entering edit mode

It should work with the DESeq() function. What version are you using?

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Devon Ryan 104k

0

Entering edit mode

DESeq2_1.0.19

command:

ddsHTSeq<- DESeq(ddsHTSeq, minReplicatesForReplace = 5)

Here is my sessionInfo

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] DESeq2_1.0.19             RcppArmadillo_0.4.300.8.0
[3] Rcpp_0.11.2               lattice_0.20-29
[5] Biobase_2.20.1            GenomicRanges_1.12.5
[7] IRanges_1.18.4            BiocGenerics_0.6.0

loaded via a namespace (and not attached):
 [1] annotate_1.38.0      AnnotationDbi_1.22.6 DBI_0.2-7
 [4] genefilter_1.42.0    grid_3.0.1           locfit_1.5-9.1
 [7] RColorBrewer_1.0-5   RSQLite_0.11.4       splines_3.0.1
[10] stats4_3.0.1         survival_2.37-7      tools_3.0.1
[13] XML_3.98-1.1         xtable_1.7-3

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Armin • 0

0

Entering edit mode

You're using the version from Bioconductor 2.12, which is fairly outdated and that option didn't exist yet back then. You should just update the libraries on your local installation.

ADD REPLY • link 10.4 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you, Devon!

ADD REPLY • link 10.4 years ago by Armin • 0

0

Entering edit mode

I link your answer !

ADD REPLY • link 8.5 years ago by jimmy_zeng ▴ 90

Ram · Answer 2 · 2014-07-11

1

Entering edit mode

10.4 years ago

Damian Kao 16k

Genes that are not significantly (whatever threshold you choose) differentially expressed aren't necessarily equally expressed. It just means your variance is too high to determine whether these genes are differentially expressed.

Let's say your mild sample replicates vary highly compare to your non and extreme sample replicates. When you perform the differential expression test between mild and non/extreme, you would not be confident in whether the difference is due to variance or real effect.

However, you are using a generalized linear model for your DE. I am not very familiar with how that works in relation to differential expression. But I think that's supposed to take into account different variances.

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Damian Kao 16k

0

Entering edit mode

Stated somewhat differently, non-significance is not equivalence.

ADD REPLY • link 10.4 years ago by Devon Ryan 104k

1

Entering edit mode

Yeah that first sentence was pretty confusing. I changed it.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Damian Kao 16k

0

Entering edit mode

Thank you for explanation, Devon and Damian! Also, I found another possible reason that adds to such results. In comparisons with Mild many genes were set to NA due to outliers.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Armin • 0