generate raw counts from NB, create list object

Question

EdgeR exactTest help

1

Entering edit mode

4.2 years ago

jmor ▴ 10

Hello, I am trying to test edgeR's exactTest function using simulated data but it is showing very poor performance. The code below has 2 very different sample means but the PValue I'm getting from edgeR is around 0.75. Is there something I'm missing or is the exactTest not very good? Thanks

x3=rnbinom(5000, size=5, mu=50000)
x4=rnbinom(5000, size=5, mu=500)

countsvec=matrix(c(x3,x4),1,length(x3)+length(x4))
groupvec=c(rep(1,length(x3)),rep(2,length(x4)))

y<-DGEList(counts=countsvec, group=groupvec, remove.zeros=TRUE)
y<-estimateDisp(y)
et<-exactTest(y)
et$table$PValue

R rna-seq RNA-Seq differential expression edgeR • 2.2k views

ADD COMMENT • link updated 4.2 years ago by Gordon Smyth ★ 7.8k • written 4.2 years ago by jmor ▴ 10

0

Entering edit mode

It seems to me that you are simulating 1 gene and 10000 samples, is that what you really want? It's an extremely unusual situation.

ADD REPLY • link 4.2 years ago by dariober 15k

0

Entering edit mode

Thanks for reading my post. That is correct, I am looking at the case where I have expression levels of 1 gene from 10000 different samples broken into 2 groups. Does the method not work well for that case?

ADD REPLY • link 4.2 years ago by jmor ▴ 10

score 1 · Answer 1 · 2020-11-03

1

Entering edit mode

4.2 years ago

Gordon Smyth ★ 7.8k

You actually haven't simulated any differential expression. You have set the library sizes to be the same as the gene counts, so the gene has exactly the same expression level in every sample. So the non-significant result from exactTest is correct.

ADD COMMENT • link 4.2 years ago by Gordon Smyth ★ 7.8k

0

Entering edit mode

Thanks for the answer. So how should I go about simulating data where I have 1 gene with 10000 samples from 2 different groups? I was mostly following the example from the EdgeR Manual when doing the simulation. I included the example below for completeness.



generate raw counts from NB, create list object

y <- matrix(rnbinom(80,size=1/0.2,mu=10),nrow=20,ncol=4)
d <- DGEList(counts=y, group=c(1,1,2,2), lib.size=rep(1000,4))
de <- exactTest(d, dispersion=0.2)
topTags(de)

ADD REPLY • link 4.2 years ago by jmor ▴ 10

0

Entering edit mode

You omitted a crucial part of the edgeR example. I suggest you read further to understand what library sizes are and how they enter into measures of gene expression.

Simulating 1 gene for 10,000 samples is a very strange thing to do, but is easy enough. Once you set the library sizes, you get an astronomically small p-value for your data because it is so lopsided between the groups:

> y<-DGEList(counts=countsvec, group=groupvec, lib.size=rep(1e7,10000))
> y<-estimateDisp(y)
Design matrix not provided. Switch to the classic mode.
> et<-exactTest(y)
> topTags(et)
Comparison of groups:  2-1 
      logFC   logCPM PValue FDR
1 -6.671313 11.31661      0   0

ADD REPLY • link 4.2 years ago by Gordon Smyth ★ 7.8k