EdgeR exactTest help
1
1
Entering edit mode
4.2 years ago
jmor ▴ 10

Hello, I am trying to test edgeR's exactTest function using simulated data but it is showing very poor performance. The code below has 2 very different sample means but the PValue I'm getting from edgeR is around 0.75. Is there something I'm missing or is the exactTest not very good? Thanks

x3=rnbinom(5000, size=5, mu=50000)
x4=rnbinom(5000, size=5, mu=500)

countsvec=matrix(c(x3,x4),1,length(x3)+length(x4)) groupvec=c(rep(1,length(x3)),rep(2,length(x4)))

y<-DGEList(counts=countsvec, group=groupvec, remove.zeros=TRUE) y<-estimateDisp(y) et<-exactTest(y) et$table$PValue

R rna-seq RNA-Seq differential expression edgeR • 2.2k views
ADD COMMENT
0
Entering edit mode

It seems to me that you are simulating 1 gene and 10000 samples, is that what you really want? It's an extremely unusual situation.

ADD REPLY
0
Entering edit mode

Thanks for reading my post. That is correct, I am looking at the case where I have expression levels of 1 gene from 10000 different samples broken into 2 groups. Does the method not work well for that case?

ADD REPLY
1
Entering edit mode
4.2 years ago
Gordon Smyth ★ 7.8k

You actually haven't simulated any differential expression. You have set the library sizes to be the same as the gene counts, so the gene has exactly the same expression level in every sample. So the non-significant result from exactTest is correct.

ADD COMMENT
0
Entering edit mode

Thanks for the answer. So how should I go about simulating data where I have 1 gene with 10000 samples from 2 different groups? I was mostly following the example from the EdgeR Manual when doing the simulation. I included the example below for completeness.

generate raw counts from NB, create list object

y <- matrix(rnbinom(80,size=1/0.2,mu=10),nrow=20,ncol=4) d <- DGEList(counts=y, group=c(1,1,2,2), lib.size=rep(1000,4)) de <- exactTest(d, dispersion=0.2) topTags(de)

ADD REPLY
0
Entering edit mode

You omitted a crucial part of the edgeR example. I suggest you read further to understand what library sizes are and how they enter into measures of gene expression.

Simulating 1 gene for 10,000 samples is a very strange thing to do, but is easy enough. Once you set the library sizes, you get an astronomically small p-value for your data because it is so lopsided between the groups:

> y<-DGEList(counts=countsvec, group=groupvec, lib.size=rep(1e7,10000))
> y<-estimateDisp(y)
Design matrix not provided. Switch to the classic mode.
> et<-exactTest(y)
> topTags(et)
Comparison of groups:  2-1 
      logFC   logCPM PValue FDR
1 -6.671313 11.31661      0   0
ADD REPLY

Login before adding your answer.

Traffic: 2502 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6