Edger Results Without Replicates, Fdr Looks Unnormal
1
3
Entering edit mode
11.0 years ago
xiaojuhu13 ▴ 150

I only get two samples without replicates for the edgeR analysis,but the results look unnormal,I set the common dispersion value equal to 0.4,most FDR equal to 1.

> raw.data<-read.table(file="48_50_1",header=T)
> d<-raw.data[,2:3]
> rownames(d)<-raw.data[,1]
> group<-factor(c("L","H"))
> design<-model.matrix(~group)
> d<-DGEList(counts=d,group=group)
> dim(d)
[1] 22928     2
> d <- calcNormFactors(d)
> keep <- rowSums(cpm(d)>100) >= 2
> d <- d[keep,]
> dim(d)
[1] 184   2
> d<-estimateGLMCommonDisp(d,design,method="deviance",robust="TRUE",subset=NULL)
Warning message:
In estimateGLMCommonDisp.default(y = y$counts, design = design,  :
  No residual df: setting dispersion to NA
> d$samples$lib.size <- colSums(d$counts)
> d
An object of class "DGEList"
$counts
        low high
AATK     45   77
ABCB1    56  120
ARHGAP4 261  383
C3      920  628
COL20A1  26   33
179 more rows ...

$samples
     group lib.size norm.factors
low      L    62017     1.046646
high     H    83398     0.955433

$AveLogCPM
[1]  9.742665 10.243595 12.110485 13.423755  8.758667
179 more elements ...

$common.dispersion
[1] NA

$design
  (Intercept) groupL
1           1      1
2           1      0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"

> d$common.dispersion=0.4
> et <- exactTest(d)
> top<-topTags(et)
> top
Comparison of groups:  L-H 
                 logFC    logCPM      PValue       FDR
LOC101119889 -4.609729 10.669772 0.004828116 0.8883734
LOC101112298  1.983919  8.613781 0.201633226 1.0000000
LOC101121082 -1.577096 12.194037 0.257687080 1.0000000
PER2          1.556200 11.766520 0.265899128 1.0000000
GPT          -1.374615 12.820088 0.321135663 1.0000000
LOC101121333 -1.354515 10.613348 0.336138933 1.0000000
LOC101103050 -1.416573  8.305962 0.355318326 1.0000000
LOC101112300  1.420906  7.982203 0.380212348 1.0000000
LOC101110133 -1.174812 10.631434 0.406782402 1.0000000
LOC101108558 -1.141834 12.802543 0.407262691 1.0000000
> summary(de <- decideTestsDGE(et))
   [,1]
-1    0
0   184
1     0
> plotMDS(d)
Error in plotMDS.DGEList(d) : Need at least 3 columns of data

This is what the sessionInfo() gives

> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.2.4  limma_3.16.8
edger • 12k views
ADD COMMENT
0
Entering edit mode

It was difficult for me to read the data you pasted. However, it is not uncommon that most FDR values are equal to one. It depends on the number of tests and on pvalues. You can find details on computation of FDR here A: How to estimate the false discovery rate (FDR) using bootstrap

ADD REPLY
0
Entering edit mode

I'm sorry, I'm just a newbie to edgeR. Because there are no replicates, after the step (d<-estimateGLMCommonDisp(d,design,method="deviance",robust="TRUE",subset=NULL) I don't get the dispersion value, so I set it handly to 0.4 as suggested by edgeR manual.But I get only 10 differential expression genes, basically no significant one gene. But the cufflinks results have 69 significant genes. So my question is that if there are some problems with this edgeR analysis.

ADD REPLY
1
Entering edit mode

I'm a newbie to edgeR, too. However, after a careful reading of your code, I think there might be a little misunderstanding of the manual, which led to your problem. As shown in code below, you directly set the dispersion value to 0.4,which actually should be given to BCV , since dispersion value is equal to BCV^2. In this way, your dispersion value is equal to 0.16, which will guarantee you a list of more DE genes.

> d$common.dispersion=0.4
> et <- exactTest(d)

According to solutions offered in chapter 2.11"What to do if you have no replicates", the standard way to manually set dispersion value includes 2 steps(,code below is copied directly from the manual,option 2):

> bcv <- 0.2
> counts <- matrix( rnbinom(40,size=1/bcv^2,mu=10), 20,2)
> y <- DGEList(counts=counts, group=1:2)
> et <- exactTest(y, dispersion=bcv^2)

In your case, you can try adding the first & last line into your script, and see if a larger number of DE genes are detected. Hope this can help.

ADD REPLY
0
Entering edit mode
11.0 years ago

So what exactly is so "unnormal" about the results? Without replicates, it'd be odd to find much of anything.

ADD COMMENT
0
Entering edit mode

I'm using the same data with cufflinks, DESeq, edgeR to find the differential expression. The results I find the latter two have similar results,but totally don't overlap with cufflinks. So I think it may be edgeR problems, but I don't know the reason.

ADD REPLY
1
Entering edit mode

I wouldn't expect any of the programs to have completely overlapping results, they use different algorithms. Why are you bothering doing a comparison when you have no replicates. The results are only questionably meaningful to begin with. I would not trust cuffdiff results if they're much different from those of DESeq.

ADD REPLY
0
Entering edit mode

Thanks, but I just get two samples without for practice. And I'd like to know if the alignment rate affects the differential expression results. I used tophat 2.0.8b, and checked the tophat/logs [huxj@LoginNode logs]$cat bowtie.left_kept_reads.log 43476250 reads; of these: 43476250 (100.00%) were unpaired; of these: 18745980 (43.12%) aligned 0 times 19250394 (44.28%) aligned exactly 1 time 5479876 (12.60%) aligned >1 times 56.88% overall alignment rate

ADD REPLY
0
Entering edit mode

This is single-ended then? That would seem to be a low overall alignment rate. Do you have a reference annotation or are you stuck aligning only to the genome? You'll generally get better results with an annotation file (GTF or GFF).

ADD REPLY
0
Entering edit mode

Paired-ended data. But I'm not sure tophat results (logs file ) are a right way to check the alignment rate. Somebody said we should use the flagstat bam file to check, if in that way, the alignment rate is higher than 70%. How do you check the alignment rate?

ADD REPLY
0
Entering edit mode

Yeah, you're better off using flagstat.

ADD REPLY
0
Entering edit mode

OK, thanks. So I at least realise that the alignment rate is not the problem.

ADD REPLY

Login before adding your answer.

Traffic: 1946 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6