Question

edgeR: Very Low P-Value and Very High Variance Within the Group of Replicates. What's My Problem?

2

12.2 years ago

valentina ▴ 60

I'm using edgeR to perform a differential expression analysis on data from an RNA-seq experiment.

I have 6 tumor cell samples, all from the same tumor type and the same treatment: 3 patients with good prognosis and 3 patients with bad prognosis. I want to compare gene expression between the two groups.

I ran the edgeR package as follows:

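library(edgeR)
# Raw read counts: one row per gene, one column per sample
x <- read.delim("my_reads_count.txt", row.names="GENE")
# Two groups of three patients each
group <- factor(c(1,1,1,2,2,2))
y <- DGEList(counts=x,group=group)
# Normalization factors, then common and tagwise dispersion estimates
y <- calcNormFactors(y)
y <- estimateCommonDisp(y)
y <- estimateTagwiseDisp(y)
# Exact test for differences between the two groups
et <- exactTest(y)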

I obtained very odd results: in some cases I got a very low p-value and FDR, but looking at the raw data it seems obvious that the difference between the two groups can't be significant. This is an example from my_reads_count.txt:

GENE sample1_1 sample1_2 sample1_3 sample2_1 sample2_2 sample2_3    
ENSG00000198842    0    3    2    2    6666    3
ENSG00000257017    3    3    25    2002    29080    4

And for my_edgeR_resulta.txt:

GENE             logFC         logCPM        PValue        FDR
ENSG00000198842  9.863211e+00  5.4879462930  5.368843e-07  1.953612e-04
ENSG00000257017  9.500927e+00  7.7139869397  8.072384e-10  7.171947e-07

I would like the variance within each group to be taken into account. Can anyone help me? Any suggestions?

edger rna-seq differential-expression expression • 6.1k views

ADD COMMENT • link updated 12.2 years ago by Steve Lianoglou 5.2k • written 12.2 years ago by valentina ▴ 60

0

Is your raw data normalized?

ADD REPLY • link 12.2 years ago by Damian Kao 16k

0

The raw data are the counts of reads mapping within exons (obtained by running htseq-count). The normalization is performed with calcNormFactors(y). Am I correct?

ADD REPLY • link 12.2 years ago by valentina ▴ 60

score 1 · Answer 1 · 2013-06-18

1

12.2 years ago

Steve Lianoglou 5.2k

The variance is considered, but your signal is apparently a lot higher than the variance.

These two genes have monster levels of expression in your "Group 2": you're looking at a locus with fewer than 10 reads in group 1, and thousands to tens of thousands of reads in group 2.
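
To see how the counts behind that contrast are spread across the replicates, here is a minimal base-R sketch that summarises the two genes from the question by group. The variable names are illustrative, and the group labels simply follow the 1,1,1,2,2,2 coding used in the question.

```r
# Counts for the two genes shown in the question.
counts <- rbind(
  ENSG00000198842 = c(0, 3, 2, 2, 6666, 3),
  ENSG00000257017 = c(3, 3, 25, 2002, 29080, 4)
)
colnames(counts) <- c("sample1_1", "sample1_2", "sample1_3",
                      "sample2_1", "sample2_2", "sample2_3")
group <- factor(c(1, 1, 1, 2, 2, 2))

# Per-gene mean and median of the raw counts within each group.
# apply() returns groups x genes, so transpose to genes x groups.
group_mean   <- t(apply(counts, 1, function(v) tapply(v, group, mean)))
group_median <- t(apply(counts, 1, function(v) tapply(v, group, median)))

print(group_mean)
print(group_median)
```

Comparing the mean with the median in each group gives a quick idea of whether a group average is shared by all three replicates or carried by a single sample.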

Are the library sizes wildly different between samples?
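
A quick way to answer that is to sum the counts in each column. The sketch below uses a simulated toy matrix (hypothetical values, not the poster's data) as a stand-in for the full count table; with the real data, the data frame `x` from the question would take its place. The 2-fold cutoff is an arbitrary illustration, not an edgeR rule.

```r
# Toy count matrix standing in for the full table in my_reads_count.txt
# (hypothetical, simulated values).
set.seed(1)
counts <- matrix(rpois(600, lambda = 50), nrow = 100,
                 dimnames = list(paste0("gene", 1:100),
                                 c("sample1_1", "sample1_2", "sample1_3",
                                   "sample2_1", "sample2_2", "sample2_3")))

# Library size = total number of counted reads per sample.
lib_size <- colSums(counts)

# Express each library relative to the median library size.
rel_size <- lib_size / median(lib_size)
print(round(rel_size, 2))

# Flag samples more than 2-fold away from the median
# (arbitrary illustrative cutoff).
outliers <- names(rel_size)[rel_size > 2 | rel_size < 0.5]
print(outliers)
```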

You might consider filtering out genes that do not exhibit minimal expression in at least 3 samples, which should remove your first gene (ENSG00000198842), and possibly your second gene.
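
As a rough sketch of such a filter in base R: the code below converts counts to counts per million (CPM) and keeps genes above a threshold in at least 3 samples. Two things here are assumptions, not values from the thread: the library size of 20 million reads per sample, and the cutoff of 1 CPM as the definition of "minimal expression". Whether the second gene survives depends on both.

```r
# Counts for the two genes shown in the question.
counts <- rbind(
  ENSG00000198842 = c(0, 3, 2, 2, 6666, 3),
  ENSG00000257017 = c(3, 3, 25, 2002, 29080, 4)
)
colnames(counts) <- c("sample1_1", "sample1_2", "sample1_3",
                      "sample2_1", "sample2_2", "sample2_3")

# ASSUMPTION: 20 million counted reads per sample. With the full count
# table, the column sums of that table would be used instead.
lib_size <- rep(2e7, ncol(counts))

# Counts per million: divide each column by its library size.
cpm_mat <- sweep(counts, 2, lib_size, "/") * 1e6

# ASSUMPTION: "minimal expression" means more than 1 CPM,
# required in at least 3 samples.
min_cpm <- 1
min_samples <- 3
keep <- rowSums(cpm_mat > min_cpm) >= min_samples
print(keep)

# Count table restricted to the genes that pass the filter.
filtered <- counts[keep, , drop = FALSE]
```

The filtered table would then be passed to DGEList in place of the unfiltered counts, with the rest of the workflow unchanged.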

ADD COMMENT • link 12.2 years ago by Steve Lianoglou 5.2k

0

Do I have to remove these genes before the edgeR analysis, or after?

ADD REPLY • link 12.2 years ago by valentina ▴ 60

0

You should remove them before you call the DGEList function. I don't think there's a standard way of choosing the cutoff. However, the edgeR User's Guide describes several ways to remove genes based on low expression levels.

ADD REPLY • link 11.6 years ago by Jason ▴ 940