Question

edgeR artifacts on p-value and/or fold change?

0

Entering edit mode

8.5 years ago

moxu ▴ 510

I use RSEM to align and quantify RNA-seqs and then use edgeR to do the differential gene expression analysis with or without ERCC normalization.

The problem is that I almost always see pairs of "wings" at the bottom of the -log(P-value) ~ log(FC) plot. For instance,

-log(P-value) vs log(FC)

At the bottom of the plot, you can see a set of genes on the left and another somewhat symmetric set of genes on the right are obviously separated from the other genes. I have seen such a pattern over and over again, with or without ERCC normalization before feeding the expression data to edgeR.

Any explanations?

rna-seq RSEM edgeR • 4.4k views

ADD COMMENT • link 8.5 years ago by moxu ▴ 510

0

Entering edit mode

Have a look at the counts for those genes, presumably there's a high degree of variation.

ADD REPLY • link 8.5 years ago by Devon Ryan 105k

0

Entering edit mode

Maybe you can use this script to find out what those genes are.

And this volcano plot instance in this page have a so-so similar "wings".

ADD REPLY • link 8.5 years ago by Farbod ★ 3.4k

0

Entering edit mode

I've used some other methods to find out the genes. Thanks for your input.

ADD REPLY • link 8.5 years ago by moxu ▴ 510

0

Entering edit mode

I checked the expression file, and you are right -- they are the genes with 0 counts in the control samples, and low counts in the treatment samples.

Since edgeR uses negative binomial distribution to calculate the p-values, the p-values should be OK for theses genes. The problem now is the fold change -- they should be infinite before adjustment. So what shall we do with the genes on the wings? Remove them by setting expression or fold change threshold, or just leave them there?

ADD REPLY • link 8.5 years ago by moxu ▴ 510

0

Entering edit mode

Did you filter out constantly lowly expressed genes in all samples by cpm or counts before downstream analysis ? Usually these genes are already filtered out because they do not provide enough statistical evidence for judgement.

ADD REPLY • link 8.5 years ago by GZ1995 ▴ 410

0

Entering edit mode

No I didn't filter by counts for the reason stated above: negative binomial distribution should take care of whatever count values to give the correct p-values, so why bother? I am just not sure how the fold change is computed for a gene with 0 counts in one or more groups.

ADD REPLY • link 8.5 years ago by moxu ▴ 510

0

Entering edit mode

It matters for FDR ajustment and estimate mean-variance relationship (dispersion) for NB model as suggested by edgeR author here. edgeR computes logFC based on fitting NB generalized linear model with empirical bayes based shrinkage, not simply using raw counts, so it shouldn't be infinity.

ADD REPLY • link 8.5 years ago by GZ1995 ▴ 410

0

Entering edit mode

OK, now I see. It sounds reasonable to filter out low count genes. What would be a good method to filter?

ADD REPLY • link 8.5 years ago by moxu ▴ 510

1

Entering edit mode

Read Section 2.6 of edgeR userguide.

ADD REPLY • link 8.5 years ago by GZ1995 ▴ 410

0

Entering edit mode

Hi,

For genes with zero counts I have found similar situation in Trinity group, please search for:

>     It reflects some of the difficulties of trying to map reads back to partial transcript contigs.
>     
>     Here's some info on Kallisto and rna-seq Sleuth:
>     https://liorpachter.wordpress.com/2015/08/17/a-sleuth-for-rna-seq/
>     
>     If you're not getting read counts for some of your key targets, then that's clearly a problem.  You can try running your PE reads as
> single reads with the existing RSEM pipeline, and you might recoup the
> reads that correspond to the transcript that was reconstructed.

ADD REPLY • link 8.5 years ago by Farbod ★ 3.4k

score 1 · Accepted Answer · 2016-10-13

I think the "wings" are caused by heuristic fold change computation for genes with 0 count in either the control or the treatment group -- fold change should be infinity, right? But apparently edgeR used a heuristic method (e.g. use global adjusted count values or add some background noise to each 0 count) to avoid the divided by 0 problem. That made the fold changes of such genes to deviate from those of other genes, and thus the "wings" separation.