I use RSEM to align and quantify RNA-seqs and then use edgeR to do the differential gene expression analysis with or without ERCC normalization.
The problem is that I almost always see pairs of "wings" at the bottom of the -log(P-value) ~ log(FC) plot. For instance,
At the bottom of the plot, you can see a set of genes on the left and another somewhat symmetric set of genes on the right are obviously separated from the other genes. I have seen such a pattern over and over again, with or without ERCC normalization before feeding the expression data to edgeR.
Any explanations?
Have a look at the counts for those genes, presumably there's a high degree of variation.
Maybe you can use this script to find out what those genes are.
And this volcano plot instance in this page have a so-so similar "wings".
I've used some other methods to find out the genes. Thanks for your input.
I checked the expression file, and you are right -- they are the genes with 0 counts in the control samples, and low counts in the treatment samples.
Since edgeR uses negative binomial distribution to calculate the p-values, the p-values should be OK for theses genes. The problem now is the fold change -- they should be infinite before adjustment. So what shall we do with the genes on the wings? Remove them by setting expression or fold change threshold, or just leave them there?
Did you filter out constantly lowly expressed genes in all samples by
cpm
or counts before downstream analysis ? Usually these genes are already filtered out because they do not provide enough statistical evidence for judgement.No I didn't filter by counts for the reason stated above: negative binomial distribution should take care of whatever count values to give the correct p-values, so why bother? I am just not sure how the fold change is computed for a gene with 0 counts in one or more groups.
It matters for FDR ajustment and estimate mean-variance relationship (dispersion) for NB model as suggested by edgeR author here. edgeR computes logFC based on fitting NB generalized linear model with empirical bayes based shrinkage, not simply using raw counts, so it shouldn't be infinity.
OK, now I see. It sounds reasonable to filter out low count genes. What would be a good method to filter?
Read Section 2.6 of edgeR userguide.
Hi,
For genes with zero counts I have found similar situation in Trinity group, please search for: