Hi all, I'm performing some differential analysis with edgeR and I need to take into account GC content in my regions. From edgeR manual I see I should set GC percentage as "offset" in linear model. Unfortunately there is very little documentation about this, so I need some hint.
Offsets can be passed both to estimateDisp
and glmFit
functions, but it is not clear how they are handled in the two steps. Setting offset in estimateDisp, i.e.
y = DGEList(…)
y = calcNormFactors(y)
y$offset = gc_content #this is a vector with %GC for every region in y
y = estimateDisp(y, design)
increases the dispersion values (common, tagwise and trended) in the DGEList object, with negative effect on final pvalues and FDR. Setting it at GLM, i.e.
y = DGEList(…)
y = calcNormFactors(y)
y = estimateDisp(y, design)
fit = glmFit.default(y, design, offset=gc_content, dispersion=y$tagwise.dispersion)
does exactly the opposite: all p-values and FDR become "incredibly" significant, also average logCPM for each region is increased 10-fold (which, I suspect, is the reason why I have lower p-values).
What would be the proper way to include GC information in edgeR?
d
Do you have a notable GC bias between some of the samples? If not, don't bother doing this.
Influence of GC in standard differential expression analysis is limited, since you are comparing gene A versus gene A. Even if it's a high GC gene, it's the same for both samples and therefore no need to correct for.