Question: Why can residuals be used as new input in bioinformatic analysis?

19 months ago

Lalaland ▴ 40

I have seen some people use residuals (obtained from a regression model that adjusts the dependent variable for covariates) as new input for bioinformatics analysis, instead of the original input. I understand that residuals represent the difference between the observed value and the expected value.

However, I am still having trouble following this. How can the residuals be used as input in the analysis?

For example, in PrediXcan, residuals were used as the new gene expression data.

Lastly, is it reasonable to use residuals in differential analysis or correlation analysis?

Any comments/suggestions will be appreciated! Thanks!

predixcan residuals • 673 views

updated 19 months ago by LChart 4.6k • written 19 months ago by Lalaland ▴ 40

score 2 · Answer 1 · 2023-05-05

The topic underlying all of the tasks you mentioned (PrediXcan, differential [expression] analysis, [gene] correlation) is really one of variance components or variance partitioning, and can be effectively summarized by the mixed linear model

$$y = X\beta + Zu + \varepsilon$$

or more explicitly by breaking out effect estimates we "care" about (X/beta and Z/u) from covariates or nuisances (C/t, W/v):

$$y = X\beta + Zu + Ct + Wv + \varepsilon$$

Importantly, if t and v are not 0, then leaving the covariates out of the model can cause issues with the estimation of beta and u. In particular, correlations (or really non-orthogonality) between (X, Z) and (C, W) lead to omitted variable bias. Even if these are orthogonal, the standard errors of the estimates of beta and u will be larger than if the total variance of y had been appropriately partitioned, and indeed the efficiency of the estimators for beta and u can suffer immensely.
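To make the two failure modes concrete, here is a small simulation sketch. It assumes a fixed-effects-only version of the model (no Z/u or W/v terms), a single predictor `x`, a single covariate `c`, and made-up effect sizes `beta = 1` and `t = 2`; the helper `slope_and_se` is a hypothetical name, not from any package.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
beta, t = 1.0, 2.0  # assumed true effects: x is of interest, c is a nuisance


def slope_and_se(design, y, col=1):
    """OLS estimate and standard error for one column of the design matrix."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (design.shape[0] - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coef[col], np.sqrt(cov[col, col])


ones = np.ones(n)
c = rng.normal(size=n)

# Case 1: x is correlated with c. Leaving c out biases the slope for x
# (expected value is roughly beta + t * cov(x, c) / var(x) = 2.2 here).
x_corr = 0.6 * c + 0.8 * rng.normal(size=n)
y_corr = beta * x_corr + t * c + rng.normal(size=n)
biased_slope, _ = slope_and_se(np.column_stack([ones, x_corr]), y_corr)

# Case 2: x is independent of c. The slope is still about right without c,
# but its standard error is inflated because c's variance sits in the noise.
x_ind = rng.normal(size=n)
y_ind = beta * x_ind + t * c + rng.normal(size=n)
slope_omit, se_omit = slope_and_se(np.column_stack([ones, x_ind]), y_ind)
slope_full, se_full = slope_and_se(np.column_stack([ones, x_ind, c]), y_ind)

print(f"correlated, c omitted:   slope = {biased_slope:.2f}")
print(f"independent, c omitted:  slope = {slope_omit:.2f}, se = {se_omit:.3f}")
print(f"independent, c included: slope = {slope_full:.2f}, se = {se_full:.3f}")
```

The second case is the one people tend to overlook: nothing is biased, but the test for beta loses power.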

Residualization is non-optimal but conservative and fits the surrogate model

$$y = Ct + Wv + \varepsilon'$$

so that

$$y - C\hat{t} - W\hat{v} = X\beta + Zu + \varepsilon$$

Note that the effects of C and W are well and truly gone; however, the resulting estimates for beta and u will be deflated, with the extent of deflation depending on the correlation between (X, Z) and (C, W). In effect, the residualization approach assumes that the covariates explain the maximum possible amount of variation, whereas the full linear model will partially apportion the variance.
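The deflation can be checked numerically. The sketch below again assumes a fixed-effects-only model with one predictor `x` and one covariate `c`, with invented effect sizes. In that simple setting (intercept in both fits) the residualized slope equals the full-model slope multiplied by 1 - R², where R² comes from regressing `x` on the covariates, so the shrinkage is governed entirely by how much of `x` the covariates explain.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
beta, t = 1.0, 2.0  # assumed true effects, for illustration only

c = rng.normal(size=n)                   # covariate (e.g. a technical factor)
x = 0.6 * c + 0.8 * rng.normal(size=n)   # predictor of interest, correlated with c
y = beta * x + t * c + rng.normal(size=n)


def ols(design, response):
    """Least-squares coefficients for response ~ design."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


ones = np.ones(n)
C = np.column_stack([ones, c])

# Full model: x and c are fitted jointly.
beta_full = ols(np.column_stack([ones, x, c]), y)[1]

# Residualization: regress y on the covariates first, then fit x to the residuals.
resid_y = y - C @ ols(C, y)
beta_resid = ols(np.column_stack([ones, x]), resid_y)[1]

# Share of the variance of x that the covariates explain.
resid_x = x - C @ ols(C, x)
r2_x_on_c = 1.0 - resid_x.var() / x.var()

print(f"full model slope:        {beta_full:.3f}")
print(f"residualized slope:      {beta_resid:.3f}")
print(f"full slope * (1 - R^2):  {beta_full * (1.0 - r2_x_on_c):.3f}")
```

With `x` and `c` independent, R² is near zero and the two slopes essentially agree, which is the situation in which residualizing costs little.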

So that's what residualization is doing. Why can it be done? Obviously if X = C (or X = P*C for some P) then this completely fails, since the covariates completely explain the phenotype (or genotype or whatever). So this generally only makes sense if: (1) t is of the same or smaller order as beta [same for u/v]; (2) X is "far from" C (same for W,Z).
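As a quick illustration of the degenerate case, the sketch below sets `x = 3 * c` (so P is just the scalar 3) in a toy fixed-effects model with made-up effect sizes. After residualizing on `c` there is nothing left for `x` to explain, and the joint model cannot separate the two either.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
beta, t = 1.0, 2.0  # assumed true effects, for illustration only

c = rng.normal(size=n)
x = 3.0 * c                      # x is an exact linear function of the covariate
y = beta * x + t * c + rng.normal(size=n)

ones = np.ones(n)
C = np.column_stack([ones, c])

# Residualize y on the covariates.
coef_c, *_ = np.linalg.lstsq(C, y, rcond=None)
resid_y = y - C @ coef_c

# Fit x to the residuals: the slope is zero up to floating-point error.
coef_x, *_ = np.linalg.lstsq(np.column_stack([ones, x]), resid_y, rcond=None)
beta_resid = coef_x[1]

# The joint design [1, x, c] has only two independent columns, so beta and t
# are not separately identifiable in the full model either.
full_rank = np.linalg.matrix_rank(np.column_stack([ones, x, c]))

print(f"residualized slope for x: {beta_resid:.2e}")
print(f"rank of the full design:  {full_rank} (3 columns)")
```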

In nearly all cases, the primary drivers of variance for gene expression are going to be things like (a) batch; (b) library size; (c) sample input amount; (d) RNA integrity; and other purely technical factors. If the experiment was well designed, these should be largely independent of genotype or phenotype, and residualization can be performed with little consequence. However, in some cases (indeed, in far too many cases) proper experimental design was not considered, leading to confounding, and in such cases residualization will necessarily do more harm than good, by attributing to the covariates any and all biological effects of interest.
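One way to check a design before residualizing is to ask how much of the variable of interest survives once the covariates are regressed out of it. The sketch below uses three invented case/control layouts across two batches; `retained_fraction` is a hypothetical helper name, and the 40-sample designs are purely illustrative.

```python
import numpy as np


def retained_fraction(variable, covariates):
    """Fraction of the variance of `variable` left after regressing out
    `covariates` (plus an intercept), i.e. 1 - R^2."""
    variable = np.asarray(variable, dtype=float)
    design = np.column_stack([np.ones(len(variable)), covariates])
    coef, *_ = np.linalg.lstsq(design, variable, rcond=None)
    resid = variable - design @ coef
    centered = variable - variable.mean()
    return (resid @ resid) / (centered @ centered)


# Two batches of 20 samples each.
batch = np.repeat([0, 1], 20)

# Balanced: 10 cases and 10 controls in every batch.
balanced = np.tile([0, 1], 20)

# Partially confounded: 18 cases + 2 controls in batch 0, the reverse in batch 1.
partial = np.concatenate([np.ones(18), np.zeros(2), np.ones(2), np.zeros(18)])

# Fully confounded: every case in batch 1, every control in batch 0.
confounded = batch.copy()

for name, phenotype in [("balanced", balanced),
                        ("partially confounded", partial),
                        ("fully confounded", confounded)]:
    print(f"{name:22s} retained = {retained_fraction(phenotype, batch):.2f}")
```

A retained fraction near 1 suggests residualizing on batch will cost little; a value near 0 means the batch correction will absorb the phenotype effect along with the technical variation.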