I'm looking for references and comments regarding the validity of the following method for data denoising, which I found while reading code that analyzes a gene expression dataset. The dataset consists of columns x1, ..., xn of length m (expression levels for n genes observed in m samples). Someone with knowledge of the dataset said that if we look at the top 10% of columns with the maximum variance, we find that those columns have the maximum variance due to artifacts (noise) in the measurement of the expression levels of the corresponding genes. In addition, we know that the remaining 90% of the columns are either not affected by any noise, or are affected by noise to a much lesser degree than the top 10%.
Now, in the code that I'm examining, the following method is used to remove the noise from the dataset. They calculate the principal components y1, ..., yn of the variables x1, ..., xn. They take y1 (the leading principal component) and assume (this is my guess) that it mostly captures the variance caused by the artifacts described above. Then they transform the data (all n columns) using the following rule:
xi = xi - (projection of xi onto y1).
That is, from each column they remove the component that is collinear with y1, and keep the component that is orthogonal to y1.
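To make the transformation concrete, here is a minimal NumPy sketch of what I understand the code to be doing (the toy matrix and variable names are my own, not from the original code); it computes the leading principal component of the centered data and subtracts from each column its projection onto that component:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy expression matrix: m samples (rows) x n genes (columns)
X = rng.normal(size=(50, 20))

# center each column before computing principal components
Xc = X - X.mean(axis=0)

# leading principal component via SVD: the first left-singular vector,
# scaled by the first singular value, is the score vector y1 (length m)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
y1 = U[:, 0] * s[0]
u1 = y1 / np.linalg.norm(y1)  # unit vector along y1

# remove from each column xi its projection onto y1:
# xi <- xi - (u1 . xi) * u1, done for all columns at once
X_denoised = Xc - np.outer(u1, u1 @ Xc)

# sanity check: every column is now orthogonal to y1
assert np.allclose(u1 @ X_denoised, 0)
```

After this step every column lies in the subspace orthogonal to y1, so the variance along the leading principal direction has been removed entirely.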
Can anybody please provide any references for this method or comment on its applicability in this case?
You might want to look at approaches like ComBat and SVA, which remove such biases in a statistically controlled way. Alternatively, you could model the "noise" as a covariate in a linear model. Knowledge of the experimental design is important for judging to what extent the "noise" would be expected to affect the results.
I agree; look at SVA. It does basically what (I think) you're describing, except on the residuals of the model that you specify. Some have found that ComBat or PEER does a better job of batch-effect removal, so you could have a look at those as well.