My data consists of 20,000 rows (genes) and 300 columns (samples); 5 of the 300 samples are cell lines and the remaining 295 are tumor samples.
I'm currently attempting to adjust the expression values for a confounding variable using linear regression. In summary, I have a data frame of gene expression values and a vector giving the value of the confounding variable for each of the 300 samples.
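For concreteness, my inputs are organised roughly like the sketch below (simulated placeholder values, not my real data):
set.seed(1)
df = matrix(rnorm(20000 * 300), nrow = 20000,
            dimnames = list(paste0("gene", 1:20000), paste0("sample", 1:300)))
group      = c(rep(0, 5), rep(1, 295))          # 0 = cell line, 1 = tumor sample
confounder = c(rep(1, 5), runif(295, 0.5, 1))   # one confounder value per sample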
Below is my first attempt at this:
library(limma)
design = model.matrix(~group+confounder)   # intercept + cell line/tumor indicator + confounder
fit = lmFit(df, design)                    # one linear model per gene
adjValues = fitted(fit)                    # fitted values used as the "adjusted" expression
The resulting design matrix looks something like:
            (Intercept)  group  confounder
Sample #1             1      0        1.00
Sample #2             1      0        1.00
Sample #3             1      0        1.00
Sample #4             1      0        1.00
Sample #5             1      0        1.00
# ======== Above: cell lines; Below: tumor samples ========
Sample #6             1      1        0.91
Sample #7             1      1        0.75
...
I thought this would be straightforward, but it produces a strange result: the adjusted expression values of the 5 cell lines are identical across all genes. The same thing happens when I change the design matrix to design = model.matrix(~confounder).
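Here is a small self-contained example with simulated data that reproduces the behaviour I'm seeing (toy dimensions and names, not my real dataset):
library(limma)
set.seed(1)
toy_expr = matrix(rnorm(100 * 10), nrow = 100)   # 100 genes x 10 samples
toy_grp  = c(rep(0, 5), rep(1, 5))               # 5 "cell lines", 5 "tumors"
toy_conf = c(rep(1, 5), runif(5, 0.5, 1))        # confounder equals 1 for every cell line
toy_fit  = lmFit(toy_expr, model.matrix(~ toy_grp + toy_conf))
toy_adj  = fitted(toy_fit)                       # fitted values, i.e. coefficients %*% t(design)
# the five "cell line" columns come out identical for every gene:
all.equal(toy_adj[, 1], toy_adj[, 5])            # TRUE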
What is the problem with how I am currently using linear regression to adjust the gene expression values for this confounder?
To clarify, I'm attempting to replicate the correlation analysis from the paper below: https://www.biorxiv.org/content/biorxiv/early/2018/09/20/422592.full.pdf
One of the steps they perform is adjusting the expression levels of all samples for tumor purity using linear regression, which is the step I'm trying to reproduce. Now I'm wondering whether it would be more appropriate to run the regression on each sample instead.
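For what it's worth, I've also seen limma's removeBatchEffect() mentioned as a way to regress a continuous covariate out of an expression matrix while protecting the effects in a supplied design. A rough sketch using the objects from my code above (I'm not sure this matches the paper's exact procedure):
library(limma)
keep_design = model.matrix(~ group)                    # cell line / tumor effect to preserve
adjExpr = removeBatchEffect(df,
                            covariates = confounder,   # continuous confounder to regress out
                            design = keep_design)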
Taking the time to critically assess papers (and to answer questions about your own work) is important, and I suspect this requires people to study a limited number of topics in depth.
In other words, I am not immediately sure what to say about this specific paper. However, if I get a chance to review it and collect some thoughts, I will update with an additional comment.