Hi,
I am really bad at stats, so I am really sorry if this question is inappropriate or too stupid (also, I wasn't sure if this was the right forum... if not, apologies again!).
A collaborator asked me to correct our bulk RNA-seq dataset (6 human samples with 21,000 genes in total, mixed sex and age) for age and sex using linear regression. The aim is to take into account the effect of these 2 (potential) confounding variables and to (re)perform the remaining analysis on the dataset. So I did the linear regression as follows:
model <- lm(Gene ~ age + sex, data = df)
but now I am confused about which "data" I need to take from the model to "replace" my original data.
If I need to redo (for example) cluster analysis or PCA with the genes corrected for age and sex, which values do I have to take? The residuals, the fitted/predicted values, or something else?
As I understand it, the residuals are what is left over after fitting a model, so they are the error, since:
Residual = Observed value - Predicted value
For this reason a residual can be positive if the observed value is above the regression line and negative if it is below. So can I not use the residuals for downstream analysis, or do I have to take their absolute values? I was thinking that I should probably take the fitted/predicted values instead, or am I wrong?
Apologies for the question (if it is too stupid).
Camilla
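For readers landing on this question: a minimal sketch of what "correcting for age and sex" with `lm` usually means, assuming the expression values are already normalized and on a log scale. All numbers below are simulated, and the object names `corrected`, `expr` and `expr_corrected` are made up for illustration. The fitted values are the part of the expression explained by age and sex, so the part with those effects removed is the residuals, with their sign kept. The gene's mean is commonly added back so the values stay on the original scale.

```r
set.seed(42)

# Hypothetical sample annotation for 6 samples (values are made up).
df <- data.frame(
  age = c(34, 51, 47, 38, 60, 45),
  sex = factor(c("F", "M", "F", "M", "F", "M"))
)
# Simulated log-expression of one gene that depends on age and sex.
df$Gene <- 8 + 0.03 * df$age + 0.5 * (df$sex == "M") + rnorm(6, sd = 0.2)

model <- lm(Gene ~ age + sex, data = df)

# Residuals = observed - fitted; they sum to zero, so negative values are expected.
# Adding the mean back keeps the corrected values on the original scale.
corrected <- residuals(model) + mean(df$Gene)

# Same idea for a whole matrix (genes in rows, samples in columns):
# lm() accepts a matrix response and fits one model per column.
expr <- matrix(rnorm(5 * 6, mean = 8), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("s", 1:6)))
fit_all <- lm(t(expr) ~ age + sex, data = df)
expr_corrected <- t(residuals(fit_all)) + rowMeans(expr)
```

Taking absolute values of the residuals would discard the direction of the deviation, and the fitted values contain only the age and sex effects, which is the part meant to be removed. Note also that with 6 samples and 3 coefficients only 3 residual degrees of freedom remain, so the estimates are noisy.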
Thank you for the wake-up call (if you could do the same with my boss as well, it would be great and would make my life and my PhD easier...). I am NOT a bioinformatician so, as you highlight, I am fully aware that I do not have the necessary experience and background knowledge. I just think it's nice to know what I am doing and why, rather than copying and pasting someone else's line into my script. I can understand that you guys get annoyed/bored with the same basic questions, so feel free to delete any post you think is redundant or inappropriate, or take any action you need to.
Don't worry, there is no need for deletion, the questions are fine. PIs can be stubborn at times, but often they are satisfied once results come in, so why not just try limma and present that? It does in fact use a linear model, so just use it and present the results, which will be statistically far more sound than custom solutions, and tell them you used a well-established, highly cited linear-model-based framework. Would that be ok?
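A rough sketch of what that could look like, not a definitive pipeline. It assumes log2-scale normalized expression with genes in rows, and it assumes there is a condition of interest, here called `group`, which the original post does not mention. The data are simulated and every object name is hypothetical. `lmFit`, `eBayes`, `topTable` and `removeBatchEffect` are standard limma functions.

```r
library(limma)

set.seed(1)
n_genes <- 200

# Hypothetical sample annotation; 'group' is an assumed condition of interest.
meta <- data.frame(
  group = factor(c("ctrl", "ctrl", "ctrl", "treated", "treated", "treated")),
  age   = c(34, 51, 47, 38, 60, 45),
  sex   = factor(c("F", "M", "F", "M", "F", "M"))
)

# Simulated log2 expression matrix: genes in rows, samples in columns.
logexpr <- matrix(rnorm(n_genes * 6, mean = 8), nrow = n_genes,
                  dimnames = list(paste0("gene", 1:n_genes), paste0("s", 1:6)))

# For differential expression: keep age and sex in the design
# rather than testing on pre-corrected data.
design <- model.matrix(~ group + age + sex, data = meta)
fit <- eBayes(lmFit(logexpr, design))
res <- topTable(fit, coef = "grouptreated", number = Inf)

# For PCA / clustering only: remove the age and sex effects,
# passing the group design so the effect of interest is preserved.
corrected <- removeBatchEffect(logexpr,
                               batch = meta$sex,
                               covariates = meta$age,
                               design = model.matrix(~ group, data = meta))
pca <- prcomp(t(corrected))
```

The design choice here is that the corrected matrix is used only for visualization and clustering, while the statistical test works on the uncorrected values with age and sex as covariates, so the degrees of freedom are accounted for properly.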
Hi ATpoint, I came across this post while looking for ways to solve a similar question of my own, and I would be happy to hear any input you may have on it! Note that my question, unlike the OP's here, is about using linear models through limma to adjust expression.