Covariate correction which data take for downstream analysis?
2
0
Entering edit mode
3.7 years ago
camillab. ▴ 160

Hi,

I am really bad in stats so I am really sorry if this question is inappropriate or too stupid (also I wasn`t if this was the right forum...if not, apologies again!).

A collaborator asked me to correct for age and sex using linear regression) our bulk-RNAseq dataset (6 human samples with 21000 genes total, mixed-sex and age). The aim is to take into account the effect of these 2 (potential) confounding variables and to (re)perform the remaining analysis on the dataset. So I did the linear regression as follow:

model <- lm(Gene ~ age +sex , data=df)

but now I am confused which "data" I need to take from the model to "replace" my original data. If I need to re-do (for example) cluster analysis or PCA with the genes corrected by age and sex, which values do I have to take? the residues or the fitted/predicted values or ..? Basically, the residuals are what is left over after fitting a model so they are basically the error since :

Residual = Observed value - Predicted value

For this reason the residual can be positive if t above the regression line and negative if below, so I cannot take the residual to perform downstream analysis or do I have to take the absolute values of the residues? So I was thinking that probably I should take the fitted/predicted values, or am I wrong?

Apologies, for the question (if too stupid)

Camilla

R linear-regression residual fitted-values • 1.8k views
ADD COMMENT
2
Entering edit mode
3.7 years ago
ATpoint 85k

The list of posts on the same underlying problem gets longer and longer, why don't you simply use limma-trend as I suggested you before, for the reasons that 1) you do not seem to have the necessary experience and background knowledge to perform a custom analysis and 2) custom analysis is not necessary as expert software for this very common problem exists, is more reliable and also statistically more powerful.

Please don't feel offended, this is not meant to call you out, it is rather a wake-up call. The analysis you seem to be doing for weeks now can be done literally in five lines of code via limma, and it will take care of all the issues you're facing. The code for limma-trend is even there in this linked post https://support.bioconductor.org/p/125144/ (pipeline number two), just take your FPKMs, put on log2 scale, add the covariates you want to control for and analysis will be done in like no time.

Your previous threads on the same topic:

PCA - untreated sample cluster separated

p.adjusted p value

NO FDR <0.05 in the whole dataset

Error on Linear regression bulkRNAseq

Error in y - mu : non-numeric argument to binary operator - linear regression loop

ADD COMMENT
0
Entering edit mode

thank you for your the wake-up call (if you can do it with my boss as wells tit would be great making my life and my PhD easier..). I am NOT a bioinformatician so, as you highlight, I am fully aware I do not have the necessary experience and background knowledge. I just think it`s nice to know what I am doing and why, rather than copy and paste someone else line in my script. I can understand you guys get annoyed/bored with the same basic question so feel free to delete any post you think is redundant and inappropriate or take any action you need to.

ADD REPLY
0
Entering edit mode

Don't worry, there is no need for deletion, the questions are fine. PIs can be stubborn at times, but often they are satisfied once results come in, so why not just trying limma and present that. It is in fact a linear model it uses so just use it, present the results which will be statistically way more accurate than custom solutions, and just tell them you used a well-established highly cited linear model-based framework, would that be ok?

ADD REPLY
0
Entering edit mode

hi ATpoint , I came across this post while looking for ways to solve a similar question that I have, would be happy to hear any input that you may have about it! Note that my question, unlike OP's in here, is on using linear models through limma for adjusting expression.

ADD REPLY
1
Entering edit mode
3.7 years ago
Martombo ★ 3.1k

You should add your variable of interest in your linear model:

model <- lm(Gene ~ age + sex + contrast , data=df)

Also, 6 patients are very few to estimate an age- or sex-related effect

ADD COMMENT
0
Entering edit mode

Also, 6 patients are very few to estimate an age- or sex-related effect

Despite I am not very good in stat, I totally agree but I cannot find a paper to prove that there are too few samples...

What do you mean the variable of interest? I would like both sex and age per gene so technically that`s my code:

#select genes
genelist <- df%>% select(5:21015)
#new dataframe
newD <- df %>% select(age, sex)
for (i in 1:length(genelist)) {
  formula <- reformulate(c("age", "sex"), response=names(genelist)[i])  
  model <- lm(formula, data = df)
  newD[[names(genelist)[i]]] <- predict(model, newdata=newD)
}
newD
ADD REPLY

Login before adding your answer.

Traffic: 2958 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6