How to regress out age and sex using limma removeBatchEffect
2
0
Entering edit mode
2.5 years ago
mropri ▴ 160

I have a protein expression data frame with a metadata data frame which includes age and sex:

nph_csf_metadata =

age sex bam tau

70 f 5 2

75 m 6 1

72 m 4 1

71 f 4 2

I want to keep the bam and tau load but remove the effects of age and sex from my protein expression data using removeBatchEffect.

I have tried this:

log_norm_prot <- log10(norm_prot) log_norm_prot <- t(log_norm_prot)

design = model.matrix(~bam + tau, nph_csf_metadata)

reg_log_norm_prot = removeBatchEffect(log_norm_prot, covariates = nph_csf_metadata$age, design = design)

This helps me regress out age but how can I regress both out at the same time using removeBatchEffect. Appreciate any help.

removeBatchEffect expression data Limma • 4.0k views
ADD COMMENT
5
Entering edit mode
2.5 years ago
Gordon Smyth ★ 7.7k

You just use removeBatchEffect with batch = sex, covariates = age and design = design. As the help page ?removeBatchEffect explains, the batch argument should be a categorical factor (like sex) and the covariates argument is for numerical covariates (like age).

Since you are already familiar with the covariates argument, I wonder why you didn't do the above already?

ADD COMMENT
0
Entering edit mode

Thank you for your help, I used covariates but was only able to regress out one variable. Will try the method you advised and also check covariates again.

ADD REPLY
0
Entering edit mode

thank you, this worked. Appreciate your help so much.

ADD REPLY
0
Entering edit mode

Hello,

I am dealing with a bit similar question, using limma to look for DEG between two genotypes (I need to adjust for sex and rin). I tried two methods and was hoping the results from the two methods should be the same. However, it turned out to be different.

My first method is firstly regress out the effects of sex and rin and then do lmFit using only genotype design.

design2<- model.matrix(~genotype, data = metadata)
resids <- limma::removeBatchEffect(rawcount, batch = metadata$sex, covariates = metadata$rin, design = design2)
fit1<- lmFit(resids, design = design2)
fit1<- eBayes(fit1)
topTable(fit1, n=5, coef = 2)


 logFC   AveExpr        t      P.Value    adj.P.Val
ENSMUSG00000041702 0.4989305 -3.277418 7.797293 4.865979e-08 0.0007863908
ENSMUSG00000023845 0.4915813 -7.805336 7.124747 2.267034e-07 0.0017263994
ENSMUSG00000101299 1.1291242  1.577017 6.976884 3.204751e-07 0.0017263994
ENSMUSG00000036412 1.6012259  4.197494 6.647425 6.998824e-07 0.0022593589
ENSMUSG00000071793 0.4637242 -2.876400 6.522338 9.447527e-07 0.0022593589
                          B
ENSMUSG00000041702 8.321131
ENSMUSG00000023845 6.951328
ENSMUSG00000101299 6.640689
ENSMUSG00000036412 5.936721
ENSMUSG00000071793 5.665296

The second method is to put all the factors in lmFit and extract the stats for genotype factor.

design<- model.matrix(~ genotype+sex+rin, data = metadata)
> fit<- lmFit(rawcount, design)
> fit<- eBayes(fit)
topTable(fit, n=5, coef = 2)



 logFC   AveExpr        t      P.Value   adj.P.Val        B
ENSMUSG00000041702 0.4989305 3.4930444 7.077032 4.162659e-07 0.006727273 6.254381
ENSMUSG00000023845 0.4915813 3.7044415 6.470208 1.623163e-06 0.011414309 5.073041
ENSMUSG00000101299 1.1291242 0.7638847 6.353731 2.118862e-06 0.011414309 4.839656
ENSMUSG00000036412 1.6012259 1.1705371 6.055775 4.220584e-06 0.014002018 4.233413
ENSMUSG00000026452 0.7866513 5.9985849 5.926864 5.704534e-06 0.014002018 3.967167

I am wondering why the results from the two methods are different (logFC are the same, but p value are different). Many thanks.

ADD REPLY
0
Entering edit mode

The batch correction should be done as part of the linear model, not prior to the linear model. So the second method is right and the first is wrong. There have been many posts about this over the years, especially in the Bioconductor support site.

The help page ?removeBatchEffects says that it should not be used as input to lmFit. removeBatchEffect is only for plotting and exploration purposes, not for linear modelling and differential expression.

ADD REPLY
0
Entering edit mode

Thank you so much. I used the second method for DEG identification. But I would also like to get the expression data after removing the effects of sex and rin, because I want to plot heatmaps to show the expression of some selected genes expression affected only by genotype.

Can I ask how to get the expression data that are only affected by genotype? Many thanks.

ADD REPLY
0
Entering edit mode

Can I ask how to get the expression data that are only affected by genotype?

Yes, that is what removeBatchEffect is for (as I think I said in my previous comment), for when you want to make a plot with the batch or covariate effects removed.

ADD REPLY
0
Entering edit mode

Yes I just learned the removeBatchEffect function from ur previous comments 13 months ago. And i wanted to confirm that my understanding was correct, so I compared the results from the two methods and thought if the results are the same then my understanding is correct. But now I learned that this way of confirming is wrong. thanks a lot for your enlightenments.

ADD REPLY
0
Entering edit mode
2.5 years ago
machan • 0

afaik, limma::removeBatchEffect can accept two batches i.e specifying batch = (first batch) and batch2 = (second batch)

ADD COMMENT
0
Entering edit mode

so would first batch be batch = nph_csf_metadata$age, and second batch2 = nph_csf_metadata$sex?

ADD REPLY
0
Entering edit mode

I tried that and it computes but I feel like its not correct because it is giving me many values of infinity. Is there a way to use the covariate to regress out age and sex at the same time?

ADD REPLY
0
Entering edit mode

Do you need the corrected assay? cause the removeBatchEffect affects only the assay for the visualization hence you need to extract it from the variance-stabilized assay - otherwise you can use

  1. combat seq to remove the effect of the age or sex
  2. you can try having your design as: ~age+sex+[the_rest_of_your_model]
  3. check if there is something like the log likelihood ratio test for you case where you will use both the null and the full model;
ADD REPLY
0
Entering edit mode

Will check that out. However, since this is proteomic expression, variance stabilized transformation is not the correct way to normalize the data. It was normalized, then log-transformed to correct the skew of the data, and then I just want to remove age and sex as variables to not have an affect on the data.

ADD REPLY

Login before adding your answer.

Traffic: 2695 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6