Question

How to remove multiple batch effects from RNA-seq data before limma differential gene expression analysis?

0

Entering edit mode

7 months ago

t.fortunato.asquini ▴ 10

Hello,

I am fairly new to RNA-seq data analysis and I am trying to remove batch effects from my RNA-seq dataset. The thing is that my study design is quite complex and I don't have clear experimental groups. Indeed my cohort consists of ~150 donors and I am using dimensionality reduction approaches on their clinical data to identify consistent pathological profiles; then I use the gene expression data to see how it correlates with the profiles I have found.

As of now, I know that age, sex, RQN and library size represent for sure confounding variables in my dataset. So far, I have accounted for their effect by simply modelling them in the design matrix like described below:

model.matrix(~0+group+age+sex+RQN+lib.size, data=data)

However, as I was saying, my group variable is not fixed. Therefore, I am afraid that the 'batch effect' of the nuisance variables changes at every attempt given the different correlation of such variables with the groups I use from time to time. This also makes it difficult to actually pinpoint the effect of the groups I am interested in (and which perform better).

For all of the above reasons, I would like to remove the batch effect upstream before performing any DEG analysis. My reasoning (which I hope is correct) is that in this way the groups I am comparing will have the same 'starting point' in terms of gene expression values. For this task I have been looking at ComBat-seq and limma::removeBatchEffect. These are my doubts for each of them:

ComBat-seq: this would be perfect for what I want to do but as I understand only one 'batch effect variable' can be specified. As I have explained, I have several. I don't know whether there is a way around this or there are similar tools that allow to specify multiple nuisance variables.
limma::removeBatchEffect(): I have read this post (super useful!) that explains what the function does. If I understand correctly, the removal of the batch effect is still embedded in the limma workflow and the estimation of their coefficients and those of the variables of interest are interdependent. If I change the group variable, the coefficients of the nuisance variables are going to change as well.

Is my understanding of the ComBat-seq and limma::removeBatchEffect() functions correct? Is there a better approach to remove the nuisance variables before doing DEG analysis with limma? Thanks in advance!

ComBat-seq limma RNA-seq removeBatchEffect batch-effect • 702 views

ADD COMMENT • link updated 7 months ago by Ram 44k • written 7 months ago by t.fortunato.asquini ▴ 10

1

Entering edit mode

Include confounders into the limma design. No need for upfront removal. Note that libsize is taken care of by normalizing usually. Are you sure you want this in the design? It's uncommon.

How did you diagnose that a variable is a confounder? Sex and age a 'clear' confounder over all the potential confounders a human dataset can habe? Thats again at least questionable.

ADD REPLY • link 7 months ago by ATpoint 86k

0

Entering edit mode

Thanks for the quick response! Yes I know that including lib.size in the design is uncommon and I have doubts about this. However, after normalisation I still see that it strongly correlates with the first PC (PCA done on logCPM), as does sequencing library concentration. My reasoning is that different library concentrations (in my cohort it goes from 0.5 to 200 nM) have an impact on the composition of the detected transcriptome and that this was not properly corrected for by the sequencing company. As a consequence, more concentrated RNA samples: (1) were sequenced more deeply, (2) have different transcriptomic composition altogether. At this point I don't know whether I should correct for library concentration (as it represents the root source of variation) or for library size (as the transcriptomic differences from sample to sample were 'transformed' by varying amplification steps).

How did you diagnose that a variable is a confounder?

That's actually a very good question. At the moment I am relying on the correlation between different variables and the first PCs (as I was mentioning above) to see what is the general impact on gene expression. I know this doesn't necessarily identify a confounding variable, as having the same effect in both 'groups of interest' doesn't lead to any significant changes. My disease group is overall younger due to premature death, so I am including age in the design. Sex is actually distributed quite well across groups, and I may remove it from the design. However, as I said, my groups of interest are not clearly defined, and I am afraid that at some point I will have groups with different sex distribution.

Do you have any suggestions on different ways to identify confounders?

Thanks again for the reply! I will simply add the nuisance variables to the design as you suggested.

ADD REPLY • link 7 months ago by t.fortunato.asquini ▴ 10