Hello,
I need to normalize gene expression (RNASeq) data to perform downstream analyses on it:
- unsupervised deconvolution using ICA
- pathway quantification using decoupleR. I want to see which pathways are important in my cohort and if some pathways are specific to subgroups of clinical features (histology, stage, ...).
Actually, to perform unsupervised deconvolution, I do not have to worry about batch effect since ICA should capture a batch in one component. But I have to pay attention to this to perform pathway quantification.
In my dataset, I have a strong batch effect since 2/3 of the samples were captured using the Kapa kit and the ones which failed were captured using the Takara kit.
I used DESeq2 in several studies some years ago to perform DEA and data visualization ago but, I did not have to remove a batch. I would appreciate advice to get an expression matrix of the normalized counts, corrected for this batch effect. I discussed already the input needed for decoupleR here. It seems possible to use normalized, log normalized or vst counts.
My first idea was to get corrected log normalized counts but it does not seem possible. If I am correct, it is impossible to simply put the batch in the design of the model to do that, since the normalized counts are not affected by the model.
Then, I saw the following solution: after normalization and dispersion estimation, use vsd and limma::removeBatchEffect
. I read that it should be better to use vst(dds, blind = FALSE)
.
Using blind=FALSE
, the sample information provided in the design formula are used.
I wanted first to provide only the batch information in the design formula. Nevertheless, to explain the counts, it seems better to provide the main clinical information (gender, histology, grade, age, ...) and batch.
My question is the following: if I introduce all these variables, can I somehow introduce a bias / a signal? Is it the best strategy to perform an unsupervised pathway analysis afterwards?
I am not sure my concern is very clear, but the idea is that I want to provide input data to decoupleR on which there is no batch effect and no effect added artificially when computing vst.
If it is the way to go, could you recommend "how many" variables should be used in the model? I have at least 6 of them, but from the unsupervised deconvolution, I see that some have no effect. Should I use only the 2 - 3 - 4 with the highest effect (+ the batch)?
Thanks in advance for your feedback
jared.andrews07, thank you for your feedback anf advice! I have never tried ComBat-seq but have heard about. I will try, thanks. Out of curiosity, why would you recommend ComBat-seq over limma::removeBatchEffect here? Is it to apply a minimum set of changes on the data? (I guess vst transformation is not needed). Or is it for the batch correction by itself? Thanks in advance
Mostly this. It adjusts the raw counts, so you have more options downstream given you get integer counts back out. In terms of the actual correction, I've never used limma's function, so I can't really comment on that. I presume it probably works fine.
Ok, thanks a lot for your feedback.
Hello jared.andrews07
Sorry, I have 2 more questions for you. I am now testing ComBat-seq.
A. I wonder if, prior batch correction, it is useful to remove genes that are all 0 in one of the batches. Actually, this question holds for any batch correction method. In my experiment, ~40 samples were sequenced using the Takara kit and ~80 using the Kapa kit. ~5000 genes are 0 for all samples sequenced with the Takara kit and ~4000 genes are 0 for all samples sequenced with the Kapa kit. (There are only ~2000 genes that are 0 for all samples)
Is it usefull to perform this filtering prior to batch correction or do these genes actually help for the batch correction and should be kept in the matrix?
B. After removing the genes that are all 0 in one of the batches, I ran:
then, I want to use DESeq2 to perform normalization:
but I get the following error:
There is no NA in adjusted_data:
Do you know what could be the issue?
Thanks in advance for your help
I just solved my issue. As pointed out in other discussions, some gene expressions were very high (e12, even e16). For those you would have the same issue, I was more stringent on the genes filtered before applying Combat-Seq, requiring more reads to keep a gene and this solved the error, as the gene expressions decreased significantly.