I have RNA-seq samples from several types of cancer, all from the same source, and RNA-seq control samples from a different source. I ran cell type enrichment analysis separately on the cancer data and on the control data to get an idea of the cell composition of each sample.
The cell type enrichment result for the cancer RNA-seq data is called cancer1, and for the control it's called control1.
The reason I'm applying batch effect correction is that I have two sources, and I don't want any variation that is not biological.
I'm having some trouble with batch effect correction: when and how should I apply it?
Should I apply limma::removeBatchEffect on cancer1 and control1 separately? Or should I combine them and only then run the correction?
Or is it better to run the correction before the cell type enrichment analysis, that is, on the RNA-seq data itself?
Your help will be much appreciated, thank you.
Based on your description (all cancers from source 1, all controls from source 2) you have what's called "perfect confounding", so if you remove the batch effect, you'll remove the average difference between cancer and control. It is very difficult to control for batch effects while retaining biological variability in this situation, even after you've collapsed the data down to cell type enrichments.
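To make the point above concrete, here is a minimal sketch in plain Python. It assumes made-up enrichment scores for a single cell type and uses per-batch mean centering as a simplified stand-in for batch correction; it is not what limma::removeBatchEffect does internally, and the names `samples`, `group_difference`, and `center_by_batch` are hypothetical.

```python
from statistics import mean

# Hypothetical enrichment scores for one cell type (made-up numbers).
# Every cancer sample is in batch "src1", every control in batch "src2".
samples = [
    ("cancer", "src1", 0.80),
    ("cancer", "src1", 0.70),
    ("cancer", "src1", 0.90),
    ("control", "src2", 0.30),
    ("control", "src2", 0.20),
    ("control", "src2", 0.40),
]

def group_difference(rows):
    """Mean score of cancer samples minus mean score of control samples."""
    cancer = [score for group, _, score in rows if group == "cancer"]
    control = [score for group, _, score in rows if group == "control"]
    return mean(cancer) - mean(control)

def center_by_batch(rows):
    """Subtract each batch's mean from its samples (simplified batch correction)."""
    batch_names = {batch for _, batch, _ in rows}
    batch_means = {
        name: mean([score for _, batch, score in rows if batch == name])
        for name in batch_names
    }
    return [(group, batch, score - batch_means[batch]) for group, batch, score in rows]

before = group_difference(samples)
after = group_difference(center_by_batch(samples))
print(f"cancer - control before centering: {before:.3f}")
print(f"cancer - control after centering:  {after:.3f}")
```

Because batch and group label the same split of the samples, centering each batch also centers each group, so the cancer versus control difference disappears along with the batch difference.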
Not very difficult, it’s impossible.
So you're saying I should not apply batch correction at all, right?
It's not that you should not, you cannot. Batch and biological effect are the same. If all apples are in box1 and all pears are in box2 and you remove box1, then all apples are gone, because the type of fruit (apple/pear) and the location (the box) are perfectly nested. Remove box1 and you remove the apples. The fruit would need to be mixed across the boxes to allow removal of a box (sorry for the pathetic comparison).
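The box analogy can be checked numerically. The sketch below is an illustration under assumptions: it builds a small design matrix with an intercept, a condition column, and a batch column, and computes its rank with exact fractions. The helper names `matrix_rank` and `design_matrix` and the four-sample layouts are hypothetical. When condition and batch are perfectly nested, the rank is lower than the number of columns, so the two effects cannot be estimated separately.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a small matrix by Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def design_matrix(conditions, batches):
    """Columns: intercept, is_cancer, is_not_reference_batch (two batches assumed)."""
    ref_batch = batches[0]
    return [
        [1, int(c == "cancer"), int(b != ref_batch)]
        for c, b in zip(conditions, batches)
    ]

conditions = ["cancer", "cancer", "control", "control"]

# All cancers in box1, all controls in box2: perfectly nested.
confounded = design_matrix(conditions, ["box1", "box1", "box2", "box2"])
# Each box holds one cancer and one control: condition and batch are crossed.
mixed = design_matrix(conditions, ["box1", "box2", "box1", "box2"])

print("columns:", len(confounded[0]))
print("rank, confounded design:", matrix_rank(confounded))
print("rank, mixed design:", matrix_rank(mixed))
```

A rank equal to the number of columns means the batch column carries information that the condition column does not, which is what a correction needs in order to leave the biology intact.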
ATpoint This reminds me of another question. Say I have many bulk RNA-seq cancer datasets, for example 5 melanoma datasets, 4 RCC datasets, and so on.
I perform cell type enrichment analysis for each dataset separately, then I combine the results into one big dataset.
Can I perform batch effect correction for cancer type? And should I perform batch effect correction for dataset?
I want to look for a general pattern of cells across all cancer samples, so I don't want the cancer type or the source (dataset) to influence the results.
You can only ensure this via experimental design; you can't do it in-silico. You can perform the analysis without correction, but until a follow-up validation experiment is performed with the appropriate design, you cannot assign any statistical confidence to the results.
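Before combining datasets, it can help to tabulate which cancer types each dataset contains. The following is a minimal sketch with a made-up sample sheet; the dataset names and the helpers `types_per_dataset` and `nested_datasets` are hypothetical. If every dataset holds a single cancer type, dataset is nested within cancer type, and removing dataset effects would also remove differences between cancer types.

```python
from collections import defaultdict

# Hypothetical sample sheet: one (dataset, cancer_type) pair per sample.
sample_sheet = [
    ("mel_ds1", "melanoma"), ("mel_ds1", "melanoma"),
    ("mel_ds2", "melanoma"),
    ("rcc_ds1", "RCC"), ("rcc_ds1", "RCC"),
    ("rcc_ds2", "RCC"),
]

def types_per_dataset(rows):
    """Map each dataset to the set of cancer types it contains."""
    seen = defaultdict(set)
    for dataset, cancer_type in rows:
        seen[dataset].add(cancer_type)
    return dict(seen)

def nested_datasets(rows):
    """Datasets that contain exactly one cancer type."""
    return sorted(
        dataset
        for dataset, types in types_per_dataset(rows).items()
        if len(types) == 1
    )

nested = nested_datasets(sample_sheet)
fully_nested = len(nested) == len(types_per_dataset(sample_sheet))

print("datasets with a single cancer type:", nested)
print("every dataset nested within one cancer type:", fully_nested)
```

This check only describes the design; it does not fix it. A fully nested layout is the situation the answer above refers to, where the separation has to come from the experimental design.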