Hello All,
This is another question about RNA-Seq data normalization. I have often read papers that use ERCC spike-ins as controls for identifying experimental bias that may arise from RNA species length and concentration. Then I came across this paper: Revisiting Global Gene Expression Analysis, where they talk about "Transcriptional Amplification".
The key message is summarized in this figure.
They have demonstrated, using cell lines, that with the usual RNA-Seq experimental design and normalization we may not detect differentially expressed genes effectively (?) in cases where there is transcriptional amplification. The proposed solution is to add ERCC spike-in standards in proportion to cell number and then normalize to them.
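To make the issue concrete, here is a minimal sketch of the idea as I understand it, not the paper's actual pipeline. The numbers are made up for illustration: three genes, a uniform 2x amplification in the test cells, and a hypothetical fixed amount of spike-in added per cell. The function names are my own and do not come from any library.

```python
def simulate_reads(molecules_per_cell, spike_per_cell, depth):
    """Expected reads for a library sequenced to a fixed depth.

    Reads are shared out in proportion to molecule abundance, so the
    absolute RNA content per cell is lost in the raw counts.
    """
    total = sum(molecules_per_cell) + spike_per_cell
    gene_reads = [depth * m / total for m in molecules_per_cell]
    spike_reads = depth * spike_per_cell / total
    return gene_reads, spike_reads


def library_size_normalize(gene_reads):
    """Conventional approach: scale each gene by total endogenous reads."""
    total = sum(gene_reads)
    return [r / total for r in gene_reads]


def spike_in_normalize(gene_reads, spike_reads):
    """Scale each gene by the spike-in reads (spike-in added per cell)."""
    return [r / spike_reads for r in gene_reads]


# Hypothetical molecule counts per cell (illustrative values only).
control = [100.0, 200.0, 300.0]
amplified = [2.0 * m for m in control]  # every gene doubled
SPIKE = 50.0       # assumed spike-in molecules added per cell
DEPTH = 1_000_000  # same sequencing depth for both libraries

ctrl_reads, ctrl_spike = simulate_reads(control, SPIKE, DEPTH)
amp_reads, amp_spike = simulate_reads(amplified, SPIKE, DEPTH)

# Fold change (amplified / control) under each normalization.
fc_library = [
    a / c
    for a, c in zip(library_size_normalize(amp_reads),
                    library_size_normalize(ctrl_reads))
]
fc_spike = [
    a / c
    for a, c in zip(spike_in_normalize(amp_reads, amp_spike),
                    spike_in_normalize(ctrl_reads, ctrl_spike))
]

print("library-size fold changes:", fc_library)  # all close to 1
print("spike-in fold changes:   ", fc_spike)     # all close to 2
```

With library-size normalization the doubling disappears, because every gene's share of the library is unchanged. The spike-in version recovers it, but only because the spike-in amount is tied to cell number, which is exactly what we cannot do for tissue.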
I am wondering how we should handle such a scenario when we perform the experiment on tissue samples, where we cannot determine the number of cells. We start with the same quantity of total RNA for library preparation and do not account for spatial gene expression patterns or transcriptional amplification.
Are there any controls or data handling procedures already in use? Any new strategy would be nice to discuss.
Maybe we can use ERCC spike-ins to normalize for tissues as well, but how?
I think this paper raises some very important issues. It has led us to encourage all collaborators to include ERCC spike-ins by default.
However, your question is very pertinent. I don't think there is currently any way to do this for tissue samples. The authors of the paper suggest using DNA quantification as a surrogate for cell number, but I'm not sure how that would work in practice.
I feel like this experimental design is, in a way, trying to answer two separate questions with one approach.
Usually, in a standard differential expression experiment where transcriptional amplification is not considered, you are trying to find a subset of genes that are differentially expressed, indicating e.g. activation of a certain pathway. We will call that Question 1.
Using spike-ins like this tells you whether there is transcriptional amplification. We will call that Question 2.
It seems to me that a side effect of answering Question 2 like this is that you lose some information about Question 1. In at least the simple schema of the figure, all of the genes are going to be called differentially expressed because there is universal amplification.
But to get back to Question 1, I believe you would still have to do a second normalization of the data using a more conventional approach that brings the two conditions to the same level. The assumption is that you would still expect to see a proportionally larger increase in certain genes if certain pathways were activated in the test condition, even if the cells themselves were bigger and had more RNA in them.
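A rough sketch of what I mean by a second normalization, under a loud assumption: most genes are not specifically regulated, so the median log fold change of the spike-in-normalized data estimates the global amplification (Question 2), and whatever is left after subtracting it is the gene-specific change (Question 1). The function name and the input values are hypothetical, for illustration only.

```python
import math
from statistics import median


def two_step_log_fold_changes(control, test):
    """Split spike-in-normalized fold changes into global and relative parts.

    Step 1 (assumed already done): `control` and `test` hold
    spike-in-normalized expression values, one entry per gene.
    Step 2: treat the median log2 fold change as the global shift
    (assumes most genes are not specifically regulated) and subtract it.
    """
    log_fc = [math.log2(t / c) for c, t in zip(control, test)]
    global_shift = median(log_fc)
    relative = [lfc - global_shift for lfc in log_fc]
    return global_shift, relative


# Hypothetical spike-in-normalized values for five genes.
control = [2.0, 4.0, 6.0, 8.0, 10.0]
# Every gene doubled, and the third gene a further 3x on top of that.
test = [4.0, 8.0, 36.0, 16.0, 20.0]

global_shift, relative = two_step_log_fold_changes(control, test)

print("global log2 shift (Question 2):", global_shift)
print("relative log2 changes (Question 1):", relative)
```

In this toy case the global shift comes out as a doubling, and only the third gene keeps a nonzero relative change. If a large fraction of genes were specifically regulated, the median would no longer be a fair estimate of the global shift, so this is only a starting point.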
I don't know a better solution, though.
Certainly, in an RNA-Seq experiment of case vs control, we would like to capture both Q1 and Q2. If we conveniently ignore Q1 (or Q2), how relevant and accurate is our final set of genes to our experimental objective?
I agree that we would have to perform a two-step normalization as you propose, but I wonder how this would transform the data...