I am wondering about the correct experimental design for differential gene expression analysis in the following atypical setup.
We have several patients from which we have three RNA-seq measurements each: one from pure tumor tissue (T), one from normal tissue (N), and one from a tissue that contains a (known) mixture of tumor and normal cells (M).
We are interested in genes that are differentially expressed between T and M, but only in the tumor cell fraction of M! Due to the contamination of M with normal cells, the naiive comparison of T with M yields too many DEGs, because many genes are differentially expressed between tumor cells in T and normal cells in M. I am therefore looking for a way to somehow "subtract" the known gene expression signature of N from M to get a clearer picture of the genes that are differentially expressed only in tumor cells.
Can this somehow be accomplished using a multi-factorial experimental design in DESeq2 or EdgeR? Or should I just get all three possible pairwise comparisons and do something like this:
DEG(T/M) = list of differentially expressed genes between tumor and mixture
DEG(T/N) = list of differentially expressed genes between tumor and normal
DEG(M/N) = list of differentially expressed genes between mixture and normal
DEG(T/M) due to differences in tumor cells = DEG(T/M) - intersect(DEG(T/N), DEG(M/N))
The logic here is that from all genes that are differentially expressed between T and M, we exclude those that also show up as differentially expressed in both comparisons with N, as these genes represent differences between tumor and normal cells, and not between the two tumor cell populations.
If find this a somewhat unsatisfying ad-hoc solution, so I am open for any suggestion.
This is a little after the fact, but did you consider trying to find a cell surface marker and just flow-sorting the cells from the different populations before sequencing? While finding a marker might be a pain, at least then the statistics become straight-forward.
For what it's worth, the "intersect and subtract" method has issues, but it sounds like you're aware of that (unfortunately a perfect solution isn't coming immediately to me...hopefully someone else will be familiar with one).
BTW, do you have any estimates of the admixture in the mixed samples? I imagine that'd be useful in more correctly calling DE genes.
Actually we do have surface markers that we use to enrich for tumor cells with magnetic beads, but we cannot get rid of normal cells "sticking" on the tumor cells. So the max enrichment we see is 70-80%, sometimes lower down to 30%.
The admixture rate is known and could be used to model read counts as linear combination of tumor and normal cell fractions. But then I have no idea if such a simple linear model is justified and applicable over the huge dynamic range of gene expression values.
Just found a recent review (Oct 2013) that is a great place to start to dive into this topic:
Shen-Orr and Gaujoux (2013)
Computational Deconvolution: Extracting Cell Type-Specific Information from Heterogeneous Samples
Table 1 provides a great summary of available methods, some of which look promising for my problem at hand.
Keep in mind that deconvolution methods tend to work better with microarray data, it's one of the few areas where microarrays are still better. The reason is that signals from different cell-types/tissues will sum more linearly in microarrays than RNAseq, where the sum is highly non-linear. This allows for cleaner component separation (many methods rely on ICA or similar techniques, so I'm using that nomenclature here).
What if we log-transform the RNA-seq data first, for example by using the regularized log transformation (rlog) or variant stabilizing transformation (VST) from DESeq2? Could we then somehow feed the transformed RNA-seq data into array-based deconvolution methods?
There's no transformation that will solve that problem, since it affects the original data.
Can you please point me toward a reference for this? I'm relatively new to the informatics scene.
I don't have any references handy at the moment, you'll have to look them up. Any paper talking about signal separation will likely mention that the signals need to be independent for optimal performance, which they self-evidently aren't in RNAseq.
I'm curious what you are referring to here. Do you mean the "signal" (counts) you get per gene within sample are not independent in RNA-seq but they are in microarray?
Is this "self-evident" because RNA-seq is inherently a competitive assay (since each gene is "competing" for read counts given the fixed depth of sequencing) vs. microarray which is not (aside from off-target effects to unintended probes)?
Yup, exactly. And yeah, this ignores the cross-hybridization issue in microarrays.