Question

Differential gene expession analysis in cell populations of mixed tumor and normal cells

5

Entering edit mode

10.1 years ago

Christian ★ 3.1k

I am wondering about the correct experimental design for differential gene expression analysis in the following atypical setup.

We have several patients from which we have three RNA-seq measurements each: one from pure tumor tissue (T), one from normal tissue (N), and one from a tissue that contains a (known) mixture of tumor and normal cells (M).

We are interested in genes that are differentially expressed between T and M, but only in the tumor cell fraction of M! Due to the contamination of M with normal cells, the naiive comparison of T with M yields too many DEGs, because many genes are differentially expressed between tumor cells in T and normal cells in M. I am therefore looking for a way to somehow "subtract" the known gene expression signature of N from M to get a clearer picture of the genes that are differentially expressed only in tumor cells.

Can this somehow be accomplished using a multi-factorial experimental design in DESeq2 or EdgeR? Or should I just get all three possible pairwise comparisons and do something like this:

DEG(T/M) = list of differentially expressed genes between tumor and mixture
DEG(T/N) = list of differentially expressed genes between tumor and normal
DEG(M/N) = list of differentially expressed genes between mixture and normal

DEG(T/M) due to differences in tumor cells = DEG(T/M) - intersect(DEG(T/N), DEG(M/N))

The logic here is that from all genes that are differentially expressed between T and M, we exclude those that also show up as differentially expressed in both comparisons with N, as these genes represent differences between tumor and normal cells, and not between the two tumor cell populations.

If find this a somewhat unsatisfying ad-hoc solution, so I am open for any suggestion.

RNA-Seq DESeq EdgeR deconvolution cell-type • 7.6k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Christian ★ 3.1k

1

Entering edit mode

This is a little after the fact, but did you consider trying to find a cell surface marker and just flow-sorting the cells from the different populations before sequencing? While finding a marker might be a pain, at least then the statistics become straight-forward.

For what it's worth, the "intersect and subtract" method has issues, but it sounds like you're aware of that (unfortunately a perfect solution isn't coming immediately to me...hopefully someone else will be familiar with one).

BTW, do you have any estimates of the admixture in the mixed samples? I imagine that'd be useful in more correctly calling DE genes.

ADD REPLY • link 10.1 years ago by Devon Ryan 105k

0

Entering edit mode

Actually we do have surface markers that we use to enrich for tumor cells with magnetic beads, but we cannot get rid of normal cells "sticking" on the tumor cells. So the max enrichment we see is 70-80%, sometimes lower down to 30%.

The admixture rate is known and could be used to model read counts as linear combination of tumor and normal cell fractions. But then I have no idea if such a simple linear model is justified and applicable over the huge dynamic range of gene expression values.

ADD REPLY • link 10.1 years ago by Christian ★ 3.1k

1

Entering edit mode

Just found a recent review (Oct 2013) that is a great place to start to dive into this topic:

Shen-Orr and Gaujoux (2013)
Computational Deconvolution: Extracting Cell Type-Specific Information from Heterogeneous Samples

Table 1 provides a great summary of available methods, some of which look promising for my problem at hand.

ADD REPLY • link 10.1 years ago by Christian ★ 3.1k

1

Entering edit mode

Keep in mind that deconvolution methods tend to work better with microarray data, it's one of the few areas where microarrays are still better. The reason is that signals from different cell-types/tissues will sum more linearly in microarrays than RNAseq, where the sum is highly non-linear. This allows for cleaner component separation (many methods rely on ICA or similar techniques, so I'm using that nomenclature here).

ADD REPLY • link 10.1 years ago by Devon Ryan 105k

0

Entering edit mode

What if we log-transform the RNA-seq data first, for example by using the regularized log transformation (rlog) or variant stabilizing transformation (VST) from DESeq2? Could we then somehow feed the transformed RNA-seq data into array-based deconvolution methods?

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Christian ★ 3.1k

1

Entering edit mode

There's no transformation that will solve that problem, since it affects the original data.

ADD REPLY • link 10.1 years ago by Devon Ryan 105k

0

Entering edit mode

The reason is that signals from different cell-types/tissues will sum more linearly in microarrays than RNAseq, where the sum is highly non-linear

Can you please point me toward a reference for this? I'm relatively new to the informatics scene.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.4 years ago by Ed Glass • 0

0

Entering edit mode

I don't have any references handy at the moment, you'll have to look them up. Any paper talking about signal separation will likely mention that the signals need to be independent for optimal performance, which they self-evidently aren't in RNAseq.

ADD REPLY • link 9.4 years ago by Devon Ryan 105k

1

Entering edit mode

I'm curious what you are referring to here. Do you mean the "signal" (counts) you get per gene within sample are not independent in RNA-seq but they are in microarray?

Is this "self-evident" because RNA-seq is inherently a competitive assay (since each gene is "competing" for read counts given the fixed depth of sequencing) vs. microarray which is not (aside from off-target effects to unintended probes)?

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by Steve Lianoglou 5.2k

0

Entering edit mode

Yup, exactly. And yeah, this ignores the cross-hybridization issue in microarrays.

ADD REPLY • link 9.2 years ago by Devon Ryan 105k

Ram · Answer 1 · 2014-11-25

3

Entering edit mode

10.1 years ago

Sean Davis 27k

You might take a look at this paper for some ideas:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4016702/

This method might work for your gene expression dataset as well. In any case, this might be a direction worth heading.

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Sean Davis 27k

1

Entering edit mode

Interesting, I hadn't seen this in the context of methylation. But there is quite a bit of literature on deconvolution in gene expression datasets derived from cellular mixtures. As one place to start, take a look at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3699332/ and the references therein.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Ahill ★ 2.0k