Question

Gene expression distribution (non-DE and DE)

0

6.6 years ago

CY ▴ 750

I know there has been a lot of discussion about this, but all sorts of distributions have been mentioned for both differentially expressed and non-differentially expressed genes. I have not found anything conclusive. Maybe this is still an unsettled issue. Here is what I think:

I have been under the impression that a non-differentially expressed gene follows a negative binomial distribution due to biological / technical variation. In a pool of cancer samples, over- or under-expressed genes would also follow a negative binomial distribution, perhaps with even larger variance. In a mixed pool of cancer and normal samples, a differentially expressed gene may follow a bimodal distribution, with the two modes representing normal expression and tumor expression respectively. Do I understand this correctly?

RNA-Seq differential expression • 1.5k views

ADD COMMENT • link updated 6.6 years ago by Kevin Blighe 89k • written 6.6 years ago by CY ▴ 750

score 1 · Answer 1 · 2018-10-16

1

6.6 years ago

Kevin Blighe 89k

I am not sure where you have read that (?). RNA-seq counts naturally follow a negative binomial distribution. During analysis (in DESeq2, at least), each gene's distribution is independently modeled as a negative binomial in a model that includes the dispersion estimate. Read more on the dispersion, here: Clarification on how DSEeq2 Dispersion Curve is Generated
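
To make the role of the dispersion concrete, here is a minimal sketch (not DESeq2 code) that draws negative binomial counts as a gamma-Poisson mixture and checks that the variance exceeds the mean by roughly alpha * mu^2. The values of `mu` and `alpha` and the helper names are assumptions chosen only for illustration.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's method (adequate for modest lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_negative_binomial(mu, alpha, rng):
    """Draw one NB count with mean mu and dispersion alpha (gamma-Poisson mixture)."""
    # Gamma with shape 1/alpha and scale mu*alpha has mean mu and variance alpha*mu^2.
    lam = rng.gammavariate(1.0 / alpha, mu * alpha)
    return sample_poisson(lam, rng)


rng = random.Random(1)
mu, alpha = 20.0, 0.5  # hypothetical mean and dispersion for one gene
counts = [sample_negative_binomial(mu, alpha, rng) for _ in range(20000)]

sample_mean = statistics.fmean(counts)
sample_var = statistics.variance(counts)
expected_var = mu + alpha * mu ** 2  # NB variance: Poisson part plus overdispersion

print(f"mean={sample_mean:.2f} variance={sample_var:.2f} expected variance={expected_var:.2f}")
```

With `alpha` close to zero the counts become nearly Poisson (variance about equal to the mean); larger `alpha` means more gene-level variability between replicates.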

Once the model is 'fit' for each gene, we can then derive a p-value for each gene via the Wald test applied to that gene's coefficient in its model. Read more:

Kevin

ADD COMMENT • link 6.6 years ago by Kevin Blighe 89k

0

Yes, the distribution in the normal sample pool follows a negative binomial distribution. But what about the distribution in the tumor sample pool? I imagine there will be no pattern to follow, since each tumor sample may have its own disturbed / distinct expression pattern, right?

ADD REPLY • link 6.6 years ago by CY ▴ 750

1

Well, if the gene is EGFR and we have EGFR-positive lung cancers, then the EGFR distribution will of course differ between the tumours and normals, and this would be reflected as a positive or negative coefficient from the fitted model, along with a statistically significant p-value from the Wald test. If you just look at tumours, you cannot really make any inference about EGFR without subdividing the tumour samples and again comparing the distributions.
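
As a rough illustration of the coefficient and Wald test mentioned above, the sketch below handles the simplest case: one gene, two groups, and a dispersion `alpha` that is assumed known. In that setting the fitted group means are the sample means and the standard error of each log mean is sqrt((1/mean + alpha) / n). This is a simplification, not what DESeq2 does internally (no size factors, no dispersion estimation or shrinkage), and the counts are made up for the example.

```python
import math


def wald_test_two_groups(counts_a, counts_b, alpha):
    """Wald test for the log fold change (b vs a) under an NB model with known dispersion.

    Returns (log2 fold change, z statistic, two-sided p-value).
    Both group means must be greater than zero.
    """
    n_a, n_b = len(counts_a), len(counts_b)
    mean_a = sum(counts_a) / n_a
    mean_b = sum(counts_b) / n_b

    # Coefficient: difference of log means (natural log scale).
    log_fc = math.log(mean_b / mean_a)

    # Variance of each log mean from the NB Fisher information.
    var_a = (1.0 / mean_a + alpha) / n_a
    var_b = (1.0 / mean_b + alpha) / n_b
    z = log_fc / math.sqrt(var_a + var_b)

    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return log_fc / math.log(2.0), z, p_value


# Hypothetical counts for one gene; not real data.
normal = [52, 61, 48, 70, 55, 64]
tumour = [410, 380, 520, 295, 610, 450]

log2_fc, z, p_value = wald_test_two_groups(normal, tumour, alpha=0.1)
print(f"log2FC={log2_fc:.2f} z={z:.2f} p={p_value:.3g}")
```

A positive log2 fold change here means higher expression in the tumours; comparing a group against itself gives z = 0 and p = 1.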

Of course, the negative binomial model fit is not optimal for all genes, but published work shows that it results in the fewest false positives. The other way would be to go gene-by-gene, check each gene's optimal distribution, and fit a model accordingly. Apart from being computationally intensive, this also introduces bias.

I'm not the best to answer these questions, though. Wolfgang Huber is the first name that comes to mind.

ADD REPLY • link 6.6 years ago by Kevin Blighe 89k

0

You have already clarified a lot. This is really helpful for me. Thanks a lot!

ADD REPLY • link 6.6 years ago by CY ▴ 750

0

No problem CY - thanks.

ADD REPLY • link 6.6 years ago by Kevin Blighe 89k