I know there has been a lot of discussion about this, but all sorts of distributions have been mentioned for both differentially expressed and non-differentially expressed genes, and I have not found anything conclusive. Maybe this is still an unsettled issue. Here is what I think:
I have been under the impression that a gene that is not differentially expressed follows a negative binomial distribution because of biological / technical variation. In a pool of cancer samples, over- or under-expressed genes would also follow a negative binomial distribution, perhaps with an even larger variance. In a mixed pool of cancer and normal samples, a differentially expressed gene may follow a bimodal distribution, with the two modes representing normal expression and tumor expression respectively. Do I understand this correctly?
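One way to sanity-check this intuition is a small simulation. The sketch below is only an illustration under stated assumptions, not the method of any RNA-seq package: the means (20 and 120), the dispersion (0.1), the sample size, and the helper names `poisson`, `neg_binomial` and `mom_dispersion` are all made up for the example. It draws negative binomial counts as a gamma-Poisson mixture, so that Var = mu + alpha * mu^2, and compares a method-of-moments dispersion estimate for each pool and for the mixed pool.

```python
import math
import random
from collections import Counter
from statistics import mean, variance


def poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's algorithm (fine for modest means)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def neg_binomial(mu, alpha, rng):
    """Draw one NB count as a gamma-Poisson mixture: Var = mu + alpha * mu**2."""
    lam = rng.gammavariate(1.0 / alpha, alpha * mu)  # gamma with mean mu
    return poisson(lam, rng)


def mom_dispersion(counts):
    """Method-of-moments dispersion: solve Var = mu + alpha * mu**2 for alpha."""
    m = mean(counts)
    return (variance(counts) - m) / m ** 2


rng = random.Random(0)  # fixed seed so the run is reproducible
normal = [neg_binomial(20, 0.1, rng) for _ in range(500)]   # hypothetical normal pool
tumour = [neg_binomial(120, 0.1, rng) for _ in range(500)]  # hypothetical tumour pool
mixed = normal + tumour

for name, pool in [("normal", normal), ("tumour", tumour), ("mixed", mixed)]:
    print(f"{name:>6}: mean={mean(pool):7.1f}  dispersion={mom_dispersion(pool):.3f}")

# Crude text histogram of the mixed pool in bins of width 20
bins = Counter(20 * (c // 20) for c in mixed)
for start in sorted(bins):
    print(f"{start:4d}-{start + 19:<4d} {'#' * (bins[start] // 10)}")
```

In this toy setting each pool on its own is negative binomial, while the mixed pool is a mixture of two negative binomials. A single negative binomial fitted to the mixed pool would need a much larger apparent dispersion, which is why the group label is put into the model instead of fitting the pooled counts as one distribution.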
Yes, expression in the normal sample pool follows a negative binomial distribution. But what about the distribution in the tumor sample pool? I imagine there will be no pattern to follow, since each tumor sample may have its own disturbed / distinct expression pattern, right?
Well, if the gene is EGFR and we have EGFR-positive lung cancers, then the EGFR distribution will of course differ between the tumours and normals, and this would be reflected as a positive or negative coefficient from the fitted model, along with a statistically significant p-value from the Wald test. If you look only at tumours, you cannot really make any inference about EGFR without subdividing the tumour samples and again comparing the distributions.
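To make the coefficient and the Wald test concrete, here is a minimal sketch for a single gene and a two-group design. It is a simplification and not the procedure of any particular package: the function name `nb_wald_two_group` is hypothetical, the counts are made-up toy numbers, the dispersion is a crude method-of-moments estimate averaged over the two groups (real tools estimate it far more carefully), and the coefficient is on the natural-log scale. With a tumour/normal indicator and a log link, the fitted coefficient is the log ratio of the group means, and for a given dispersion alpha its approximate variance is (1/mu0 + alpha)/n0 + (1/mu1 + alpha)/n1.

```python
import math
from statistics import mean, variance


def nb_wald_two_group(normal, tumour):
    """Wald test for the tumour-vs-normal log fold change of one gene.

    Returns (beta, se, z, p) where beta = log(mean_tumour / mean_normal).
    """
    mu0, mu1 = mean(normal), mean(tumour)

    def mom_alpha(counts, mu):
        # Solve Var = mu + alpha * mu**2 for alpha; floor at 0 (Poisson limit)
        return max((variance(counts) - mu) / mu ** 2, 0.0)

    # Assumption: one shared dispersion, taken as the average of the two groups
    alpha = (mom_alpha(normal, mu0) + mom_alpha(tumour, mu1)) / 2

    beta = math.log(mu1 / mu0)  # model coefficient for the tumour indicator
    se = math.sqrt((1 / mu0 + alpha) / len(normal)
                   + (1 / mu1 + alpha) / len(tumour))
    z = beta / se                            # Wald statistic
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal p-value
    return beta, se, z, p


# Made-up counts for illustration only
normal_counts = [95, 110, 102, 88, 120, 105]
tumour_counts = [410, 380, 520, 290, 610, 450]

beta, se, z, p = nb_wald_two_group(normal_counts, tumour_counts)
print(f"beta={beta:.3f}  se={se:.3f}  z={z:.2f}  p={p:.3g}")
```

A positive coefficient means higher expression in tumours and a negative one means lower expression. The test only says something because there are two groups to compare, which is the point about needing to subdivide the tumour samples before anything can be inferred.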
Of course, the negative binomial model is not the optimal fit for every gene, but published work shows that it results in the fewest false positives. The alternative would be to go gene by gene, check each gene's optimal distribution, and fit a model accordingly. Apart from being computationally intensive, this also introduces bias.
I'm not the best to answer these questions, though. Wolfgang Huber is the first name that comes to mind.
You have already clarified a lot. Really helpful for me! Thanks a lot!
No problem CY - thanks.