I have very frequently seen in papers, lecture notes, etc. that RNA-Seq data can be modeled by a Poisson distribution and microarray data by a Gaussian distribution, and I never gave that much thought. But I recently realized that I don't really understand what it means. Let's say we have a 100 x 20K matrix of RNA-Seq counts where rows represent samples (say, lung cancer patients) and columns represent genes. Do we then assume that the set of 100 values in each column (gene) follows a Poisson distribution? Or do we assume that the set of 20K values in each row follows a Poisson distribution? Or does each gene-sample pair follow its own Poisson distribution with its own mean? If the last is true, then I don't see how we could compute the mean and variance of those 2 million different distributions, because we have only a single value from each of them.
Also, I have seen many papers where microarray data is modeled by a p-variate Gaussian distribution, where p is the number of genes, although it looks like microarray data is usually assumed to follow a univariate Gaussian. What is the reason behind the multivariate assumption? Does a multivariate Gaussian model the data more accurately?
As you can see, I am totally confused. Can someone explain this in the most intuitive and least technical way possible (I am not a statistician)?
So, relating this to my actual question, are you saying that each column (gene) in the example I gave follows a Gaussian distribution? Why not each row (sample)?
A sample x gene matrix represents the measured expression levels of the genes in each sample. Depending on how these results were obtained, each value can be seen as being generated by a Gaussian distribution, or even as the mean of such a distribution. In such cases, each row (sample) can be modeled by a multivariate Gaussian composed of the distributions of all the genes. There is no reason to assume that all the values in a row or a column are drawn from the same distribution. If you assume that each row (sample) can be modeled by just one Gaussian, then on average all genes in that sample would have the same expression level. If each column (gene) is modeled by one distribution, then each gene will have, on average, the same expression level in every sample, though that level can differ from gene to gene.
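To make the "one distribution per gene" picture concrete, here is a minimal simulation sketch, not taken from any particular paper. It assumes the genes are independent (a diagonal covariance matrix), which real data will not satisfy exactly, and all parameter values and names (`gene_mean`, `gene_sd`, the matrix size) are made up for illustration. Each row is one draw from a p-variate Gaussian whose mean vector has one entry per gene.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 100, 50  # small stand-in for the 100 x 20K matrix

# Hypothetical "true" parameters: every gene has its own mean and spread.
gene_mean = rng.uniform(2.0, 12.0, size=n_genes)
gene_sd = rng.uniform(0.2, 1.0, size=n_genes)

# Each row (sample) is one draw from a multivariate Gaussian with mean
# vector gene_mean and, by assumption here, independent genes.
X = rng.normal(loc=gene_mean, scale=gene_sd, size=(n_samples, n_genes))

# The samples act as replicates, so each gene's parameters are estimated
# from the 100 values in its column.
est_mean = X.mean(axis=0)
est_sd = X.std(axis=0, ddof=1)

# Largest error made when recovering a gene's mean from its column.
max_mean_error = np.abs(est_mean - gene_mean).max()

# Gene averages differ a lot from each other; sample averages barely do.
spread_gene_means = est_mean.std()
spread_sample_means = X.mean(axis=1).std()

print(f"largest error in estimated gene means: {max_mean_error:.3f}")
print(f"spread of per-gene means:   {spread_gene_means:.3f}")
print(f"spread of per-sample means: {spread_sample_means:.3f}")
```

Under these assumptions the column (gene) averages recover the per-gene means, because the 100 samples give 100 observations of each gene's distribution. The row averages are nearly identical across samples, which is why a single Gaussian per row would say very little about individual genes.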