I have learned that RNA-seq data follows a binomial / negative binomial distribution. I am not a statistician, but I have studied the basics of statistics online: statistical terms, probability, distributions and statistical tests. The material available online uses examples such as coin flipping, playing cards and dice throwing, which helped me understand the (basic) statistics behind them. But when I come to RNA-seq data, I am not able to relate those examples to it.
Can anyone explain (or point me to a link explaining) the distribution of RNA-seq data (e.g. binomial / negative binomial) and the statistical tests (e.g. t-test), using an example of RNA-seq count/FPKM data where the input parameters are:
1. Number of genes in the organism
2. Number of reads mapped to these genes
Thanks in advance :)
I don't think you will find a derivation for why the negative binomial is used for RNA-Seq in the same way that, for example, the binomial distribution is used to model card games or the Poisson is used to model the number of customers per hour. In real life, the number of reads counted for any gene tends to vary between individuals more than the Poisson distribution (which is what is usually used for count data) would predict. A Poisson distribution forces the variance to equal the mean, whereas the negative binomial has an extra dispersion parameter that lets the variance be larger than the mean. The negative binomial is used because it matches what is observed more accurately than the Poisson does. As frustrating as this sounds, it is still better than microarrays.
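To make the "varies more than Poisson" point concrete, here is a small simulation sketch for a single gene. It assumes one common way of describing the negative binomial, as a Poisson whose rate itself varies between samples following a gamma distribution, which gives variance = mean + dispersion × mean². The mean of 50 reads and dispersion of 0.2 are made-up values for illustration, not estimates from real data, and the function names are hypothetical. This is not how DESeq or edgeR estimate dispersion; it only shows the difference in spread.

```python
import math
import random
import statistics


def poisson_sample(rng, lam):
    """Draw one Poisson(lam) count with Knuth's multiplication method.

    Adequate for the modest rates used here; not meant for very large lam.
    """
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def negbin_sample(rng, mean, dispersion):
    """Draw one negative binomial count as a gamma-Poisson mixture.

    The gene's underlying rate differs from sample to sample (gamma
    distributed, with expectation `mean`), and the read count is Poisson
    around that rate.
    """
    shape = 1.0 / dispersion
    scale = mean * dispersion
    rate = rng.gammavariate(shape, scale)
    return poisson_sample(rng, rate)


def simulate(n_samples=5000, mean=50.0, dispersion=0.2, seed=1):
    """Simulate read counts for one gene under both models (illustrative values)."""
    rng = random.Random(seed)
    pois = [poisson_sample(rng, mean) for _ in range(n_samples)]
    negbin = [negbin_sample(rng, mean, dispersion) for _ in range(n_samples)]
    return {
        "poisson_mean": statistics.mean(pois),
        "poisson_var": statistics.variance(pois),
        "negbin_mean": statistics.mean(negbin),
        "negbin_var": statistics.variance(negbin),
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.1f}")
```

Both sets of counts should have roughly the same mean, but the negative binomial counts should show a much larger variance (in theory 50 + 0.2 × 50² = 550, against 50 for the Poisson). That extra spread is what replicate RNA-Seq samples tend to show for real genes.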
Read the DESeq and edgeR papers. It is well explained in them.
Look at the 5th response (by Simon Anders) in this Seqanswers forum post.