Question

RNA-Seq data distribution


8.5 years ago

ebrudermanver ▴ 100

In the papers I read, it is usually claimed that since RNA-Seq is count data, a Poisson or negative binomial distribution is the most suitable way to model it. However, as a computational biologist, I have yet to see an RNA-Seq dataset composed of integers. Every dataset I have seen contains decimals, probably because a standard normalization step, which is crucial, has been applied to the raw read counts. This normalization usually adjusts for sequencing depth and also for overdispersion. So my question is: how can we model those decimal numbers with a Poisson or negative binomial distribution? As I said, I have never seen processed (or normalized) RNA-Seq data that contain integers. What am I missing?

RNA-Seq distribution • 4.1k views

updated 5.3 years ago by Biostar 20 • written 8.5 years ago by ebrudermanver ▴ 100

score 1 · Answer 1 · 2016-10-10

Most software packages (DESeq2, edgeR) model the raw counts, not counts that have been normalized by dividing out the size factors; the size factors enter the model as a scaling of the mean instead. Raw counts are what the NB model assumes, and you break the NB mean-variance relationship if you give it normalized counts as input. FPKM values are decimals, so you cannot use them directly with any discrete model. My suggestion is to get the raw counts and fit an NB model, or to switch to limma with eBayes(trend=TRUE) on (log-transformed) FPKM.
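To make the point about the mean-variance relationship concrete, here is a minimal sketch in plain Python. It is not taken from DESeq2 or edgeR; it only assumes the usual NB parameterization, variance = mean + dispersion * mean^2, and the function names and numbers (`q`, `alpha`, the size factors) are made up for illustration. It computes the mean and variance of a count after dividing by its size factor and compares that to what an NB with the same mean would have.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial with the given mean and dispersion."""
    return mean + dispersion * mean ** 2


def normalized_moments(q, size_factor, dispersion):
    """Mean and variance of K / s when K ~ NB(mean = s * q, dispersion).

    q is the (hypothetical) expression level per unit of sequencing depth,
    s is the sample's size factor.
    """
    raw_mean = size_factor * q
    raw_var = nb_variance(raw_mean, dispersion)
    # Dividing a random variable by s divides its mean by s and its variance by s^2.
    return raw_mean / size_factor, raw_var / size_factor ** 2


if __name__ == "__main__":
    q = 100.0      # illustrative expression level (assumption)
    alpha = 0.25   # illustrative dispersion (assumption)
    for s in (0.5, 1.0, 2.0):
        mean, var = normalized_moments(q, s, alpha)
        print(f"size factor {s}: normalized mean = {mean}, "
              f"normalized variance = {var}, "
              f"NB variance at that mean = {nb_variance(mean, alpha)}")
```

All three samples end up with the same normalized mean, but their variances differ (it works out to q / s + alpha * q^2), so a single NB mean-variance curve no longer describes the normalized values unless every size factor is 1. Keeping the raw counts and letting the size factor scale the mean avoids this.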