Hi, please check my understanding of how the Poisson distribution is used when calling peaks in ChIP-seq and CLIP-seq. The number of times a base is sequenced is commonly modeled as following a Poisson distribution, much like the number of people entering a supermarket through a particular entrance in a given period of time. A Poisson distribution can be plotted as follows:
Here the mean is 7 (lambda = 7), marked by the red line. The green line marks the point where the cumulative probability reaches 0.975.
On a genome-wide scale, coverage = read length (nt) × total number of reads / genome length (nt), which is the average number of reads covering a base. We therefore use the coverage as the lambda (mean) of the Poisson distribution; let's say it is also 7 here. The x-axis is the number of reads covering a base, and the y-axis is the probability of observing that number of reads.
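To make the coverage calculation concrete, here is a minimal R sketch. The read length, read count and genome length are hypothetical values, picked only so that the coverage comes out at 7 as in the example above. The plot is a rough reconstruction of the kind of figure described, not the original one.

```r
# Hypothetical experiment (assumed numbers, chosen to give a coverage of 7).
read_length   <- 50      # nt
n_reads       <- 4.2e8   # total number of reads
genome_length <- 3e9     # nt

# Genome-wide coverage, used as the Poisson mean (lambda).
lambda <- read_length * n_reads / genome_length

# Probability of each read count from 0 to 25 under Poisson(lambda).
counts <- 0:25
prob   <- dpois(counts, lambda)

# Smallest read count whose cumulative probability is at least 0.975.
cutoff <- qpois(0.975, lambda)

# Red line: the mean. Green line: the 0.975 cutoff.
plot(counts, prob, type = "h",
     xlab = "Reads covering a base", ylab = "Probability")
abline(v = lambda, col = "red")
abline(v = cutoff, col = "green")

print(c(lambda = lambda, cutoff = cutoff))
```

With these numbers `lambda` is 7 and `cutoff` is 13.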
From this we can see that the cumulative probability reaches 0.975 at a read count of 13, i.e. P(X <= 13) >= 0.975. A read count of 14 or more in a real ChIP-seq/CLIP-seq experiment is therefore unlikely (probability below 0.025) if the reads follow this Poisson distribution. However, suppose that in the real experiment we observe 20 reads at some position. How do we explain this? The explanation would be enrichment caused by the chromatin-binding protein (in ChIP-seq): those reads are not randomly distributed. We can get a p-value for a read count of 20, i.e. P(X >= 20), by calling ppois(20-1, lambda=7, lower.tail=FALSE) in R.
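As a small check of the p-value call, the sketch below computes the upper tail in three equivalent ways. Note that the argument is spelled `lower.tail`, and that `ppois(q, ..., lower.tail = FALSE)` returns P(X > q), which is why 20 - 1 is passed to get P(X >= 20). The variable names are only for illustration.

```r
lambda   <- 7    # Poisson mean, taken from the genome-wide coverage
observed <- 20   # read count seen at the position of interest

# P(X >= 20): the upper tail starts above observed - 1.
p_value <- ppois(observed - 1, lambda = lambda, lower.tail = FALSE)

# The same quantity as a complement of the lower tail...
p_complement <- 1 - ppois(observed - 1, lambda = lambda)

# ...and as an explicit sum of point probabilities (terms beyond 200 are negligible).
p_sum <- sum(dpois(observed:200, lambda))

print(p_value)
```

The result is well below 0.025, so a count of 20 would be called significant under this simple Poisson background.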
Am I right? I want to make sure I haven't misunderstood.
To add to what Istvan says, the distribution of reads in a sequencing experiment is usually NOT Poisson, because of a number of biases (GC content, mappability, and others we don't fully understand). It's usually better approximated by a negative binomial distribution, which is essentially an overdispersed Poisson.
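To illustrate the effect of overdispersion, this sketch compares the tail probability of 20 or more reads under a Poisson and under a negative binomial with the same mean. The dispersion value `size = 2` is an arbitrary assumption for illustration; in practice it would have to be estimated from the data.

```r
lambda   <- 7    # same mean for both models
size     <- 2    # hypothetical dispersion parameter (smaller = more overdispersed)
observed <- 20

# P(X >= 20) under the Poisson model.
p_pois <- ppois(observed - 1, lambda = lambda, lower.tail = FALSE)

# P(X >= 20) under a negative binomial with mean lambda and dispersion size.
p_nb <- pnbinom(observed - 1, size = size, mu = lambda, lower.tail = FALSE)

# Negative binomial variance: mu + mu^2 / size (the Poisson variance is just mu).
var_nb <- lambda + lambda^2 / size

print(c(poisson = p_pois, neg_binomial = p_nb, nb_variance = var_nb))
```

With the same mean, the negative binomial tail is considerably heavier, so a count that looks extreme under the Poisson model can be much less surprising once overdispersion is accounted for.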