Question

How to use Poisson distribution to get meaningful expression counts?

2

Entering edit mode

3.7 years ago

c_u ▴ 530

Hello,

I have a list of enhancers elements that I care about (about 70,000 in number, each exactly 3000 bp long in terms of genomic location). I also have total RNA-seq reads from an experiment (3 replicates, mouse), which I have mapped and gathered counts for these enhancer elements using featurecounts. Now, I want to know which of these enhancers have some meaningful expression level. For this, my plan is to get the basal expected expression level, and then if the expression level for an enhancer is greater than this, I would call it as being truly expressed. I have heard that Poisson distribution can be a good tool for such count-based modeling, but I am not sure how to go about it. Could anyone provide any help/ideas?

PS - I am completely OK with an approximate method for now, I am not trying to be too precise since it is a preliminary identification of expressed enhancers.

Thank you in advance!

statistics rna-seq • 2.4k views

ADD COMMENT • link updated 3.7 years ago by i.sudbery 21k • written 3.7 years ago by c_u ▴ 530

2

Entering edit mode

poisson could work in so far as it is count data, but I think most people wouldn't recommend such an approach because it would be anticonservative.

you might consider a permutation based approach that relies on your data ... or alternatively modeling with negative binomial (thought to be slightly better due to the ability to model over-dispersion) or even zero-inflated negative binomial.

In general I think a permutation based approach might have the fewest theoretical problems and be the easiest to implement..

ADD REPLY • link 3.7 years ago by LauferVA 4.6k

0

Entering edit mode

Hi Vincent, thank you so much for the response. I actually am OK with being somewhat liberal/anticonservative in my calculation, since enhancer RNA counts are generally quite few in total RNA seq data. In either case, my question is a bit more basic, which is to understand how to implement such a Poisson/permutation-based approach (also I am not quite sure what you mean by a permutation based approach). I am sure its quite easy to implement as you point out, and I do have some statistical background, but I just haven't ever done this kind of analysis (i.e. finding significantly expressed read counts based on a model). Could you briefly explain how this is done? Thank you so much!

ADD REPLY • link 3.7 years ago by c_u ▴ 530

1

Entering edit mode

Poisson distribution can be appropriate to model count data, but it assumes that the variance is equal to the mean for the expression of a given gene. It is almost never the case, especially with RNA-seq data for which there is typically over-dispersion (variance > mean). Hence, if you use Poisson to model RNA-seq, that would lead to many false positive calls of differential expression. Instead, we often use an extension of the Poisson distribution that can take into account the over-dispersion of the data: the negative binomial distribution. But no need to reinvent the wheel, there are great methods out there (DESeq2 or edgeR for instance) that implement that distribution to analyze RNA-seq data.

ADD REPLY • link 3.7 years ago by Carlo Yague 9.0k

0

Entering edit mode

Hi Carlo, thank you so much for your response. I am actually OK with an approach that may give some potentially false positives, since enhancer RNA counts are usually under-represented in total RNA seq data. I have worked with DESeq2 before, but in my understanding, its a) used to find differentially expressed transcripts (which would involve a control and an experimental sample), whereas I am currently only trying to find truly expressed transcripts (from, lets say, individual samples). And b) DESeq2 is a quite robust method, whereas I am looking for a more loose method for the reason I mentioned.

So, could you tell briefly as to how to implement Poisson/Negative Binomial to find expressed (i.e. above basal level) enhancers in this scenario? I am sure its quite trivial and I understand both these distributions, but its just that I haven't ever done this type of analysis. Thanks a lot!

ADD REPLY • link 3.7 years ago by c_u ▴ 530

1

Entering edit mode

By calculating a base level, do you mean you will take a random selection of intervals, and then use them to fit a distribution, and then test if you regions of interest have a significantly higher level than this?

ADD REPLY • link 3.7 years ago by i.sudbery 21k

0

Entering edit mode

That's intriguing, but honestly, my naive idea was to get the expected # of counts for an enhancer element (which would be a bin of 3000 bp) by dividing the total read counts by the number of bins. In this case the number of bins would be the [mouse genome length]/3000. Now this expected # of counts would be my lambda for Poisson distribution, and I can use it to get the probability of 1/2/3... counts for any bin. If the probability of a particular # of counts (greater than the expected number) is small (say <0.05), I can say that the associated enhancer element is expressed above basal level.

Do you think this makes sense or am I way off? I am completely OK with an approximate method for now, I am not trying to be too precise since it is a preliminary identification of expressed enhancers.

ADD REPLY • link 3.7 years ago by c_u ▴ 530

1

Entering edit mode

A random selection will not do, as you need to ensure that GC content and mappability match between each enhancer and its background. This is a non-trivial task, and requires a powerful test as for 70k enhancers you have a massive multiple testing burden. Not sure if a simple permutation test will have enough power unless you do a lot (hundreds of thousands) of permutations, and this for each enhancer. Just thinking aloud, I have no precise strategy in my mind right now, sorry.

ADD REPLY • link 3.7 years ago by ATpoint 87k

1

Entering edit mode

Yeah, you'd definitely need to correct for GC and mappability. You'd also want to exclude regions in known genes. You might think this is not enough regions to affect the average, but although only 2% of the genome is in protein coding exons, quite a large fraction of the genome is between the transcription start site, and transcription end side of a gene, and you'll definitely capture that if you are using total RNA-seq. If you don't do that, I guarantee 90% of your hits will be introns - introns are transcribed much more highly than enhancers - you might even find that enhancers are below the genome average read counts if you include introns.

My advice would be to suck it and see - exclude gene regions and known enhancers. Fit a Poisson model that is something like glm(read_counts~mappability + GC, family = "poisson", data=df) and do some diagnostics to see who well the model fits (look at the mean-variance relationship, the residuals etc). Then test on a positive control set of known enhancers.

ADD REPLY • link 3.7 years ago by i.sudbery 21k

0

Entering edit mode

Thanks Ian, I will try it out!

BTW, I think I can write a code to get the GC content of all the regions, but I am not sure how to get the mappability of each region. I would need both of them to fit the model using glm. Any ideas?

Edit - I searched a bit and found that UCSC provides mappability tracks for mouse. So I could use that. If there is a better way, feel free to respond