We are analyzing cancer patient mutation data. We defined a set of regions on the human genome as binding events, and would like to show that some of these regions carry significantly more mutations than others. To do this, we decided to set a mutation-count threshold: a region must contain at least X mutations to be called a hotspot.
Parameters:
- We assume that mutation probability is constant within these regions.
- We have a total of 196 patients in this project.
- We have 4500 binding events (regions of interest).
- A total of 960 mutations were found in the proximity of these regions.
- 750 bp is the median length of all binding events (for this discussion, let's assume they are all 750 bp).
Attempt: To solve this problem, I thought that using the normal approximation to the binomial distribution could be useful.
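To make the attempt concrete, here is a minimal sketch of the threshold the normal approximation would give. It assumes the 960 mutations are pooled across patients and that each one is equally likely to fall in any of the 4500 equal-length regions, so a region's count is Binomial(n = 960, p = 1/4500). The significance level of 0.05 and the function name `normal_threshold` are my own assumptions for illustration, not part of the project.

```python
import math
from statistics import NormalDist

N_REGIONS = 4500    # binding events, from the parameters above
N_MUTATIONS = 960   # mutations found near the regions
ALPHA = 0.05        # assumed significance level (not given in the question)


def normal_threshold(n_mutations, n_regions, alpha):
    """Smallest integer count x with one-sided normal-approximation
    p-value <= alpha, i.e. x >= mean + z * sd (no continuity correction)."""
    p = 1.0 / n_regions                       # chance a mutation lands in one given region
    mean = n_mutations * p                    # expected count per region
    sd = math.sqrt(n_mutations * p * (1.0 - p))
    z = NormalDist().inv_cdf(1.0 - alpha)     # one-sided critical value
    return math.ceil(mean + z * sd)


if __name__ == "__main__":
    print("expected mutations per region:", N_MUTATIONS / N_REGIONS)
    print("threshold, uncorrected:", normal_threshold(N_MUTATIONS, N_REGIONS, ALPHA))
    print("threshold, Bonferroni:",
          normal_threshold(N_MUTATIONS, N_REGIONS, ALPHA / N_REGIONS))
```

One thing this makes visible: the expected count per region is only about 0.21, far below the usual rule of thumb (np of at least 5) for the normal approximation to be reasonable, so I am not sure how far these thresholds can be trusted.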
Questions:
1. I will test each binding region's mutation count, X, against Uo. As a final step, I will apply a multiple-hypothesis-testing correction across the total number of binding regions. Is this correct?
Referring to this question, the OP asked a very similar question. However, the OP mainly focused on a patient-wise analysis, which I don't think is relevant in my case. Therefore:
2. Would using a Poisson distribution be more accurate?
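For comparison, this is roughly what I imagine the Poisson version would look like. It is only a sketch under the same assumptions as before: mutations pooled across patients, all regions the same length, and a per-region rate of 960/4500. The 0.05 level, the Bonferroni correction, and the helper names `poisson_sf` and `poisson_threshold` are my own choices for illustration.

```python
import math

N_REGIONS = 4500
N_MUTATIONS = 960
ALPHA = 0.05  # assumed significance level (not given in the question)


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam), with integer k >= 0."""
    cdf_below_k = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf_below_k


def poisson_threshold(lam, alpha):
    """Smallest count k such that P(X >= k) <= alpha."""
    k = 0
    while poisson_sf(k, lam) > alpha:
        k += 1
    return k


if __name__ == "__main__":
    lam = N_MUTATIONS / N_REGIONS  # expected mutations per region under the null
    print("threshold, uncorrected:", poisson_threshold(lam, ALPHA))
    print("threshold, Bonferroni:", poisson_threshold(lam, ALPHA / N_REGIONS))
```

The Bonferroni line divides the significance level by the number of regions tested, which is what I meant by multiple hypothesis testing in question 1.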
I am very confused, so any guidance would be very helpful.