Question

Statistical test for finding driver mutations.

8.2 years ago

Gene_MMP8 ▴ 240

One approach to identifying driver mutations, i.e. those that drive cancer progression, is to look for recurrent mutations. The BMR (background mutation rate) gives the frequency of finding a mutation by chance (Pg) at a specific location in the genome. Usually the observed mutation frequency is below the BMR, so those mutations are not considered important. Only if the observed frequency is greater than the BMR can we conclude that they are driver mutations. Now suppose an experiment is run where we take a sample of, say, 25 patients and find that the frequency of mutations here (P) is greater than that expected by chance, i.e., P > Pg. How do I build the hypothesis test to confirm or reject my findings? (Population size = N.)

One way of doing so is the following:
Null hypothesis (H0): P < Pg

Alternative hypothesis (H1): P > Pg

Let X be a binomial random variable, X ~ B(N, Pg), as there are N independent trials, each with probability of success Pg.
Now, P(X >= 25) gives the p-value and can be calculated from the binomial table. If the value is less than a significance level (say, 0.05), we can conclude that the recurrent mutations observed are not due to chance and are driver mutations, and vice versa.
Am I correct in stating all of this?

sequencing gene statistics • 2.9k views

updated 8.2 years ago by Collin ▴ 1000 • written 8.2 years ago by Gene_MMP8 ▴ 240

score 1 · Answer 1 · 2017-04-01

I'm curious, are you trying to implement a statistical test yourself on actual data? It's better to use an already established method (e.g. 20/20+, MutSigCV, OncodriveFML). If you accurately calculate the background mutation rate, you should see roughly half of the genes on either side of the BMR, because most genes are passenger genes for cancer. However, if you have only 25 samples, then based on typical mutation rates for cancer, most genes won't even have a somatic mutation (ballpark of ~100 somatic mutations per sample, although cancers such as melanoma will have many more).
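To put a rough number on "most genes won't even have a somatic mutation", here is a back-of-the-envelope sketch. The gene count of 20,000 is my own assumed round figure, and spreading mutations uniformly across genes ignores gene length and regional rate differences, so treat the result as an order-of-magnitude illustration only.

```python
import math

# Illustrative inputs, not real data.
n_samples = 25               # cohort size from the question
mutations_per_sample = 100   # ballpark quoted above
n_genes = 20_000             # ASSUMPTION: rough number of protein-coding genes

# Expected number of mutations landing in one gene across the whole cohort,
# ASSUMING mutations fall uniformly across genes.
expected_per_gene = n_samples * mutations_per_sample / n_genes

# Poisson approximation: probability that a given gene has no mutation
# in any of the samples.
p_unmutated = math.exp(-expected_per_gene)

print(f"expected mutations per gene: {expected_per_gene:.3f}")
print(f"approx. fraction of genes with no mutation: {p_unmutated:.2f}")
```

Under these assumptions the large majority of genes have zero mutations in the cohort, which is why a per-gene recurrence test has very little to work with at this sample size.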

You have a few problems with your setup. First, the null hypothesis would be P = Pg. The second, larger issue is how you estimate Pg, and the assumption you are making when using a binomial model. The binomial model assumes there is a constant background mutation rate. In reality, the background mutation rate varies at several levels: between different patients' cancers, between different locations in the genome (and consequently different genes), with the length of the gene, and between different nucleotide contexts (e.g. C->T at a CpG site). All of these factors have led to problems in statistical tests based on a binomial model (PMID: 27911828, 23770567). If you do not model these factors, you would be substantially better off using a beta-binomial model, which accounts for over-dispersion (https://en.wikipedia.org/wiki/Overdispersion). Also, in your example setup, N = 25 does not mean you would calculate the p-value as P(X >= 25), but rather as P(X >= x), where x is the number of patients who have a (typically non-silent) mutation in the gene. Third, given that you are testing many genes, you would not use a nominal significance level of p = 0.05, but rather something like the False Discovery Rate to control errors (https://en.wikipedia.org/wiki/False_discovery_rate).
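To make the points above concrete, here is a minimal standard-library sketch, not a replacement for the established tools. It computes the upper-tail p-value P(X >= x) under a binomial model and under a beta-binomial model with the same mean, then applies a Benjamini-Hochberg adjustment across genes. All numbers (pg, x, the concentration parameter, the example p-values) and all function names are hypothetical and chosen purely for illustration.

```python
import math

def binom_upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(x, n + 1))

def log_beta(a, b):
    """Log of the Beta function, via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def betabinom_upper_tail(x, n, a, b):
    """P(X >= x) for X ~ BetaBinomial(n, a, b)."""
    return sum(math.comb(n, k)
               * math.exp(log_beta(k + a, n - k + b) - log_beta(a, b))
               for k in range(x, n + 1))

def benjamini_hochberg(pvalues):
    """BH-adjusted p-values, returned in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values are monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical example for one gene.
n = 25      # number of patients
x = 3       # patients with a non-silent mutation in the gene (hypothetical)
pg = 0.02   # ASSUMED per-patient chance of a background mutation in the gene

# Beta-binomial with the same mean pg; a smaller concentration means more
# patient-to-patient variability (over-dispersion). The value is ASSUMED.
concentration = 10.0
a, b = pg * concentration, (1 - pg) * concentration

p_binom = binom_upper_tail(x, n, pg)
p_betabinom = betabinom_upper_tail(x, n, a, b)
print(f"binomial      P(X >= {x}) = {p_binom:.4f}")
print(f"beta-binomial P(X >= {x}) = {p_betabinom:.4f}")

# Hypothetical p-values for four genes, adjusted for multiple testing.
gene_pvalues = [0.01, 0.04, 0.03, 0.005]
gene_qvalues = benjamini_hochberg(gene_pvalues)
print(gene_qvalues)
```

With these made-up inputs the beta-binomial tail probability is larger than the binomial one, which illustrates how ignoring over-dispersion makes recurrence look more significant than it is. In practice the beta-binomial parameters would have to be estimated from data rather than picked by hand.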

I would also like to point out that testing against a background mutation rate is not the only way to identify driver genes statistically. I don't have space to go into it in depth here, but alternative approaches have generally performed better because of the above-mentioned problem of multiple levels of variability in the background mutation rate (PMID: 27911828). Some of these exploit clustering of mutations or an excess of high "functional impact" mutations.

Lastly, even a good statistical test usually relies on good-quality somatic mutation calls. If germline variants or mutation-calling artifacts contaminate your somatic mutation data, they could lead statistical tests to erroneously reject the null hypothesis.