Question

Identify Genes Harbouring More Mutations Than Expected And Their Significance

12.4 years ago

fo3c ▴ 450

Hello,

I am looking at mutations in exome data and would like to identify which genes harbour more mutations than would be expected given the average mutation rate of my cohort.

Currently, I model the number of mutations in each gene as a Poisson random variable with parameter lambda = average mutation rate per Mb * gene length in Mb. However, the expected number of mutations is very low, and the observations appear significant (p < 0.05) in all cases. E.g. I observe one mutation where I expect 0.019948561 mutations, for a p-value = 1.963461e-04.
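As a cross-check on the numbers above, here is a minimal Python sketch of the Poisson upper-tail calculation. The function name `poisson_upper_tail` is made up for illustration. For lambda = 0.019948561, P(X >= 1) comes out at about 0.0198, while the quoted 1.963461e-04 matches P(X >= 2). One possible explanation, which is only a guess, is a call like `ppois(1, lambda, lower.tail = FALSE)` in R, which returns P(X > 1) rather than P(X >= 1).

```python
import math

def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam) by summing the tail terms directly."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # First tail term P(X = k), computed in log space for numerical stability.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    # Keep adding P(X = i) until the next term no longer changes the sum.
    while term > 0.0 and term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i  # P(X = i) = P(X = i - 1) * lam / i
    return min(total, 1.0)

lam = 0.019948561  # expected mutations for the gene quoted in the question
print(poisson_upper_tail(1, lam))  # P(X >= 1), roughly 0.0198
print(poisson_upper_tail(2, lam))  # P(X >= 2), roughly 1.96e-04
```

Even with the corrected tail, a single mutation in a gene this size still gives p < 0.05 before any multiple-testing correction, so the underlying concern in the question remains.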

Is there a better way to do this? Should I improve the model, or is there a clever way to correct the p-values? In R, p.adjust results in a very small change to the p-values.
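For reference, a minimal Python sketch of the Benjamini-Hochberg adjustment, following the same formula as `p.adjust(p, method = "BH", n = ...)` in R. The names `bh_adjust` and `n_tests` are illustrative, and the gene count of 20000 is an assumed round number, not a figure from this thread. The point of `n_tests` is that if only genes with at least one mutation are passed in, the correction may undercount the number of tests actually performed.

```python
def bh_adjust(pvalues, n_tests=None):
    """Benjamini-Hochberg adjusted p-values.

    n_tests is the total number of hypotheses tested; it defaults to
    len(pvalues) and may be larger if some tested genes were left out.
    """
    m = len(pvalues) if n_tests is None else n_tests
    if m < len(pvalues):
        raise ValueError("n_tests must be at least len(pvalues)")
    # Indices of the p-values in ascending order.
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    adjusted = [1.0] * len(pvalues)
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(len(order), 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Three genes with one mutation each (p about 0.0198), corrected as if
# 20000 genes had been tested in total (assumed number).
print(bh_adjust([0.0198, 0.0198, 0.0198], n_tests=20000))
```

With only the three p-values and no `n_tests`, the adjusted values stay at 0.0198, which would look like "a very small change"; counting all tested genes pushes them to 1.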

mutation exome statistics p-value r • 3.3k views

updated 12.4 years ago by Malachi Griffith 20k • written 12.4 years ago by fo3c ▴ 450

score 2 · Answer 1 · 2013-03-07

The MuSiC tool and its accompanying paper, "MuSiC: identifying mutational significance in cancer genomes", include a section on the 'significantly mutated gene' (SMG) test. That discussion seems relevant to this question. From the paper:

We use the concept of “significantly mutated genes” (SMG) to describe genes that show a significantly higher mutation rate than the background mutation rate (BMR) when multiple mutational mechanisms (coding indel and single nucleotide substitution, splice site mutation, etc.) are considered. Specialized measurements of the BMR may also be considered; BMRs in MuSiC are optionally calculated across the entire sample set, across particular subgroups of similarly mutated samples, or for each sample individually. For each BMR subgroup considered and for each category of mutational mechanism, the mutation rates are compared to the appropriate BMR, and a single P-value summarizing all considerations is generated for each gene. We refer to this summarization procedure as the significantly mutated gene (SMG) test.

We assessed multiple methods of calculating summarized P-values, including a convolution test (CT), a Fisher's combined P-value test (FCPT), and the likelihood ratio test (LRT), using a partially simulated data set (this data set and the associated test simulations are described in the Supplemental Material). By this approach, we determined that the P-value distribution obtained using the CT method most closely resembled the uniform distribution expected under the null (in this case, the null is such that no gene is truly significantly mutated), while the FCPT and LRT methods produced slightly inflated or deflated P-values, respectively (Supplemental Fig. S1). During the SMG test, a false discovery rate (FDR) also is calculated. We evaluate our SMG test results by establishing a P-value or FDR threshold (threshold typically 0.2 or less for FDR), and then appropriately filtering the test output.
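To make one of the summarization options in the quoted passage concrete, here is a minimal Python sketch of Fisher's combined P-value test (the FCPT mentioned above). This is not MuSiC's implementation, and the function name `fisher_combined_pvalue` and the example p-values are made up for illustration. It assumes the per-category p-values are independent; with discrete Poisson-type p-values the chi-square reference distribution is only approximate, which fits the paper's remark that FCPT was less well calibrated than the convolution test.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine k independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) is chi-square with 2k degrees of
    freedom under the null. For even degrees of freedom the upper tail has
    the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term = 1.0    # (X/2)^0 / 0!
    series = 1.0
    for i in range(1, k):
        term *= half_stat / i
        series += term
    return min(1.0, math.exp(-half_stat) * series)

# Hypothetical per-category p-values for one gene
# (e.g. substitutions, indels, splice-site mutations).
print(fisher_combined_pvalue([0.04, 0.20, 0.01]))
```

A single p-value passes through unchanged, which is a quick sanity check on the formula.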