I tried testing MuSiC (v0.4) but got substantially more significantly mutated genes than expected. For my own pan-cancer dataset it returned thousands of genes, and for a test set of published ovarian somatic mutations it returned ~350. I used the default settings for these runs, using calc-wig-covg, calc-bmr, and smg (so I didn't need the BAM files). I obtained the ovarian MAF file from Synapse (https://www.synapse.org/#!Synapse:syn1729383 ), coverage wig files from Firehose (recommended on this post), and the recommended ROI file (here). Is there anything I'm missing, or are there parameter tweaks or changes that would make MuSiC report a more reasonable number of significantly mutated genes?
I saw a couple of parameters that might be helpful. One was the --bmr-groups option in genome music bmr calc-bmr, which appears to group samples into a certain number of similarly mutated groups. Is there a recommended way to set the number of BMR groups? Another was the --bmr-modifier-file option in genome music smg, which acts as a multiplication factor for the background mutation rate of certain genes. Is there a standard/recommended BMR modifier file?
Thank you for your informative response. Yes, my pan-cancer hypermutator filter was 1000 mutations, which is higher than your 200. Actually, the one parameter I did change was max fdr, to 0.1 (a common choice in cancer sequencing studies), but for the reasons above it looks like the number of expected false positives implied by the estimated FDR (false discovery rate) should be taken with a grain of salt. I did filter out variants with read mappability warnings, but didn't filter on allele frequency. Do you have an intuition for how often actual germline variants pollute the called somatic variants in published studies? Unfortunately, since I'm doing a comparative analysis of methods, it only makes sense to move the FDR threshold together for all methods and evaluate the same set of pan-cancer mutations (as a whole, instead of broken up). I'm well aware that mutation rate estimates are fickle things (even in human evolution). In some respects, accurately understanding the uncertainty of mutation rate estimates is perhaps almost more important than getting a single good point estimate.
Correct. The FDR can't be used in the traditional sense, because the regional differences in BMRs add too much noise in the range/rank of per-gene p-values... or something like that.
ExAC is fairly recent, so a lot of previously generated somatic mutation lists, like those from TCGA/ICGC, did not have a decent panel of normals for germline filtering. Any kind of uneven coverage or allele-specific amplification bias can make a germline variant look somatic.
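To make the germline-filtering idea concrete, here is a rough sketch of dropping MAF rows whose population allele frequency is above a cutoff. It is only an illustration: the `ExAC_AF` column name, the 0.001 cutoff, and the `filter_common_germline` helper are assumptions, not something MuSiC provides, and your MAF may carry the frequency in a different column or not at all.

```python
import csv
import io


def filter_common_germline(maf_rows, af_column="ExAC_AF", max_af=0.001):
    """Keep rows whose population allele frequency is missing or <= max_af.

    af_column and max_af are illustrative assumptions; adjust them to
    whatever annotation your MAF actually carries.
    """
    kept = []
    for row in maf_rows:
        raw = (row.get(af_column) or "").strip()
        try:
            af = float(raw)
        except ValueError:
            af = None  # no population frequency recorded for this variant
        if af is None or af <= max_af:
            kept.append(row)
    return kept


# Tiny in-memory MAF-like table (made-up rows, for illustration only).
EXAMPLE_MAF = (
    "Hugo_Symbol\tTumor_Sample_Barcode\tExAC_AF\n"
    "TP53\tSAMPLE_1\t\n"
    "GENE_A\tSAMPLE_1\t0.25\n"
    "BRCA1\tSAMPLE_2\t0.00001\n"
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE_MAF), delimiter="\t"))
kept = filter_common_germline(rows)
kept_genes = [row["Hugo_Symbol"] for row in kept]
```

Variants with no recorded frequency are kept here on purpose, since absence from a population database is what you would expect for a true somatic call.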
From the MuSiC paper, it seems the convolution test is the preferred one, but an SMG is called when it is significant in at least 2 of 3 tests. Do you think using solely the most conservative p-value method (typically FCPT) might be reasonable in my scenario? I also plan to do some parameter testing on the bmr groups.
Yeah, you could try only FCPT, but it has horrible sensitivity. If you're comparing methods... then you can report results separately for "MuSiC FCPT", "MuSiC CT", "MuSiC 2of3"... something like that.
I asked because FCPT on the ovarian test data (where the Synapse entry seems to basically match the suggested mutation filters) reports only ~50 genes, down from ~350 for 2 of 3. The convolution test and LRT reported 355 and 495, respectively. So since both LRT and CT seem to be driving up the number of significant genes, the 2-of-3 scenario still reports many. I'm not against reporting "2of3" or "CT" etc.; I'm just trying to see if I can get the "best" parameterization that is consistent across all comparison evaluations.
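For reference, the per-test versus 2-of-3 comparison discussed above can be sketched as follows. This is a minimal illustration, assuming the per-gene FDR values for the three tests have already been parsed into a dict; the `call_smgs` helper, the gene names, and the numbers are made up, and this is not MuSiC's own code.

```python
TESTS = ("FCPT", "LRT", "CT")


def call_smgs(fdr_by_gene, max_fdr=0.1):
    """Return the set of genes called by each test and by the 2-of-3 rule.

    fdr_by_gene maps gene -> {"FCPT": fdr, "LRT": fdr, "CT": fdr}.
    """
    calls = {name: set() for name in TESTS}
    calls["2of3"] = set()
    for gene, fdrs in fdr_by_gene.items():
        passed = [name for name in TESTS if fdrs[name] <= max_fdr]
        for name in passed:
            calls[name].add(gene)
        if len(passed) >= 2:  # significant in at least 2 of the 3 tests
            calls["2of3"].add(gene)
    return calls


# Made-up FDR values, for illustration only.
toy_fdrs = {
    "GENE_A": {"FCPT": 0.01, "LRT": 0.001, "CT": 0.002},
    "GENE_B": {"FCPT": 0.40, "LRT": 0.03, "CT": 0.05},
    "GENE_C": {"FCPT": 0.90, "LRT": 0.08, "CT": 0.30},
}

calls = call_smgs(toy_fdrs)
for name in ("FCPT", "LRT", "CT", "2of3"):
    print(name, sorted(calls[name]))
```

In this toy case GENE_B fails FCPT but still gets called under 2 of 3 because LRT and CT agree, which is the same pattern as the ovarian numbers: whenever the two more permissive tests overlap heavily, the 2-of-3 list follows them rather than FCPT.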