Hi,
I am trying to use MuSiC to analyse mutation rates in novel, non-coding genes. I am able to successfully run the relevant commands in MuSiC and the coverage statistics look correct, but the results show no mutations in any genes (which I know isn't true). My guess is that there is some formatting issue with the .maf file containing the somatic mutations, which is causing the output of "bmr calc-bmr" to be inaccurate.
Here are the first few lines of my .maf file:
#version 2.3
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID
Unknown 0 genome.wustl.edu GRCh37-lite 1 322115 322115 + Targeted_Region SNP G A G NA NA TCGA-E2-A15K TCGA-E2-A15K G G NA NA NA NA Unknown Unknown Somatic PhaseI WGS No NA NA Illumina f289e8b7-68db-48b9-8dcc-1349269eb54b c24945be-a051-4797-b7e6-09b32396f354
Unknown 0 genome.wustl.edu GRCh37-lite 1 328193 328193 + Targeted_Region SNP A A G NA NA TCGA-E2-A15K TCGA-E2-A15K A A NA NA NA NA Unknown Unknown Somatic PhaseI WGS No NA NA Illumina f289e8b7-68db-48b9-8dcc-1349269eb54b c24945be-a051-4797-b7e6-09b32396f354
Unknown 0 genome.wustl.edu GRCh37-lite 1 384901 384901 + Targeted_Region SNP G A G NA NA TCGA-E2-A15K TCGA-E2-A15K G G NA NA NA NA Unknown Unknown Somatic PhaseI WGS No NA NA Illumina f289e8b7-68db-48b9-8dcc-1349269eb54b c24945be-a051-4797-b7e6-09b32396f354
Unknown 0 genome.wustl.edu GRCh37-lite 1 390657 390657 + Targeted_Region SNP A A G NA NA TCGA-E2-A15K TCGA-E2-A15K A A NA NA NA NA Unknown Unknown Somatic PhaseI WGS No NA NA Illumina f289e8b7-68db-48b9-8dcc-1349269eb54b c24945be-a051-4797-b7e6-09b32396f354
Unknown 0 genome.wustl.edu GRCh37-lite 1 404577 404577 + Targeted_Region SNP G A G NA NA TCGA-E2-A15K TCGA-E2-A15K G G NA NA NA NA Unknown Unknown Somatic PhaseI WGS No NA NA Illumina f289e8b7-68db-48b9-8dcc-1349269eb54b c24945be-a051-4797-b7e6-09b32396f354
Here are the music commands that I am using:
genome music bmr calc-covg --bam-list /path/to/bam.list --output-dir /path/to/output_folder --reference-sequence /path/to/GRCh37-lite.fa --roi-file /path/to/gene_coordinates.bed
genome music bmr calc-bmr --bam-list /tcga/users/cdwarden/wgs/BRCA/MuSiC/bam.list --maf-file /path/to/somatic.maf --output-dir /path/to/output_folder --reference-sequence /path/to/GRCh37-lite.fa --roi-file /path/to/gene_coordinates.bed
genome music smg --gene-mr-file /path/to/gene_mrs --output-file /path/to/smgs
I have also tried adding the transcript ID to the first mutation in the .maf file (so that I would expect to see one mutation in the smgs_detailed file), but that gene is still reported to have 0 mutations.
Can you please help me troubleshoot this issue?
Thanks,
Charles
I think it's because the Hugo_Symbol values are Unknown in your MAF file.
I changed the transcript ID for the first mutation to match the corresponding gene, and that gene was still reported to not have any mutations. Also, I used "Unknown" (instead of NA, etc.) because that is what I thought the .maf format required for such genes.
Is there something else that should be changed besides "Unknown"?
I used this program a while back, and as I understand it, the gene names in the MAF file must match the gene names in the ROI file that you use for the calc-covg function. Also, it will skip all silent variants in the Variant_Classification column unless you tell it not to. In your example, I see that most of the variants have Variant_Classification set to Unknown, which might be one reason.
This is correct. The Hugo_Symbol needs to be properly defined. These calls seem to be annotated incorrectly as Targeted_Region, which is something that MuSiC skips as intergenic. Considering that the MAF says WGS, these might be legitimately intergenic calls. Check in a genome browser.
Yes - I want to characterize mutation rates in ncRNAs (most of which will not be covered in exome designs, and many of which are novel).
What would you recommend for the Variant_Classification and Variant_Type, in this situation?
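The point made above, that Hugo_Symbol values in the MAF must match gene names in the ROI file, can be checked before running calc-bmr. The following is only a minimal sketch: it assumes the ROI file is tab-delimited with the gene name in the fourth column and that the MAF is tab-delimited with a Hugo_Symbol header. The function names and the gene name ncRNA_A are hypothetical and not part of MuSiC.

```python
import csv


def roi_gene_names(roi_lines):
    """Collect gene names from column 4 of a tab-delimited ROI file (assumed layout)."""
    names = set()
    for line in roi_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            names.add(fields[3])
    return names


def unmatched_symbols(maf_lines, roi_names):
    """Count MAF rows per Hugo_Symbol that has no matching gene name in the ROI file."""
    # Skip comment lines such as "#version 2.3" so the header row comes first.
    rows = (line for line in maf_lines if not line.startswith("#"))
    counts = {}
    for row in csv.DictReader(rows, delimiter="\t"):
        symbol = row["Hugo_Symbol"]
        if symbol not in roi_names:
            counts[symbol] = counts.get(symbol, 0) + 1
    return counts


if __name__ == "__main__":
    # Tiny in-memory example; with real data, pass open file handles instead.
    roi = ["1\t322000\t323000\tncRNA_A\n"]  # hypothetical ROI entry
    maf = [
        "#version 2.3\n",
        "Hugo_Symbol\tChromosome\tStart_Position\tEnd_Position\n",
        "Unknown\t1\t322115\t322115\n",
        "ncRNA_A\t1\t322500\t322500\n",
    ]
    print(unmatched_symbols(maf, roi_gene_names(roi)))
```

If every row is reported under "Unknown", none of the mutations can be assigned to a gene in the ROI file, which would be consistent with the empty results described in the question.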
You can refer to the documentation here. When you run music bmr calc-bmr, enable the option --noskip-non-coding. You'll still need to annotate each variant with a symbol that can be matched back to a region in your ROI file. The MAF format is not very detailed in distinguishing between ncRNA types: Variant_Classification will always say RNA. But name the genes differently using annotators like VEP, and you should be fine. Have you tried the maf2maf tool?
Thank you very much!
I also wonder how to prioritize such intergenic/intronic SNVs.
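For anyone who wants to try the annotation step suggested above without a full annotator, here is a rough sketch of assigning each variant the gene name of the ROI it overlaps. It is not a replacement for VEP or maf2maf. It assumes a tab-delimited ROI file with columns chr, start, stop, gene_name in 1-based inclusive coordinates, and that every ROI is an ncRNA, so Variant_Classification is set to RNA. The function names and ncRNA_A are hypothetical.

```python
def load_rois(roi_lines):
    """Parse ROI lines into (chrom, start, stop, gene) tuples (assumed 4-column layout)."""
    rois = []
    for line in roi_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # ignore malformed or empty lines
        chrom, start, stop, gene = fields[:4]
        rois.append((chrom, int(start), int(stop), gene))
    return rois


def annotate_maf(maf_lines, rois):
    """Return MAF lines with Hugo_Symbol and Variant_Classification set for ROI overlaps."""
    out = []
    col = None  # column name -> index, filled from the header row
    for line in maf_lines:
        if line.startswith("#"):
            out.append(line)  # keep comment lines such as "#version 2.3"
            continue
        fields = line.rstrip("\n").split("\t")
        if col is None:
            col = {name: i for i, name in enumerate(fields)}
            out.append(line)
            continue
        chrom = fields[col["Chromosome"]]
        start = int(fields[col["Start_Position"]])
        end = int(fields[col["End_Position"]])
        # Linear scan over ROIs; fine for a sketch, slow for very large files.
        for r_chrom, r_start, r_stop, gene in rois:
            if r_chrom == chrom and start <= r_stop and end >= r_start:
                fields[col["Hugo_Symbol"]] = gene
                # Assumption: all ROIs are ncRNAs, so classify the variant as RNA.
                fields[col["Variant_Classification"]] = "RNA"
                break
        out.append("\t".join(fields) + "\n")
    return out


if __name__ == "__main__":
    rois = load_rois(["1\t322000\t323000\tncRNA_A\n"])  # hypothetical ROI entry
    maf = [
        "#version 2.3\n",
        "Hugo_Symbol\tChromosome\tStart_Position\tEnd_Position\tVariant_Classification\n",
        "Unknown\t1\t322115\t322115\tTargeted_Region\n",
        "Unknown\t1\t900000\t900000\tTargeted_Region\n",
    ]
    for annotated in annotate_maf(maf, rois):
        print(annotated, end="")
```

Variants that fall outside every ROI are left unchanged, so they would still be skipped. The rewritten MAF would then be passed to calc-bmr together with the --noskip-non-coding option mentioned above.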