I originally asked this question in a private message but should have posted it here, so I am reposting it along with the answer from Cyriac.
Cyriac,
Thanks for your post about pathway file generation. We are having great luck with our data now. We have been using "ensembl_67_cds_ncrna_and_splice_sites_hg19" as our region of interest (ROI) for most of the MuSiC suite. Our tumors were whole-genome sequenced at 60x coverage, and we have been trying to make an 'intergenic' ROI file in order to look at the rest of our data. We had two thoughts: one was to make an ROI that runs from base 1 to n of each chromosome, the other was to make a file that spans the regions between the exons in the file listed above. Is this a reasonable investigation, or is MuSiC designed more for exome data?
Thank you, David
PM from Cyriac Kandoth:
Hi David. You can post your question to Biostar; I'm sure the answer will be of interest to other users as well. In short: yes, it's a worthwhile investigation. We have previously used MuSiC's SMG test to find significantly altered non-coding regions... with mixed results, but interesting anyhow.
You can download intron loci in GTF format from Ensembl. Here is their latest from release 72. Look for the Human GTF at this link: http://useast.ensembl.org/info/data/ftp/index.html
It will need a bit of scripting to convert the GTF into a format that MuSiC likes.
~Cyriac
Thank you for the input and the link.
Two further questions. First, if we use "ensembl_67_cds_ncrna_and_splice_sites_hg19" as our ROI file for calc-covg, bmr and smg, is there any point in using the noskip-non-coding option? It seems that if our ROI is the exons, then including non-coding mutations would give erroneous results, since calc-covg did not look at those regions (the introns) while bmr would count mutations in them when calculating the background mutation rate. Is that right?
Second, if I am understanding correctly, the Ensembl database uses the GENCODE dataset, which is based on the protein-coding loci, and in the GTF file I cannot find any "intron_CNS" features; it seems to contain only exon and CDS sites. I looked in some earlier versions and couldn't find intron regions there either. Should I create a 'gene region' that spans the start to end, including all flanking, intron, exon and CDS sites, or are there defined intron regions somewhere that I'm missing?
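For anyone in the same position: if the GTF really only carries exon and CDS records, introns can in principle be derived as the gaps between consecutive exons of each transcript. Below is a minimal sketch in Python, assuming a standard 9-column GTF with 1-based inclusive coordinates and `gene_id` / `transcript_id` attributes. The function name `introns_from_gtf_lines` is my own invention for illustration and is not part of MuSiC or any Ensembl tool.

```python
import re
from collections import defaultdict

def introns_from_gtf_lines(lines):
    """Return (chrom, start, end, gene, transcript) tuples for introns.

    Coordinates are 1-based inclusive, as in the GTF itself. Introns are
    taken to be the gaps between consecutive exons of one transcript.
    """
    exons = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records are needed
        # Attributes look like: gene_id "X"; transcript_id "Y"; ...
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        gene = attrs.get("gene_name", attrs.get("gene_id", "."))
        key = (fields[0], gene, attrs.get("transcript_id", "."))
        exons[key].append((int(fields[3]), int(fields[4])))

    introns = []
    for (chrom, gene, transcript), spans in exons.items():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start > prev_end + 1:  # ignore abutting exons
                introns.append((chrom, prev_end + 1, next_start - 1,
                                gene, transcript))
    return introns
```

Because this works per transcript, introns from different transcripts of the same gene will overlap, so the output would probably need merging per gene before being used as an ROI file.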
Thank you, DD
Edit: Regarding the second question, I was able to export introns from the UCSC Browser in BED format, and I'm reconfiguring that into an ROI file for MuSiC. I guess the general question stands: does the GENCODE dataset currently include introns?
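The BED-to-ROI reconfiguring is roughly the following. This is only a sketch, and it rests on two assumptions that should be checked against the MuSiC documentation: that the ROI file is tab-delimited with columns chromosome, start, stop and gene name, and that its coordinates are 1-based inclusive (BED is 0-based, half-open, so only the start shifts by one). The function name `bed_to_roi` and the `strip_chr` option are hypothetical helpers of mine.

```python
def bed_to_roi(bed_lines, strip_chr=False):
    """Convert BED lines to tab-delimited 'chrom start stop name' rows.

    BED start is 0-based and the end is exclusive, so a 1-based
    inclusive interval is (start + 1, end).
    """
    rows = []
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and UCSC header lines
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else "."
        if strip_chr and chrom.startswith("chr"):
            chrom = chrom[3:]  # match references that use '1' not 'chr1'
        rows.append("\t".join([chrom, str(start + 1), str(end), name]))
    return rows
```

Note that the name column in a UCSC intron export is a transcript-based label rather than a gene symbol, so mapping it to gene names would be a separate step, and the chromosome naming has to match whatever the reference and BAM files use.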