Question

MutSigCV input files

7.7 years ago

haiying.kong ▴ 360

From the MutSigCV documentation: http://software.broadinstitute.org/cancer/software/genepattern/modules/docs/MutSigCV

I know you can use the datasets that come with the software, but the results will not be as good as when you provide details from your own data. So I am trying to provide the information from my own data.

But this is very confusing. How are these defined?

1. CpG transitions
2. CpG transversions
3. C:G transitions
4. C:G transversions
5. A:T transitions
6. A:T transversions
7. null+indel mutations

Category 7 is clear. (1) How is CpG defined? Does a site count as CpG whenever the ref_allele is C/G and the adjacent nucleotide is G/C, or does it have to be in a CpG island? (2) What are C:G and A:T?

If I look at the dataset that comes with the software:

exome_full192.coverage.txt

gene  effect     categ     coverage
A1BG  noncoding  A(A->C)A  12
A1BG  noncoding  A(A->C)C  14
A1BG  noncoding  A(A->C)G  15
A1BG  noncoding  A(A->C)T  9
A1BG  noncoding  A(A->G)A  12
A1BG  noncoding  A(A->G)C  14
A1BG  noncoding  A(A->G)G  15
A1BG  noncoding  A(A->G)T  9

The categ column here is not consistent with how it is defined in the other input datasets.
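One way to read these labels is as 5' neighbour, (reference base -> alternate base), 3' neighbour; 4 x 4 x 3 x 4 combinations gives 192, which matches the "full192" in the file name. That reading is an assumption, not something the documentation quoted here states. A minimal Python sketch that splits a row under that assumption (the function names are hypothetical):

```python
import re

# Assumed label layout: 5' neighbour, (ref -> alt), 3' neighbour, e.g. "A(A->C)G".
CATEG_RE = re.compile(r"^([ACGT])\(([ACGT])->([ACGT])\)([ACGT])$")


def parse_context_categ(categ):
    """Split a label like 'A(A->C)G' into (left, ref, alt, right)."""
    match = CATEG_RE.match(categ.strip())
    if match is None:
        raise ValueError(f"unrecognised categ label: {categ!r}")
    return match.groups()


def parse_coverage_row(line):
    """Parse one whitespace-separated row: gene, effect, categ, coverage."""
    gene, effect, categ, coverage = line.split()
    return gene, effect, parse_context_categ(categ), int(coverage)


print(parse_coverage_row("A1BG noncoding A(A->C)G 15"))
# ('A1BG', 'noncoding', ('A', 'A', 'C', 'G'), 15)
```

Once the label is split into its four bases, it can be compared with the coarser category names used in the other input files.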

What is coverage here? Is it the alternate-allele count in the tumor? The documentation is very confusing.

score 0 · Answer 1 · 2017-03-24

To get categ for the mutations, I interpreted the terms as follows:
CpG: the reference allele is C or G, with an adjacent G or C respectively.
C:G: the reference allele is at a C:G base pair.
A:T: the reference allele is at an A:T base pair.
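The interpretation above can be sketched in Python as follows. This is only an illustration under stated assumptions: "CpG" is taken in the strand-aware sense (a C followed by G, or a G preceded by C), "C:G" is taken to mean the remaining C or G reference bases, and the function names are hypothetical rather than anything from MutSigCV.

```python
PURINES = {"A", "G"}


def is_transition(ref, alt):
    """Transitions stay within purines (A<->G) or within pyrimidines (C<->T)."""
    return (ref in PURINES) == (alt in PURINES)


def mutation_class(left, ref, alt, right):
    """Name the category of a single-base substitution.

    left/right are the 5' and 3' neighbours of the reference base.
    Assumption: CpG means C followed by G, or G preceded by C.
    """
    if ref == alt:
        raise ValueError("ref and alt must differ")
    if ref in {"A", "T"}:
        site = "A:T"
    elif (ref == "C" and right == "G") or (ref == "G" and left == "C"):
        site = "CpG"
    else:
        site = "C:G"
    kind = "transitions" if is_transition(ref, alt) else "transversions"
    return f"{site} {kind}"


print(mutation_class("A", "C", "T", "G"))  # CpG transitions
print(mutation_class("A", "C", "A", "T"))  # C:G transversions
print(mutation_class("A", "A", "G", "A"))  # A:T transitions
```

Null and indel mutations are left out here because they do not depend on the base context.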

For the coverage data, I removed all mutations with the value "null" for "effect". Following the instructions on the website (http://software.broadinstitute.org/cancer/software/genepattern/modules/docs/MutSigCV), I made a wide table for coverage, in which each patient takes one column of coverage information. Then I got an error message saying something like the table is too wide (the machine is dead, so I cannot copy the error). So I tried to make it a long table with the columns gene, effect, categ, patient, coverage. The machine crashed, and I will have to try again when the admin is back on Monday.
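For reference, the relationship between the two layouts can be sketched with plain Python. This only illustrates the reshaping from long rows (gene, effect, categ, patient, coverage) to one column per patient; the patient names and numbers are made up, filling missing cells with 0 is an assumption, and nothing here is confirmed to be the layout MutSigCV accepts.

```python
def long_to_wide(rows, fill=0):
    """Reshape (gene, effect, categ, patient, coverage) rows to one column per patient.

    Raises ValueError on duplicate (gene, effect, categ, patient) keys
    instead of silently picking one value.
    """
    rows = list(rows)
    patients = sorted({patient for _, _, _, patient, _ in rows})
    table = {}
    for gene, effect, categ, patient, coverage in rows:
        cells = table.setdefault((gene, effect, categ), {})
        if patient in cells:
            raise ValueError(f"duplicate row for {(gene, effect, categ, patient)}")
        cells[patient] = coverage
    header = ["gene", "effect", "categ"] + patients
    body = [list(key) + [cells.get(p, fill) for p in patients]
            for key, cells in sorted(table.items())]
    return header, body


# Hypothetical patients P1, P2 and made-up coverage values.
long_rows = [
    ("A1BG", "noncoding", "A:T transitions", "P1", 12),
    ("A1BG", "noncoding", "A:T transitions", "P2", 14),
    ("A1BG", "silent", "A:T transitions", "P1", 9),
]
header, body = long_to_wide(long_rows)
print("\t".join(header))
for row in body:
    print("\t".join(str(cell) for cell in row))
```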

If anyone has succeeded in using their own coverage data instead of the file that comes with the software, could you please tell me what the coverage data file should look like?
(1) What are the column names?
(2) Which number do you use for coverage? Is it the count of the alternate allele in the tumor?
(3) How do you treat duplicates? For the same gene, effect, categ, and patient there can be multiple rows. Do you take the max of the coverage?
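To make question (3) concrete, this is a sketch of what collapsing duplicates would look like mechanically. Whether max, sum, or something else is the right choice is exactly what I am asking; the combine argument, the patient names, and the numbers are all hypothetical.

```python
import operator


def collapse_duplicates(rows, combine=max):
    """Merge rows sharing (gene, effect, categ, patient) with a two-argument function."""
    merged = {}
    for gene, effect, categ, patient, coverage in rows:
        key = (gene, effect, categ, patient)
        if key in merged:
            merged[key] = combine(merged[key], coverage)
        else:
            merged[key] = coverage
    # Dicts keep insertion order, so rows come back in first-seen order.
    return [key + (value,) for key, value in merged.items()]


# Made-up rows with one duplicated key.
dup_rows = [
    ("A1BG", "noncoding", "A:T transitions", "P1", 12),
    ("A1BG", "noncoding", "A:T transitions", "P1", 15),
    ("A1BG", "noncoding", "A:T transitions", "P2", 9),
]
print(collapse_duplicates(dup_rows))                # keeps 15 for P1
print(collapse_duplicates(dup_rows, operator.add))  # gives 27 for P1
```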