I am new to processing somatic mutation data from ICGC. I downloaded simple_somatic_mutation.aggregated.vcf.gz from https://dcc.icgc.org/releases/current/Summary, which is a VCF file. Every mutation ID in it is annotated with the number of donors affected. Is this the mutation count? When I wanted more detailed data, only .tsv files were provided. I am also confused because the same mutation ID appears many times in those files. I mean, within one sample, why is there more than one record at the same chromosomal locus?
For example, donor DO51576 (https://dcc.icgc.org/donors/DO51576) has a mutation with ID MU28652212. It affects only one donor across all projects, yet the .tsv file for project LUSC-CN contains 5 rows for MU28652212. When I compute mutation counts, should I treat it as 1 mutation or 5 mutations?
Please help.
This is because mutation MU28652212 affects 5 transcripts, and the .tsv file repeats the mutation once for each affected transcript, so it is still a single mutation. You need to prioritize one transcript out of the 5. One way to do this is to use maf2maf, which will do this for you. You can use maftools to convert the ICGC simple somatic mutation format to MAF and process it further (apologies for the shameless promotion).
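If all you need is a mutation count, a minimal sketch of the idea is to count distinct mutation IDs per donor instead of counting rows. The column names icgc_mutation_id, icgc_donor_id and transcript_affected are assumed to match the header of the ICGC .tsv file (check your own header), and the rows below, including the second mutation ID and the transcript IDs, are made-up placeholders for illustration only.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of a simple somatic mutation .tsv file.
# Column names are assumptions; transcript IDs and MU_EXAMPLE are placeholders.
HEADER = ["icgc_mutation_id", "icgc_donor_id", "project_code", "transcript_affected"]
ROWS = [
    ["MU28652212", "DO51576", "LUSC-CN", "T1"],
    ["MU28652212", "DO51576", "LUSC-CN", "T2"],
    ["MU28652212", "DO51576", "LUSC-CN", "T3"],
    ["MU28652212", "DO51576", "LUSC-CN", "T4"],
    ["MU28652212", "DO51576", "LUSC-CN", "T5"],
    ["MU_EXAMPLE", "DO51576", "LUSC-CN", "T6"],
    ["MU_EXAMPLE", "DO51576", "LUSC-CN", "T7"],
]
SAMPLE_TSV = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def count_mutations_per_donor(handle):
    """Count distinct mutation IDs per donor, ignoring per-transcript repeats."""
    reader = csv.DictReader(handle, delimiter="\t")
    seen = defaultdict(set)
    for row in reader:
        # A set keeps each mutation ID once, however many rows mention it.
        seen[row["icgc_donor_id"]].add(row["icgc_mutation_id"])
    return {donor: len(mutations) for donor, mutations in seen.items()}


counts = count_mutations_per_donor(io.StringIO(SAMPLE_TSV))
print(counts)
```

Here the 5 rows of MU28652212 count as one mutation. For a real download, the same function should work on a handle opened with gzip.open(path, "rt"), provided the header uses these column names.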
Hello, I am wondering which reference genome version and gene model are needed to run maf2maf on the ICGC simple somatic mutation format. I tested GRCh37.69, GRCh37.75 and GRCh37.102, and none of them worked. (The messages look like this: [faidx] Failed to fetch sequence in 38078819:38078818-38078820 ERROR: Make sure that ref-fasta is the same genome build as your MAF)