How to understand somatic mutation of ICGC data
7.6 years ago
wangshx ▴ 10

I am new to processing somatic mutation data from ICGC. From https://dcc.icgc.org/releases/current/Summary I downloaded simple_somatic_mutation.aggregated.vcf.gz, which is a VCF file. Every mutation ID in it is annotated with the number of donors affected. Is this the mutation count? When I looked for more detailed data, only .tsv files were provided. I am also confused that the same mutation ID appears many times. That is, within one sample, why is there more than one record at the same chromosomal locus?

For example, donor https://dcc.icgc.org/donors/DO51576 has mutation ID MU28652212. It affects only one donor across all projects, yet the .tsv file for project LUSC-CN contains 5 rows for MU28652212. When I compute mutation counts, should I treat it as 1 mutation or 5 mutations?

Please help.

somatic mutation ICGC genome • 3.8k views

This is because mutation MU28652212 affects 5 transcripts, and each affected transcript gets its own row. You need to prioritize one transcript out of the 5. One way to do this is to use maf2maf, which will do it for you. You can use maftools to convert the ICGC simple somatic mutation format to MAF and process it further (apologies for the shameless promotion).
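To illustrate what "prioritizing one transcript" means, here is a minimal Python sketch that keeps a single row per mutation ID by choosing the most severe consequence. It is a simplified stand-in, not what maf2maf does internally. The column names (icgc_mutation_id, transcript_affected, consequence_type) are assumed to match the ICGC .tsv header, and the severity order, mutation ID and transcript names are hypothetical.

```python
# Hypothetical severity order, most severe first.
# This is an illustrative ranking, NOT the one maf2maf applies.
SEVERITY = [
    "stop_gained",
    "frameshift_variant",
    "missense_variant",
    "synonymous_variant",
    "intron_variant",
]
RANK = {name: i for i, name in enumerate(SEVERITY)}


def pick_one_row_per_mutation(rows):
    """Keep, for each mutation ID, the row with the most severe consequence."""
    best = {}
    for row in rows:
        key = row["icgc_mutation_id"]
        # Consequences missing from the list sort last.
        rank = RANK.get(row["consequence_type"], len(SEVERITY))
        if key not in best or rank < best[key][0]:
            best[key] = (rank, row)
    return [row for _, row in best.values()]


# Toy rows: one made-up mutation annotated against five made-up transcripts.
toy_rows = [
    {"icgc_mutation_id": "MU_EXAMPLE", "transcript_affected": "T1", "consequence_type": "intron_variant"},
    {"icgc_mutation_id": "MU_EXAMPLE", "transcript_affected": "T2", "consequence_type": "missense_variant"},
    {"icgc_mutation_id": "MU_EXAMPLE", "transcript_affected": "T3", "consequence_type": "synonymous_variant"},
    {"icgc_mutation_id": "MU_EXAMPLE", "transcript_affected": "T4", "consequence_type": "intron_variant"},
    {"icgc_mutation_id": "MU_EXAMPLE", "transcript_affected": "T5", "consequence_type": "upstream_gene_variant"},
]

chosen = pick_one_row_per_mutation(toy_rows)
print(chosen[0]["transcript_affected"])
```

For real analyses a dedicated annotation tool is probably the safer choice, since the choice of transcript affects the reported gene and protein change.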


Hello, I am wondering which reference genome version and gene model are needed to run maf2maf on the ICGC simple somatic mutation format. I tested GRCh37.69, GRCh37.75 and GRCh37.102, and none of them worked. (The messages look like this: [faidx] Failed to fetch sequence in 38078819:38078818-38078820 ERROR: Make sure that ref-fasta is the same genome build as your MAF)

7.6 years ago
solo7773 ▴ 90

'Donor affected' means how many donors/patients carry this mutation.

ICGC mainly provides data in tabular format (tsv).

Duplicate rows with the same 'Mutation ID' exist because the mutation affects multiple genes/transcripts. In your example, it should be counted as 1 mutation. You can also refer to this doc for another example.
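Counting each mutation once per donor can be sketched in Python as below. This is only an illustration: the column names icgc_donor_id and icgc_mutation_id are assumed to match the header of the ICGC .tsv file, and the IDs in the toy table are made up.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for an ICGC simple somatic mutation .tsv file.
# Column names are assumed; donor, mutation and transcript IDs are made up.
SAMPLE_TSV = (
    "icgc_mutation_id\ticgc_donor_id\ttranscript_affected\n"
    "MU_X\tDO_A\tT1\n"
    "MU_X\tDO_A\tT2\n"
    "MU_X\tDO_A\tT3\n"
    "MU_X\tDO_A\tT4\n"
    "MU_X\tDO_A\tT5\n"
    "MU_Y\tDO_A\tT9\n"
    "MU_Y\tDO_B\tT9\n"
)


def count_mutations_per_donor(handle):
    """Count distinct mutation IDs per donor, ignoring per-transcript repeats."""
    seen = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        seen[row["icgc_donor_id"]].add(row["icgc_mutation_id"])
    return {donor: len(ids) for donor, ids in seen.items()}


counts = count_mutations_per_donor(io.StringIO(SAMPLE_TSV))
print(counts)
```

In the toy table donor DO_A has six rows but only two distinct mutation IDs, so it is counted as carrying 2 mutations. For a real file, an open file handle can be passed in place of the io.StringIO object.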


Thanks! If I want to compute the mutation spectrum, should I merge the rows that share a mutation ID within a sample into 1 mutation?


I think so. If you only care about the mutations within a sample, that is fine, because rows with a duplicate ID record the same chromosome, position, reference allele and alternate allele.
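A rough Python sketch of that merge followed by a six-class substitution spectrum is shown below. The column names (icgc_donor_id, icgc_mutation_id, mutated_from_allele, mutated_to_allele) are assumed to match the ICGC .tsv header, the toy rows are made up, and folding purine references onto the pyrimidine strand is a common convention rather than something the ICGC file does for you.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref, alt):
    """Return a pyrimidine-based class such as 'C>T', or None if not an SNV."""
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        return None  # indels, multi-base changes, or missing alleles
    if ref in ("A", "G"):
        # Report the change on the opposite strand so the reference is C or T.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def mutation_spectrum(rows):
    """Count substitution classes, using each (donor, mutation ID) pair once."""
    seen = set()
    spectrum = Counter()
    for row in rows:
        key = (row["icgc_donor_id"], row["icgc_mutation_id"])
        if key in seen:
            continue  # repeated row for another transcript of the same mutation
        seen.add(key)
        cls = substitution_class(row["mutated_from_allele"], row["mutated_to_allele"])
        if cls is not None:
            spectrum[cls] += 1
    return spectrum


def make_row(donor, mutation, ref, alt):
    return {
        "icgc_donor_id": donor,
        "icgc_mutation_id": mutation,
        "mutated_from_allele": ref,
        "mutated_to_allele": alt,
    }


# Toy rows with made-up IDs: MU_X repeats five times, once per transcript.
toy_rows = [make_row("DO_A", "MU_X", "G", "A") for _ in range(5)]
toy_rows.append(make_row("DO_A", "MU_Y", "C", "A"))
toy_rows.append(make_row("DO_A", "MU_Z", "-", "T"))  # insertion, skipped

spectrum = mutation_spectrum(toy_rows)
print(dict(spectrum))
```

Here the five MU_X rows contribute a single C>T event (G>A on the opposite strand), and the insertion is left out because it is not a single-base substitution.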
