Question

Methods for calculation of TMB via GENIE/cBioPortal data

7 months ago

Andrew • 0

I'm working with the AACR GENIE v15 dataset to compare mutational and sample characteristics stratified by demographic data. I'm stuck on how to calculate the tumor mutational burden (TMB) from the data provided, which I'm defining as [number of somatic mutations / megabases sequenced].

GENIE data guide pasted for reference below.

https://www.aacr.org/wp-content/uploads/2024/02/15.0-public_data_guide-.pdf

The part I'm struggling with is determining the total megabases sequenced by each sequencing assay. Under "assay information", the closest field appears to be "read length", which the linked GDC data dictionary defines simply as "The length of the reads." Can this be assumed to be in megabases? I'm suspicious because the variation in read_length doesn't seem to track the number of genes tested (e.g., MSK-IMPACT-468 tests 468 genes but has a read length of 1, while DFCI OncoPanel 3 tests 447 genes with a read length of 100), which makes me wonder if I'm looking in the wrong place. However, I don't see any other variable that clearly corresponds to the total number of Mb sequenced.

I assume that the "start position" and "end position" listed under "genomic information" for each sequenced gene best correspond to the number of bp sequenced for that gene. My idea is to calculate the difference (end minus start) for each sequenced region of each gene tested by an assay, and use the sum as the total number of sequenced bases.

This method seems like it would work in theory, but it looks labor-intensive compared to a similar reference paper I'm using (https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2800081), which states that TMB was simply extracted from cBioPortal. That again makes me think I'm missing something.

Appreciate any assistance!

GENIE TMB • 634 views

updated 7 months ago by Zhenyu Zhang ★ 1.2k • written 7 months ago by Andrew • 0

score 0 · Answer 1 · 2024-09-05

You need the BED file for each individual panel.
Merge the segments in each BED file if any of them overlap, so that no base is counted twice.
Check what other region-based filtering has been applied to the mutation file. For example, if the mutations have also been filtered to exons, you need to restrict your BED files to exonic regions as well.
Decide whether you want to count all mutations or only non-synonymous ones.
Finally, divide the mutation count by the total panel size in megabases.
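
The steps above can be sketched in plain Python. This is only a minimal illustration under several assumptions, not the GENIE or cBioPortal method: intervals are taken to be half-open BED-style (chrom, start, end) tuples, mutations are (sample ID, variant classification) pairs, and the names `merged_length_bp`, `tmb_per_sample`, `NONSYNONYMOUS`, and the toy panel and sample IDs are all hypothetical. Which variant classes count as non-synonymous is a choice you have to make yourself; the set below uses standard MAF `Variant_Classification` values purely as an example.

```python
from collections import Counter, defaultdict

# Illustrative choice of MAF Variant_Classification values to count.
# This set is an assumption; adjust it to your own TMB definition.
NONSYNONYMOUS = {
    "Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation",
    "Frame_Shift_Del", "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins",
    "Splice_Site", "Translation_Start_Site",
}


def merged_length_bp(intervals):
    """Total bases covered by (chrom, start, end) intervals.

    Overlapping or touching intervals on the same chromosome are merged
    first, so each base is counted once. Assumes half-open coordinates
    (end exclusive); for inclusive coordinates, add 1 to each end first.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    total = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:          # overlaps or touches current block
                cur_end = max(cur_end, end)
            else:                         # gap: close the current block
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


def tmb_per_sample(mutations, sample_to_panel, panel_intervals, keep=None):
    """Return {sample_id: mutations per Mb}.

    mutations: iterable of (sample_id, variant_classification)
    sample_to_panel: {sample_id: panel_id}
    panel_intervals: {panel_id: [(chrom, start, end), ...]}
    keep: set of variant classifications to count, or None for all
    """
    panel_mb = {p: merged_length_bp(iv) / 1e6 for p, iv in panel_intervals.items()}
    counts = Counter(s for s, vc in mutations if keep is None or vc in keep)
    # Samples with no retained mutations get a TMB of 0 rather than being dropped.
    return {s: counts[s] / panel_mb[p] for s, p in sample_to_panel.items()}


if __name__ == "__main__":
    # Toy data: two intervals on chromosome 1 overlap by 500 bp.
    panels = {"PANEL_A": [("1", 0, 1000), ("1", 500, 1500), ("2", 0, 500)]}
    samples = {"S1": "PANEL_A", "S2": "PANEL_A"}
    muts = [("S1", "Missense_Mutation"), ("S1", "Nonsense_Mutation"), ("S1", "Silent")]
    print(merged_length_bp(panels["PANEL_A"]))
    print(tmb_per_sample(muts, samples, panels, keep=NONSYNONYMOUS))
```

In the toy panel the merged footprint is 2,000 bp rather than the 2,500 bp you would get by summing raw end minus start values, which is why the merge step matters. With real data you would also intersect the intervals with exons (or whatever regions the mutation file was filtered to) before summing.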

Please also note that TMB values obtained from different kits (panels) are not really comparable.