Question

TCGA: Discrepancy in mutation record numbers based on data acquisition method?


6.2 years ago

sc ▴ 20

Hi all,

Why do I get a different number of records for a gene mutation depending on which method I use to obtain the data?

For example, if I wanted to know which patients had a KRAS mutation in the LUAD dataset from TCGA:

1) I could use GDCquery_Maf from TCGAbiolinks, as below, which gives me 117 records.

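library(TCGAbiolinks)
library(maftools)
# Download the LUAD somatic mutation calls from the MuSE pipeline
maf <- GDCquery_Maf("LUAD", pipelines = "muse")
# Keep only the KRAS records and count the rows
maf_kras <- maf[which(maf$Hugo_Symbol == 'KRAS'),]
nrow(maf_kras)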

[1] 117

2) The MAF file from the analysis at http://gdac.broadinstitute.org/runs/analyses__2016_01_28/reports/cancer/LUAD-TP/MutSigNozzleReport2CV/nozzle.html tells me there are 161 mutated samples.

maf2 <- read.maf("./LUAD-TP.final_analysis_set.maf.txt")
gene_summary <- getGeneSummary(maf2)
gene_summary <- gene_summary[which(gene_summary$Hugo_Symbol == "KRAS"),]
gene_summary$MutatedSamples

[1] 161

And all 161 of these samples have a unique patient ID if we check the first 12 characters of the barcode:

kras_mutant_barcodes <- genesToBarcodes(maf = maf2, genes = "KRAS", justNames = TRUE)
kras_mutant_barcodes <- substr(unique(as.character(unlist(kras_mutant_barcodes))), start = 1, stop = 12)
length(unique(kras_mutant_barcodes))

[1] 161

Additionally, is it correct to assume that if a patient does not have a KRAS mutation record then they are considered to be a non-mutant?

Thanks!

R TCGA mutation MAF • 1.5k views

updated 6.2 years ago by Kevin Blighe 89k • written 6.2 years ago by sc ▴ 20

score 1 · Accepted Answer · 2019-06-07

While this is surprising for first-time users of these programs, it is not surprising to people like me who have already processed much of the TCGA data. TCGAbiolinks and GDAC (Broad Institute) can both be regarded as third parties in terms of TCGA data housing. Each will have pulled data from the GDC (Genomic Data Commons) at a specific time point and processed/filtered it in a certain way. Keep in mind, too, that the data at the GDC has been changing and updating over the past few years. It may therefore prove a futile exercise to find the exact reasons behind the discrepancy.
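If you do want to narrow the discrepancy down, a reasonable first step is to compare like with like: the first count in the question is a number of MAF rows, while the second is a number of mutated samples. The sketch below counts unique 12-character patient IDs per source and lists the patients found in only one of them. It is only an illustration on toy data: the barcodes are made up, `maf_gdc` and `maf_broad` stand in for the two real tables, and `patients_with_gene()` is a hypothetical helper, not part of TCGAbiolinks or maftools.

```r
# Toy stand-ins for the two MAF tables; the barcodes below are made up.
maf_gdc <- data.frame(
  Hugo_Symbol = c("KRAS", "KRAS", "KRAS", "TP53"),
  Tumor_Sample_Barcode = c(
    "TCGA-AA-0001-01A-11D-0000-08",
    "TCGA-AA-0001-01A-11D-0000-08",  # second KRAS record, same patient
    "TCGA-AA-0002-01A-11D-0000-08",
    "TCGA-AA-0003-01A-11D-0000-08"
  ),
  stringsAsFactors = FALSE
)
maf_broad <- data.frame(
  Hugo_Symbol = c("KRAS", "KRAS", "KRAS"),
  Tumor_Sample_Barcode = c(
    "TCGA-AA-0001-01A-11D-0000-08",
    "TCGA-AA-0002-01A-11D-0000-08",
    "TCGA-AA-0004-01A-11D-0000-08"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: unique 12-character patient IDs with a record in `gene`.
patients_with_gene <- function(maf, gene) {
  hits <- maf[maf$Hugo_Symbol == gene, , drop = FALSE]
  unique(substr(as.character(hits$Tumor_Sample_Barcode), 1, 12))
}

gdc_patients <- patients_with_gene(maf_gdc, "KRAS")
broad_patients <- patients_with_gene(maf_broad, "KRAS")

# Number of KRAS rows (records), which can exceed the number of patients.
n_records <- sum(maf_gdc$Hugo_Symbol == "KRAS")

# Patients reported by one source but not the other.
only_in_broad <- setdiff(broad_patients, gdc_patients)
only_in_gdc <- setdiff(gdc_patients, broad_patients)
```

With the real tables, the patients in `only_in_broad` and `only_in_gdc` are the ones worth inspecting by hand, although this will not by itself explain why the two sources differ.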

Whenever I need TCGA data, I take it directly from the GDC, avoid the use of any third party, and time-stamp the download. MAF files are Level 3 (open access) at the GDC, but there may be more than one for a particular cancer, reflecting the fact that the sequencing and data processing were performed at different centres. Also, the same sample may have been sequenced at 2 or more centres; keep this in mind.
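One possible way to time-stamp a download is to write a small sidecar file next to it, recording when it was fetched and its checksum. This is a minimal sketch using base R only; `stamp_download()`, the `.stamp.tsv` suffix, and the placeholder file are all hypothetical and are not part of any GDC tooling.

```r
# Hypothetical helper: write a sidecar file recording when a file was
# downloaded and its MD5 checksum, so the data release can be traced later.
stamp_download <- function(path, source_label) {
  stopifnot(file.exists(path))
  stamp <- data.frame(
    file = basename(path),
    source = source_label,
    downloaded_utc = format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
    md5 = unname(tools::md5sum(path)),
    stringsAsFactors = FALSE
  )
  stamp_path <- paste0(path, ".stamp.tsv")
  write.table(stamp, stamp_path, sep = "\t", quote = FALSE, row.names = FALSE)
  stamp_path
}

# Usage with a placeholder file standing in for a downloaded MAF.
maf_path <- file.path(tempdir(), "example.maf.txt")
writeLines("Hugo_Symbol\tTumor_Sample_Barcode", maf_path)
stamp_path <- stamp_download(maf_path, source_label = "GDC data portal")
stamp <- read.delim(stamp_path, colClasses = "character")
```

Keeping the checksum alongside the date makes it possible to tell later whether two copies of a MAF are actually the same file.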

Additionally, is it correct to assume that if a patient does not have a KRAS mutation record then they are considered to be a non-mutant?

Possibly, but the depth of coverage may have been low over the region in one or more samples, so nothing was called. You would have to obtain the original BAM files in order to get the complete picture.
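To make the distinction concrete, here is a small sketch that separates "no mutation called" from "cannot tell" once per-patient coverage over KRAS is available. Everything in it is assumed for illustration: the patient IDs and depths are made up, the depths would in practice have to be computed from the BAM files, and the threshold of 20 reads is arbitrary rather than any standard.

```r
# Made-up patients and depths, for illustration only.
patients <- c("P1", "P2", "P3", "P4")
kras_mutant <- c("P1")  # patients with a KRAS record in the MAF
# Hypothetical read depth over the KRAS region (NA = no coverage data).
kras_depth <- c(P1 = 85, P2 = 60, P3 = 4, P4 = NA)
min_depth <- 20  # assumed threshold, not a standard value

classify_kras <- function(patient) {
  if (patient %in% kras_mutant) return("mutant")
  depth <- kras_depth[[patient]]
  if (is.na(depth) || depth < min_depth) {
    return("unknown (low or missing coverage)")
  }
  "no mutation called"
}

status <- vapply(patients, classify_kras, character(1))
```

Only the patients labelled "no mutation called" can reasonably be treated as non-mutant; the "unknown" group is better excluded or flagged.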

Kevin