Hi all,
Why do I get a different number of records for a gene mutation depending on which method I use to obtain the data?
For example, if I wanted to know which patients had a KRAS mutation in the LUAD dataset from TCGA:
1) Using GDCquery_Maf() from TCGAbiolinks (MuSE pipeline), as below, I get 117 records.
library(TCGAbiolinks)
library(maftools)
maf <- GDCquery_Maf("LUAD", pipelines = "muse")
maf_kras <- maf[which(maf$Hugo_Symbol == 'KRAS'),]
nrow(maf_kras)  # number of KRAS mutation records (rows), not unique samples
[1] 117
2) Using the MAF file from the Broad GDAC Firehose analysis at: http://gdac.broadinstitute.org/runs/analyses__2016_01_28/reports/cancer/LUAD-TP/MutSigNozzleReport2CV/nozzle.html, I get 161 mutated samples.
maf2 <- read.maf("./LUAD-TP.final_analysis_set.maf.txt")
gene_summary <- getGeneSummary(maf2)
gene_summary <- gene_summary[which(gene_summary$Hugo_Symbol == "KRAS"),]
gene_summary$MutatedSamples
[1] 161
And all 161 of these samples map to a distinct patient when we take the first 12 characters of the barcode (the patient ID):
kras_mutant_barcodes <- genesToBarcodes(maf = maf2, genes = "KRAS", justNames = TRUE)
kras_mutant_barcodes <- substr(unique(as.character(unlist(kras_mutant_barcodes))), start = 1, stop = 12)
length(unique(kras_mutant_barcodes))
[1] 161
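For a like-for-like comparison, the 117 GDC records could be collapsed to unique patients in the same way, since a MAF has one row per mutation and a sample can carry more than one KRAS variant. Below is a minimal sketch in base R, using a small made-up table in place of the real MAF. The barcodes and the helper name count_records_and_patients are hypothetical; only the Hugo_Symbol and Tumor_Sample_Barcode column names come from the MAF format.

```r
# Toy stand-in for a MAF table: one row per mutation record.
# Barcodes are invented for illustration only.
toy_maf <- data.frame(
  Hugo_Symbol = c("KRAS", "KRAS", "KRAS", "TP53", "EGFR"),
  Tumor_Sample_Barcode = c(
    "TCGA-AA-0001-01A-11D-0000-08",
    "TCGA-AA-0001-01A-11D-0000-08",  # second KRAS record, same sample
    "TCGA-AA-0002-01A-11D-0000-08",
    "TCGA-AA-0002-01A-11D-0000-08",
    "TCGA-AA-0003-01A-11D-0000-08"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count records vs. unique patients for one gene.
count_records_and_patients <- function(maf, gene) {
  hits <- maf[maf$Hugo_Symbol == gene, , drop = FALSE]
  # The first 12 characters of a TCGA barcode identify the patient.
  patients <- unique(substr(hits$Tumor_Sample_Barcode, 1, 12))
  list(records = nrow(hits), patients = length(patients))
}

kras_counts <- count_records_and_patients(toy_maf, "KRAS")
kras_counts$records   # rows that mention KRAS
kras_counts$patients  # distinct patients among those rows
```

In this toy table the record count is larger than the patient count, so comparing a row count from one source with a sample count from another is not comparing the same quantity.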
Additionally, is it correct to assume that a patient without a KRAS mutation record can be considered non-mutant (wild-type) for KRAS?
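To make the question concrete, this is the assumption I have in mind, sketched in base R on a made-up table (the barcodes and the all_patients vector are hypothetical). One thing I am unsure about: a patient with no called mutations at all would not appear in the MAF, so the full patient list presumably has to come from somewhere else, such as the clinical data.

```r
# Toy stand-in for a MAF table; barcodes are invented for illustration only.
toy_maf <- data.frame(
  Hugo_Symbol = c("KRAS", "TP53", "EGFR"),
  Tumor_Sample_Barcode = c(
    "TCGA-AA-0001-01A-11D-0000-08",
    "TCGA-AA-0002-01A-11D-0000-08",
    "TCGA-AA-0003-01A-11D-0000-08"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical full cohort, e.g. taken from clinical data rather than the MAF.
all_patients <- c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003", "TCGA-AA-0004")

# Patient IDs are the first 12 characters of each sample barcode.
maf_patients <- substr(toy_maf$Tumor_Sample_Barcode, 1, 12)

# Patients with at least one KRAS record.
kras_mutant <- unique(maf_patients[toy_maf$Hugo_Symbol == "KRAS"])

# Assumed non-mutant: everyone in the cohort without a KRAS record.
kras_non_mutant <- setdiff(all_patients, kras_mutant)

# Patients who never appear in the MAF (no calls, or possibly not sequenced).
missing_from_maf <- setdiff(all_patients, unique(maf_patients))
```

Here TCGA-AA-0004 ends up in kras_non_mutant only because it is absent from the MAF, which is exactly the case I am not sure how to interpret.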
Thanks!
Hi Kevin,
Thanks for the detailed clarification, much appreciated!