I was looking for mutation data in the TCGA data portal and through TCGAbiolinks, and I noticed that the sample sizes are not the same.
For instance, for TCGA-OV the TCGA data portal shows 419 cases, whereas TCGAbiolinks shows 462 samples. The file count is the same in both: 482.
So why are the numbers different?
This is my query in the TCGA data portal:
cases.project.project_id in ["TCGA-OV"] and files.analysis.workflow_type in ["Aliquot Ensemble Somatic Variant Merging and Masking"] and files.data_category in ["Simple Nucleotide Variation"] and files.data_type in ["Masked Somatic Mutation"]
This is the same query in the TCGAbiolinks package:
# load the packages used below
library(TCGAbiolinks)
library(maftools)

# query
query <- GDCquery(
  project = "TCGA-OV",
  data.category = "Simple Nucleotide Variation",
  access = "open",
  data.type = "Masked Somatic Mutation",
  workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking"
)
# download & read
GDCdownload(query)
maf <- GDCprepare(query)
mafr <- read.maf(maf)
mutations <- mafSummary(mafr)
# number of samples reported by maftools
print(as.numeric(mafr@summary[mafr@summary$ID == "Samples"]$summary))
You're comparing samples to cases. Can you check aliquot counts in both cases?
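To make the case/sample/aliquot distinction concrete, here is a minimal sketch in base R. It assumes the standard TCGA barcode layout, where the first 12 characters identify the case (participant), the first 16 identify the sample including its vial, and the full barcode identifies the aliquot. The barcodes and the helper name `count_levels` are made up for illustration; they are not real TCGA-OV data.

```r
# Toy barcodes (hypothetical, for illustration only).
barcodes <- c(
  "TCGA-AA-0001-01A-01D-0001-09",
  "TCGA-AA-0001-01A-01W-0002-09",  # same case and sample, different aliquot
  "TCGA-AA-0001-02A-01D-0003-09",  # same case, different sample
  "TCGA-AA-0002-01A-01D-0004-09"
)

# Count distinct identifiers at each level of the barcode hierarchy.
# Assumes the standard TCGA barcode layout described above.
count_levels <- function(barcodes) {
  c(
    cases    = length(unique(substr(barcodes, 1, 12))),
    samples  = length(unique(substr(barcodes, 1, 16))),
    aliquots = length(unique(barcodes))
  )
}

counts <- count_levels(barcodes)
print(counts)
```

In this toy input, four aliquots come from three samples and only two cases, so the three counts differ even though they describe the same data. On the real data, something like `count_levels(unique(maf$Tumor_Sample_Barcode))` might show at which level the portal and TCGAbiolinks numbers diverge, assuming the prepared MAF carries full aliquot barcodes in that column.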
I thought 482 files = 482 aliquots. Isn't that the case? Or, to ask it another way: how can I find the number of samples for a given TCGA query in the portal? This is the query link: TCGA-OV
I'm not entirely sure that num_files would equal num_aliquots. Please try and dig deeper to check if that's the case. I apologize, but I don't have the time to do a TCGA deep dive right now.