Question

Different sample counts between the TCGA portal and the TCGAbiolinks package

0

20 months ago

tyasird ▴ 10

I was looking at mutation data from the TCGA data portal and from TCGAbiolinks, and I realized that the sample counts are not the same.

For instance, for TCGA-OV the TCGA data portal shows 419 cases, whereas TCGAbiolinks shows 462 samples. The file count is the same for both: 482.

Why are they different?

This is my query in the TCGA data portal:

cases.project.project_id in ["TCGA-OV"] and files.analysis.workflow_type in ["Aliquot Ensemble Somatic Variant Merging and Masking"] and files.data_category in ["Simple Nucleotide Variation"] and files.data_type in ["Masked Somatic Mutation"]

This is the same query in the TCGAbiolinks package:

library(TCGAbiolinks)
library(maftools)

# query
query <- GDCquery(
  project = "TCGA-OV",
  data.category = "Simple Nucleotide Variation",
  access = "open",
  data.type = "Masked Somatic Mutation",
  workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking"
)

# download & read
GDCdownload(query)
maf <- GDCprepare(query)
mafr <- maftools::read.maf(maf)
mutations <- mafSummary(mafr)
# number of samples reported in the maftools summary
print(as.numeric(mafr@summary[mafr@summary$ID == "Samples"]$summary))

mutation tcga tcgabiolinks • 1.5k views

ADD COMMENT • link updated 19 months ago by Zhenyu Zhang ★ 1.3k • written 20 months ago by tyasird ▴ 10

0

You're comparing samples to cases. Can you check aliquot counts in both cases?

ADD REPLY • link 20 months ago by Ram 45k

0

I thought 482 files meant 482 aliquots. Isn't that the case? To put it another way: how can I find the number of samples for a given TCGA query in the portal? This is the query link: TCGA-OV

ADD REPLY • link 19 months ago by tyasird ▴ 10

1

I'm not entirely sure that num_files would equal num_aliquots. Please try and dig deeper to check if that's the case. I apologize, but I don't have the time to do a TCGA deep dive right now.

ADD REPLY • link 19 months ago by Ram 45k

score 0 · Answer 1 · 2023-10-15

0

20 months ago

Zhenyu Zhang ★ 1.3k

In the GDC query, you got 419 cases and 482 files (likely 482 aliquots). In the TCGAbiolinks query, you got 462 samples. These are counts of different things, so you are comparing apples to oranges.

ADD COMMENT • link 20 months ago by Zhenyu Zhang ★ 1.3k

0

When I open the 419 cases, it shows 418 females. Doesn't that mean there are 418 samples? If not, how can I get the number of samples from the TCGA portal for this query (TCGA-OV)?

ADD REPLY • link 19 months ago by tyasird ▴ 10

1

Most of the cases in GDC have at least one tumor sample and one normal sample, and some could have more tumor samples such as metastasis and new primary, etc. So case count is not sample count.
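To make the case/sample/aliquot distinction concrete, here is a minimal sketch that counts each level from TCGA barcodes. The barcodes are made up for illustration; the only thing relied on is the standard TCGA barcode layout, where the first 12 characters identify the case, the first 16 the sample (including vial), and the full barcode the aliquot.

```r
# Hypothetical aliquot barcodes (illustrative only, not real TCGA-OV data),
# in the form found in the Tumor_Sample_Barcode column of a MAF.
barcodes <- c(
  "TCGA-AA-0001-01A-01D-0001-09",  # case 0001, primary tumor (01), first aliquot
  "TCGA-AA-0001-01A-01W-0002-09",  # same sample, second aliquot
  "TCGA-AA-0001-02A-01D-0003-09",  # same case, recurrent tumor sample (02)
  "TCGA-AA-0002-01A-01D-0004-09"   # a second case
)

# Truncate the barcode to the level of interest and count unique values.
n_aliquots <- length(unique(barcodes))
n_samples  <- length(unique(substr(barcodes, 1, 16)))
n_cases    <- length(unique(substr(barcodes, 1, 12)))

counts <- c(cases = n_cases, samples = n_samples, aliquots = n_aliquots)
print(counts)
```

With these four made-up barcodes the three counts all differ (2 cases, 3 samples, 4 aliquots), which is the same kind of mismatch as 419 cases versus 462 samples versus 482 files.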

In the link you have, there are only a case tab and a file tab. There is no summary tab for samples. If you really want sample counts, you can learn the GDC API, or add all the files to the cart and download the sample sheet from the cart.
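As a rough sketch of the sample-sheet route, the snippet below counts files, cases, and samples from a sample sheet. The inline table is invented and stands in for the TSV downloaded from the cart. The column names ("Case ID", "Sample ID") and the comma-separated tumor/normal pairs per file are assumptions about the sheet layout, so check them against your own download; `split_ids` is a hypothetical helper.

```r
# Mock sample sheet (assumed layout; replace with read.delim("<your sample sheet>.tsv",
# check.names = FALSE) on the real file). Each MAF file lists a tumor and a normal sample.
sheet_text <- "File ID\tCase ID\tSample ID
f1\tTCGA-AA-0001, TCGA-AA-0001\tTCGA-AA-0001-01A, TCGA-AA-0001-10A
f2\tTCGA-AA-0001, TCGA-AA-0001\tTCGA-AA-0001-02A, TCGA-AA-0001-10A
f3\tTCGA-AA-0002, TCGA-AA-0002\tTCGA-AA-0002-01A, TCGA-AA-0002-11A"

sheet <- read.delim(text = sheet_text, check.names = FALSE,
                    stringsAsFactors = FALSE)

# Hypothetical helper: split comma-separated IDs and keep the unique ones.
split_ids <- function(x) unique(trimws(unlist(strsplit(x, ","))))

case_ids   <- split_ids(sheet[["Case ID"]])
sample_ids <- split_ids(sheet[["Sample ID"]])

# In TCGA barcodes, sample-type codes 01-09 are tumor and 10-19 are normal.
is_tumor <- as.integer(substr(sample_ids, 14, 15)) < 10

sheet_counts <- c(
  files         = nrow(sheet),
  cases         = length(case_ids),
  samples       = length(sample_ids),
  tumor_samples = sum(is_tumor)
)
print(sheet_counts)
```

Counting tumor samples separately matters here because a MAF only reports the tumor barcode, so the tumor-sample count is the one most comparable to what maftools prints.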

ADD REPLY • link 19 months ago by Zhenyu Zhang ★ 1.3k