Question

Mapping IDs and file names from TCGA datasets

2.5 years ago

Jaehyun ▴ 10

Hello,

I want to analyze multiple files from the TCGA-BRCA project downloaded from the GDC portal. However, I am having difficulty linking different data types from the same samples.

For example, the case TCGA-E2-A1IU has both proteome profiling and DNA methylation data. The problem is that the methylation file has its own unique name (e2ae1cb4-0422-4c7d-9878-ba8449afaacd.methylation_array.sesame.level3betas.txt), and the file itself contains no other IDs or barcodes.

I searched other files for the case, such as the biospecimen and clinical data files, but I could not find any information that maps case IDs to file IDs. I could write out all of the case ID and file name pairs one by one, but that would be inefficient.

I would appreciate it if you could suggest a method to solve this problem.

Thank you.

TCGA • 2.4k views

updated 2.5 years ago by Ram 45k • written 2.5 years ago by Jaehyun ▴ 10

score 2 · Accepted Answer · 2023-03-02

What I usually do is add all of the desired files to the cart and then download the sample sheet, which looks like this:

| File ID | File Name | Data Category | Data Type | Project ID | Case ID | Sample ID | Sample Type |
| --- | --- | --- | --- | --- | --- | --- | --- |
| efbd072e-9d71-4729-84d8-3eb34996078f | e2ae1cb4-0422-4c7d-9878-ba8449afaacd.methylation_array.sesame.level3betas.txt | DNA Methylation | Methylation Beta Value | TCGA-BRCA | TCGA-E2-A1IU | TCGA-E2-A1IU-01A | Primary Tumor |
| fc52a9be-ac8d-41ba-a9d2-a6dc556bc2cf | TCGA-E2-A1IU-01A-21-A17J-20_RPPA_data.tsv | Proteome Profiling | Protein Expression Quantification | TCGA-BRCA | TCGA-E2-A1IU | TCGA-E2-A1IU-01A | Primary Tumor |
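If you would rather not read the sample sheet by eye, a few lines of Python can turn it into a file-name-to-case lookup. This is only a minimal sketch: it assumes the sheet is tab-separated with the column headers shown above, and the function name `map_files_to_cases` is made up for illustration. The two rows are embedded as a string so the example runs on its own; in practice you would open the TSV you downloaded from the cart.

```python
import csv
import io

# Two rows copied from the sample sheet above (tab-separated).
# In practice, open the TSV downloaded from the GDC cart instead.
SAMPLE_SHEET = (
    "File ID\tFile Name\tData Category\tData Type\t"
    "Project ID\tCase ID\tSample ID\tSample Type\n"
    "efbd072e-9d71-4729-84d8-3eb34996078f\t"
    "e2ae1cb4-0422-4c7d-9878-ba8449afaacd.methylation_array.sesame.level3betas.txt\t"
    "DNA Methylation\tMethylation Beta Value\t"
    "TCGA-BRCA\tTCGA-E2-A1IU\tTCGA-E2-A1IU-01A\tPrimary Tumor\n"
    "fc52a9be-ac8d-41ba-a9d2-a6dc556bc2cf\t"
    "TCGA-E2-A1IU-01A-21-A17J-20_RPPA_data.tsv\t"
    "Proteome Profiling\tProtein Expression Quantification\t"
    "TCGA-BRCA\tTCGA-E2-A1IU\tTCGA-E2-A1IU-01A\tPrimary Tumor\n"
)


def map_files_to_cases(handle):
    """Return {file name: (case ID, sample ID, data type)} from a sample sheet.

    Hypothetical helper; columns are looked up by header name, so extra
    columns in the sheet are simply ignored.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["File Name"]: (row["Case ID"], row["Sample ID"], row["Data Type"])
        for row in reader
    }


mapping = map_files_to_cases(io.StringIO(SAMPLE_SHEET))
for file_name, (case_id, sample_id, data_type) in mapping.items():
    print(case_id, sample_id, data_type, file_name, sep="\t")
```

Both files then resolve to the same case and sample, which is the link that is missing from the methylation file name itself.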

It is annoying that there's no easier way to do this (I'm sure you could do it with the API, though), so most of the time I use the R/Bioconductor package TCGAbiolinks instead, which handles all the files and names on its own. You simply query your patients (e.g. TCGA-E2-A1IU; note that a query can only cover one tumor type, like BRCA) and your desired data type (e.g. Reverse Phase Protein Array; again, only one data type per query). So for example, if you have 5 BRCA patients and 3 data types you're interested in, you'd run 3 queries and get 3 different objects, each with columns corresponding to the 5 patients.
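For the API route mentioned in passing above, something along these lines might work. It is a rough sketch rather than a tested recipe: it assumes the public GDC `files` endpoint at `https://api.gdc.cancer.gov/files`, filtering on `file_name` and requesting the `cases.submitter_id` field, and the helper names (`build_payload`, `parse_hits`, `fetch_case_ids`) are invented here for illustration.

```python
import json
import urllib.request

# Assumed endpoint of the public GDC API.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_payload(file_names):
    """Build the JSON body asking for the case IDs behind the given file names."""
    names = list(file_names)
    return {
        "filters": {
            "op": "in",
            "content": {"field": "file_name", "value": names},
        },
        "fields": "file_name,cases.submitter_id",
        "format": "JSON",
        "size": len(names),
    }


def parse_hits(response):
    """Turn a decoded API response into {file name: [case IDs]}."""
    mapping = {}
    for hit in response["data"]["hits"]:
        cases = hit.get("cases", [])
        mapping[hit["file_name"]] = [case["submitter_id"] for case in cases]
    return mapping


def fetch_case_ids(file_names):
    """POST the query to the GDC API and return {file name: [case IDs]}.

    Needs network access; nothing in this block calls it automatically.
    """
    body = json.dumps(build_payload(file_names)).encode("utf-8")
    request = urllib.request.Request(
        GDC_FILES_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return parse_hits(json.load(reply))
```

You would call `fetch_case_ids` with the list of downloaded file names, for example the methylation file name from the question, and then join the result to your data by file name.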