Question

TCGA metadata intersection

9 months ago

Darked89 4.7k

Hello,

I am trying to intersect / match data available for the TCGA-BRCA project (open access only). To be more precise:

1. I looked for mutations in a particular gene using the GDC web portal. This gave me (simplified):

Case_ID Project
TCGA-BH-A2L8    TCGA-BRCA
TCGA-AR-A1AO    TCGA-BRCA
TCGA-A2-A1FZ    TCGA-BRCA

2. I extracted the Case_ID values and used them with the R package TCGAbiolinks:

library(TCGAbiolinks)

case_id_list <- c("TCGA-BH-A2L8", "TCGA-AR-A1AO")
# Query open-access RNA-Seq gene expression files for the selected cases
query_expression <- GDCquery(project = "TCGA-BRCA",
                             data.category = "Transcriptome Profiling",
                             data.type = "Gene Expression Quantification",
                             barcode = case_id_list,
                             experimental.strategy = "RNA-Seq",
                             sample.type = c("Primary Tumor", "Solid Tissue Normal"))
# Download the matching files into ./GDCdata
GDCdownload(query_expression)

The actual case_id_list was longer. I got a bunch of directories in ./GDCdata/TCGA-BRCA/Transcriptome_Profiling/Gene_Expression_Quantification/ whose names correspond to file_id values, such as 0877bc64-fbf4-427f-a889-21a4a9102600.

My problems:

A) Given the file ID of a TSV file, how do I find the case_ID/barcode (or any other identifier) that tells me which case the file belongs to?
B) Although I can redo the download of the TSV expression files, how do I find out whether the data in a given file come from "Primary Tumor" or "Solid Tissue Normal"?

I have tried:

library("GenomicDataCommons")
ge_manifest <- files() %>%
    filter( cases.project.project_id == 'TCGA-BRCA') %>% 
    filter( type == 'gene_expression' ) %>%
    filter( analysis.workflow_type == 'STAR - Counts')  %>%
    filter( access == 'open') %>%
    filter( file_id == '0877bc64-fbf4-427f-a889-21a4a9102600') %>%
    manifest()

head(ge_manifest)

But I do not see any column in the manifest that holds the information I need for A or B.

EDIT: Getting a bit closer:

files_ids <- c("0877bc64-fbf4-427f-a889-21a4a9102600",
               "08837ae7-6f4f-4aa1-8722-7c404b66ed75")

# Look up the cases that own these files; %in% matches against the whole vector
case_ids <- cases() %>%
    filter(project.project_id == "TCGA-BRCA") %>%
    filter(files.file_id %in% files_ids) %>%
    ids()

# case_ids contains the case UUIDs '30ec8b1f-28c4-4f46-8a1b-a8d51e558c7d', '87b85935-a058-44ad-8fb6-8511130eaffe'
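
For problems A and B, one possible route is to ask the GDC REST API directly for the case barcode and sample type attached to each file ID. The sketch below is in Python rather than R and makes several assumptions: that the `https://api.gdc.cancer.gov/files` endpoint accepts a POSTed JSON filter, and that the fields `cases.submitter_id` and `cases.samples.sample_type` are available for these files (worth checking against the GDC API documentation). The function names are illustrative and do not come from any package.

```python
import json
import urllib.request

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

# Fields requested per file (assumed to exist for open-access expression files)
FIELDS = [
    "file_id",
    "file_name",
    "cases.submitter_id",
    "cases.samples.submitter_id",
    "cases.samples.sample_type",
]


def build_payload(file_ids):
    """Build the JSON body for a POST to the GDC /files endpoint."""
    ids = list(file_ids)
    return {
        "filters": {
            "op": "in",
            "content": {"field": "files.file_id", "value": ids},
        },
        "fields": ",".join(FIELDS),
        "format": "JSON",
        "size": len(ids),
    }


def flatten_hits(response):
    """Turn the nested response into one flat row per file/sample pair."""
    rows = []
    for hit in response["data"]["hits"]:
        for case in hit.get("cases", []):
            for sample in case.get("samples", []):
                rows.append({
                    "file_id": hit["file_id"],
                    "file_name": hit.get("file_name"),
                    "case_id": case.get("submitter_id"),
                    "sample_barcode": sample.get("submitter_id"),
                    "sample_type": sample.get("sample_type"),
                })
    return rows


def fetch_file_annotations(file_ids):
    """Query the live GDC API (needs network access)."""
    body = json.dumps(build_payload(file_ids)).encode("utf-8")
    request = urllib.request.Request(
        GDC_FILES_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as handle:
        return flatten_hits(json.load(handle))
```

Calling `fetch_file_annotations(["0877bc64-fbf4-427f-a889-21a4a9102600"])` should then return rows in which `case_id` is a TCGA barcode rather than a UUID and `sample_type` covers point B, provided the field names above are right.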

R TCGA GenomicDataCommons TCGAbiolinks • 522 views

ADD COMMENT • link 9 months ago by Darked89 4.7k

score 1 · Answer 1 · 2024-02-26

9 months ago

Zhenyu Zhang ★ 1.2k

You probably want to tag this "TCGAbiolinks", because it is not a TCGA question but a question about that particular package. By the way, if it is a general TCGA/GDC question, you can add these files to the cart and download the sample sheet from there.
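
Once the sample sheet has been downloaded from the cart, reading it could look roughly like the sketch below. It assumes the sheet is tab-separated with columns named `File ID`, `Case ID` and `Sample Type`; those header names should be checked against the actual download. The demo rows are made up.

```python
import csv
import io


def read_sample_sheet(handle):
    """Map each File ID to its Case ID and Sample Type."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["File ID"]: {
            "case_id": row["Case ID"],
            "sample_type": row["Sample Type"],
        }
        for row in reader
    }


# Made-up stand-in for a downloaded sample sheet
demo_sheet = io.StringIO(
    "File ID\tFile Name\tCase ID\tSample Type\n"
    "file-uuid-1\texample.tsv\tTCGA-XX-0001\tPrimary Tumor\n"
)
demo_mapping = read_sample_sheet(demo_sheet)
```

For a real file, something like `with open(path, newline="") as fh: mapping = read_sample_sheet(fh)` would be used, where `path` points at the downloaded sheet.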

ADD COMMENT • link 9 months ago by Zhenyu Zhang ★ 1.2k

I have improved the tags as suggested. Although the code I posted uses two particular R libraries, I do not mind if the solution uses the GDC Python API or even curl. To make sense of the TCGA-BRCA expression data, e.g. starting from p53 mutations, I need at a minimum patient_id, mutation_type, mutation_site and expression_tsv_file. While one can download a few multi-column TSVs from the GDC website and work with those, I would prefer to have this as reproducible code. I am aware that some of the data is not properly curated (I checked drug therapy as a substitute for guessing breast cancer subtypes and found misspelled names, compound vs. commercial drug names, etc.), so some custom coding will be required.
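
To make the target table concrete, here is a minimal Python sketch of the join, assuming two inputs already exist as lists of dicts: mutation rows keyed by `patient_id`, and per-file rows carrying `case_id`, `file_name` and `sample_type`. All key names and the function name are hypothetical, chosen to match the columns listed above.

```python
from collections import defaultdict


def join_mutations_to_files(mutations, file_rows,
                            sample_types=("Primary Tumor",)):
    """Attach expression files to mutation rows via the patient barcode."""
    # Index the files by case, keeping only the wanted sample types
    files_by_case = defaultdict(list)
    for row in file_rows:
        if row["sample_type"] in sample_types:
            files_by_case[row["case_id"]].append(row)

    joined = []
    for mutation in mutations:
        # Patients without a matching expression file are dropped
        for row in files_by_case.get(mutation["patient_id"], []):
            joined.append({
                "patient_id": mutation["patient_id"],
                "mutation_type": mutation["mutation_type"],
                "mutation_site": mutation["mutation_site"],
                "expression_tsv_file": row["file_name"],
                "sample_type": row["sample_type"],
            })
    return joined
```

A patient with several mutations or several expression files yields one output row per combination, so duplicates by patient_id are expected and may need collapsing afterwards.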

ADD REPLY • link 9 months ago by Darked89 4.7k