TCGA metadata intersection
9 months ago
Darked89 4.7k

Hello,

I am trying to intersect / match data available for the TCGA-BRCA project (open access only). To be more precise:

1. I looked for mutations in a particular gene using the GDC web portal. This gave me (simplified):

Case_ID Project
TCGA-BH-A2L8    TCGA-BRCA
TCGA-AR-A1AO    TCGA-BRCA
TCGA-A2-A1FZ    TCGA-BRCA

2. I extracted the Case_ID values and used them in R with TCGAbiolinks:

library(TCGAbiolinks)

# Case IDs taken from the GDC portal mutation search (shortened here)
case_id_list <- c("TCGA-BH-A2L8", "TCGA-AR-A1AO")

# Query open-access RNA-Seq expression files for those cases
query_expression <- GDCquery(project = "TCGA-BRCA",
                             data.category = "Transcriptome Profiling",
                             data.type = "Gene Expression Quantification",
                             barcode = case_id_list,
                             experimental.strategy = "RNA-Seq",
                             sample.type = c("Primary Tumor", "Solid Tissue Normal"))
GDCdownload(query_expression)

The actual case_id_list was longer, and I got a bunch of directories in .//GDCdata/TCGA-BRCA/Transcriptome_Profiling/Gene_Expression_Quantification/ whose names correspond to file_id values, such as 0877bc64-fbf4-427f-a889-21a4a9102600

My problems:

A) Given a TSV file ID, how do I find the case_ID, barcode, or any other identifier that tells me which case the file belongs to? and
B) While I can redo the download of the TSV expression files, how do I find out whether the data come from "Primary Tumor" or "Solid Tissue Normal"?

I have tried:

library("GenomicDataCommons")
ge_manifest <- files() %>%
    filter( cases.project.project_id == 'TCGA-BRCA') %>% 
    filter( type == 'gene_expression' ) %>%
    filter( analysis.workflow_type == 'STAR - Counts')  %>%
    filter( access == 'open') %>%
    filter( file_id == '0877bc64-fbf4-427f-a889-21a4a9102600') %>%
    manifest()

head(ge_manifest)

But I do not see any column that resembles the values I need for points A or B.

EDIT: Getting a bit closer:

files_ids <- c("0877bc64-fbf4-427f-a889-21a4a9102600",
               "08837ae7-6f4f-4aa1-8722-7c404b66ed75")

# Match any of several file IDs with %in% rather than ==
case_ids <- cases() %>%
    filter(project.project_id == "TCGA-BRCA") %>%
    filter(files.file_id %in% files_ids) %>%
    ids()

# case_ids contains '30ec8b1f-28c4-4f46-8a1b-a8d51e558c7d', '87b85935-a058-44ad-8fb6-8511130eaffe'
Tags: R, TCGA, GenomicDataCommons, TCGAbiolinks
9 months ago
Zhenyu Zhang ★ 1.2k

You probably want to tag this "TCGAbiolinks", because it is a question about that particular package rather than about TCGA itself. By the way, if it is a general TCGA/GDC question, you can add these files to the cart on the GDC portal and download the sample sheet from there.
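Once the sample sheet is downloaded, the lookup from file ID to case and sample type can be scripted. Below is a minimal Python sketch, not a definitive implementation. It assumes the sheet is tab-separated and has columns named "File ID", "Case ID" and "Sample Type"; check the header of your own download, since the column names are an assumption here. The function name and the two example rows are made up for illustration.

```python
import csv
import io

# Assumed column names of a GDC sample sheet; verify against your own file.
FILE_ID_COL = "File ID"
CASE_ID_COL = "Case ID"
SAMPLE_TYPE_COL = "Sample Type"


def read_sample_sheet(handle):
    """Return a dict mapping each file ID to its case ID and sample type."""
    reader = csv.DictReader(handle, delimiter="\t")
    lookup = {}
    for row in reader:
        lookup[row[FILE_ID_COL]] = {
            "case_id": row[CASE_ID_COL],
            "sample_type": row[SAMPLE_TYPE_COL],
        }
    return lookup


# Made-up two-row sheet for illustration only (not real GDC output).
example_sheet = (
    "File ID\tFile Name\tCase ID\tSample ID\tSample Type\n"
    "file-uuid-1\ta.tsv\tTCGA-XX-0001\tTCGA-XX-0001-01A\tPrimary Tumor\n"
    "file-uuid-2\tb.tsv\tTCGA-XX-0002\tTCGA-XX-0002-11A\tSolid Tissue Normal\n"
)
lookup = read_sample_sheet(io.StringIO(example_sheet))
print(lookup["file-uuid-2"]["sample_type"])
```

For a real download, pass an open file instead of the in-memory string, e.g. `open("gdc_sample_sheet.tsv", newline="")` (hypothetical filename). The directory names under GDCdata are file IDs, so they can be used directly as keys into `lookup`.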


Improved the tags as suggested. Although I posted code using two particular R libraries, I do not care whether the solution uses the GDC Python API or even curl. I just want to make sense of the TCGA-BRCA expression data, i.e. starting with p53 mutations I need, as a minimum, patient_id, mutation_type, mutation_site and expression_tsv_file. While one can get a few multicolumn TSVs from the GDC website and work with these, I would prefer to have this as reproducible code. I am aware that some data (I checked drug therapy as a substitute for guessing breast cancer subtypes) are not properly curated (misspelled names, compound vs. commercial drug names, etc.), so "some custom coding required".
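Since any language is acceptable, one reproducible route is to ask the GDC REST API for the case barcode and sample type of each file ID. The following is a minimal Python sketch under stated assumptions, not a tested pipeline: it assumes the files endpoint at https://api.gdc.cancer.gov/files accepts a POST with a JSON filter, and that the fields `cases.submitter_id` and `cases.samples.sample_type` are returned nested under each hit. The function names are hypothetical.

```python
import json
import urllib.request

# Assumed GDC files endpoint.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

# Assumed field names: case barcode and sample type for each file.
FIELDS = [
    "file_id",
    "file_name",
    "cases.submitter_id",
    "cases.samples.submitter_id",
    "cases.samples.sample_type",
]


def build_payload(file_ids):
    """Build the JSON body selecting the given file IDs."""
    file_ids = list(file_ids)
    return {
        "filters": {
            "op": "in",
            "content": {"field": "file_id", "value": file_ids},
        },
        "fields": ",".join(FIELDS),
        "format": "JSON",
        "size": len(file_ids),
    }


def flatten_hits(response):
    """Turn the nested response into one row per file/case/sample."""
    rows = []
    for hit in response["data"]["hits"]:
        for case in hit.get("cases", []):
            for sample in case.get("samples", []):
                rows.append({
                    "file_id": hit.get("file_id"),
                    "file_name": hit.get("file_name"),
                    "case_id": case.get("submitter_id"),
                    "sample_id": sample.get("submitter_id"),
                    "sample_type": sample.get("sample_type"),
                })
    return rows


def fetch_file_metadata(file_ids):
    """POST the query to the GDC API and return flattened rows (needs network)."""
    request = urllib.request.Request(
        GDC_FILES_ENDPOINT,
        data=json.dumps(build_payload(file_ids)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return flatten_hits(json.load(reply))
```

Calling `fetch_file_metadata(["0877bc64-fbf4-427f-a889-21a4a9102600"])` should then give rows that answer both A (case_id) and B (sample_type), provided the assumptions above hold. Joining these rows with a mutation table on case_id would give the patient_id / mutation / expression file table described above.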
