Question

TCGA RNA-Seq data and CLINICAL Data

3.0 years ago

albacaro • 0

Hello,

I am an IT student developing code to analyze RNA-Seq data from TCGA. The problem is that I have barely studied biology, and I am having difficulty understanding how the TCGA/GDC portal works.

My biggest problem is how to link my RNA-Seq data with its clinical information. For example, I searched TCGA-ACC with experimental_strategy = RNA-Seq, data_category = Transcriptome Profiling, data_type = Gene Expression Quantification, and workflow_type = STAR - Counts. I got a total of 79 files and saved data that looks like this:

| gene_id | gene_name | gene_type | unstranded | stranded_first | stranded_second | tpm_unstranded | fpkm_unstranded | fpkm_uq_unstranded |
|---|---|---|---|---|---|---|---|---|
| ENSG00000000003.15 | TSPAN6 | protein_coding | 2086 | 1014 | 1073 | 14.6098 | 10.0166 | 14.3766 |
| ENSG00000000005.6 | TNMD | protein_coding | 2 | 1 | 1 | 0.0430 | 0.0295 | 0.0424 |

Then I downloaded the clinical data for TCGA-ACC and got 92 cases.

Now, to be honest, I don't understand how to link my 79 RNA-Seq files with these 92 cases. I have already compared the patient barcode with the file's Entity ID. For example, for this file: https://portal.gdc.cancer.gov/files/d35e06d0-385c-4ba1-aaa3-e71c89effbdc, the Entity ID is TCGA-OR-A5JD-01A-11R-A29S-07, but if I look for TCGA-OR-A5JD in the nationwidechildrens_clinical_patient_acc file I downloaded, I CANNOT find it. The reverse also fails: I pick a barcode from the clinical data and cannot find its RNA-Seq file.

So my question is: am I doing something wrong in this procedure? Or should I just use the files whose Entity ID also appears in the clinical data and drop the ones I cannot link?

The reason I am trying to understand this is that I have found tutorials like this one: http://combine-australia.github.io/RNAseq-R/06-rnaseq-day1.html, where, apart from gene counts like the ones I have, it uses a sample info table. That table is very useful for more descriptive plots, like the MDS plot in the Quality Control > Multidimensional scaling plots part, so I want to do something similar with the TCGA data.

P.S.: I am not a native English speaker, so I am sorry if I made some mistakes in writing.

P.S.2: Any tutorial, guide, code example, or advice is welcome and really appreciated. Thanks for the help! :)

P.S.3: I am developing a new API for this in Python, so I cannot just use R facilities such as Bioconductor or DESeq2!

rna-seq barcode clinical gdc tcga • 2.3k views

updated 3.0 years ago by Zhenyu Zhang ★ 1.3k • written 3.0 years ago by albacaro • 0

score 1 · Answer 1 · 2022-07-11

Unfortunately, I can't tell you how the programmatic access to the portal works and how the API responses are structured. But if you browse the portal as a human, you would first select the project, in your case the TCGA Adrenocortical Carcinomas. In total, there are indeed 92 patients in this project.

For each case, several biopsies, slides, and analyses may have been made, and each of them has its own distinct UUID. All of this information is contained in the Biospecimen information, which you can obtain from the blue button at the top right. I recommend downloading the JSON and exploring it with JSON Visio to get a feeling for the hierarchy.

Right next to the Biospecimen button, you will find the Clinical information, again ideally viewed as JSON. This is the information that you want to match to the Biospecimen data via the Case UUID.
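To make the matching step concrete, here is a minimal Python sketch of joining the two JSON exports on the case UUID. It is only an illustration under assumptions: I am assuming both exports are lists of case records carrying a `case_id` field, and that the Biospecimen export nests `samples` > `portions` > `analytes` > `aliquots`, each with a `submitter_id` barcode. Check those key names against your own download. The function names and the toy records (including the placeholder UUID) are made up for the example.

```python
def iter_aliquots(biospecimen_cases):
    """Yield (case_id, aliquot_barcode) for every aliquot in the nested export.

    Assumed nesting: case > samples > portions > analytes > aliquots.
    """
    for case in biospecimen_cases:
        for sample in case.get("samples", []):
            for portion in sample.get("portions", []):
                for analyte in portion.get("analytes", []):
                    for aliquot in analyte.get("aliquots", []):
                        yield case["case_id"], aliquot["submitter_id"]


def link_aliquots_to_clinical(biospecimen_cases, clinical_cases):
    """Map each aliquot barcode to the clinical record sharing its case UUID."""
    clinical_by_case = {case["case_id"]: case for case in clinical_cases}
    return {
        barcode: clinical_by_case.get(case_id)  # None if no clinical record
        for case_id, barcode in iter_aliquots(biospecimen_cases)
    }


# Toy records with placeholder values, NOT real GDC content.
# With real downloads you would json.load() the two exported files instead.
biospecimen = [{
    "case_id": "placeholder-case-uuid",
    "submitter_id": "TCGA-OR-A5JD",
    "samples": [{"portions": [{"analytes": [{"aliquots": [
        {"submitter_id": "TCGA-OR-A5JD-01A-11R-A29S-07"}
    ]}]}]}],
}]
clinical = [{"case_id": "placeholder-case-uuid", "demographic": {"note": "placeholder"}}]

linked = link_aliquots_to_clinical(biospecimen, clinical)
print(linked["TCGA-OR-A5JD-01A-11R-A29S-07"]["case_id"])
```

The aliquot barcode is what the portal shows as a file's Entity ID, so the resulting dictionary can be looked up directly by Entity ID.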

I hope this helps and please keep us in the loop once your Python API is ready to be used!

PS: If you just click on Exploration in the top bar, you can download an 11 MB JSON file with all 12,282 TCGA cases: click the JSON button above the table.

PPS: If you are stuck with the GDC portal, you might also want to check the cBio Portal, which hosts TCGA and other cancer data and has its own API. API doc here.

score 0 · Answer 2 · 2022-08-09

If you don't want to use the API or GDC-specific R packages, your best bet is to add these files to the cart and then go directly to the cart. There you will find sample sheet and clinical data download buttons, which give you the exact case metadata linked to these files.

The file-to-case linkage is in the "sample_sheet" file.
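Since the question asks for Python, here is a rough sketch of that linkage using only the standard library. Treat the column names as assumptions: I am assuming the sample sheet is tab-separated with "File Name" and "Sample ID" columns, and that the clinical table has a patient barcode column (called `case_submitter_id` here; adjust it to whatever your download uses). The helper names and the toy rows are invented for illustration. The only part that is fixed is the barcode layout: the first three hyphen-separated fields of a TCGA barcode identify the patient.

```python
import csv
import io


def patient_barcode(barcode):
    """Return the patient part (TCGA-XX-XXXX) of a longer TCGA barcode."""
    return "-".join(barcode.strip().split("-")[:3])


def files_to_patients(sample_sheet_text):
    """Map each file name in a tab-separated sample sheet to a patient barcode."""
    reader = csv.DictReader(io.StringIO(sample_sheet_text), delimiter="\t")
    mapping = {}
    for row in reader:
        # Assumption: if several sample IDs are listed, the first one is enough
        # to identify the patient.
        first_sample = row["Sample ID"].split(",")[0]
        mapping[row["File Name"]] = patient_barcode(first_sample)
    return mapping


def attach_clinical(file_to_patient, clinical_rows, key="case_submitter_id"):
    """Look up the clinical row (a dict) for each file; None if there is no match."""
    by_patient = {row[key]: row for row in clinical_rows}
    return {name: by_patient.get(pid) for name, pid in file_to_patient.items()}


# Toy inputs with a hypothetical file name and placeholder clinical values.
sheet = "\n".join([
    "\t".join(["File ID", "File Name", "Case ID", "Sample ID"]),
    "\t".join(["d35e06d0-385c-4ba1-aaa3-e71c89effbdc", "example_counts.tsv",
               "TCGA-OR-A5JD", "TCGA-OR-A5JD-01A"]),
])
clinical_rows = [{"case_submitter_id": "TCGA-OR-A5JD", "note": "placeholder"}]

file_to_patient = files_to_patients(sheet)
annotated = attach_clinical(file_to_patient, clinical_rows)
print(file_to_patient)
```

Files that come back with `None` have no matching clinical row, which makes it easy to see how many of the RNA-Seq files actually link to a case before deciding whether to drop any.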