Hello,
I am an IT student developing some code to analyze RNA-Seq data from TCGA. The problem is that I have barely studied biology, and I am having difficulty understanding how the TCGA/GDC portal works.
My biggest problem is how to link my RNA-Seq data with its clinical information. For example, I searched for TCGA-ACC with experimental_strategy = RNA-Seq, data_category = Transcriptome Profiling, data_type = Gene Expression Quantification, and workflow_type = STAR - Counts. I got a total of 79 files, and I saved this information:

| gene_id | gene_name | gene_type | unstranded | stranded_first | stranded_second | tpm_unstranded | fpkm_unstranded | fpkm_uq_unstranded |
|---|---|---|---|---|---|---|---|---|
| ENSG00000000003.15 | TSPAN6 | protein_coding | 2086 | 1014 | 1073 | 14.6098 | 10.0166 | 14.3766 |
| ENSG00000000005.6 | TNMD | protein_coding | 2 | 1 | 1 | 0.0430 | 0.0295 | 0.0424 |

Then I downloaded the clinical data for TCGA-ACC and got 92 cases.
Now, to be honest, I don't understand how to link my 79 RNA-Seq files with the 92 cases. I have already compared the patient barcode with each file's Entity ID. For example, this file: https://portal.gdc.cancer.gov/files/d35e06d0-385c-4ba1-aaa3-e71c89effbdc has the Entity ID TCGA-OR-A5JD-01A-11R-A29S-07, but if I look for TCGA-OR-A5JD in the nationwidechildrens_clinical_patient_acc file I downloaded, I cannot find it. The reverse also fails: I pick a barcode from the clinical data and cannot find its RNA-Seq file.
So my question is: am I doing something wrong in this procedure? Or should I just use the files whose Entity ID also appears in the clinical data and drop the ones I cannot link?
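
To make the matching step concrete, here is a minimal Python sketch of the linkage as I understand it. It relies on the TCGA barcode convention that the first three dash-separated fields of an Entity ID (for example TCGA-OR-A5JD) identify the patient. The clinical column name `bcr_patient_barcode`, the helper names `patient_barcode` and `link_files_to_clinical`, and the toy rows (including the made-up barcode TCGA-XX-0001) are assumptions for illustration only, not taken from my real files.

```python
import csv
import io


def patient_barcode(entity_id):
    """Keep the first three dash-separated fields, e.g. TCGA-OR-A5JD."""
    return "-".join(entity_id.strip().upper().split("-")[:3])


def link_files_to_clinical(entity_ids, clinical_rows, key="bcr_patient_barcode"):
    """Map each Entity ID to its clinical row; also report the unmatched IDs.

    `key` is an assumed column name; adjust it to the real clinical header.
    """
    by_patient = {row[key].strip().upper(): row for row in clinical_rows}
    linked, unmatched = {}, []
    for entity_id in entity_ids:
        pid = patient_barcode(entity_id)
        if pid in by_patient:
            linked[entity_id] = by_patient[pid]
        else:
            unmatched.append(entity_id)
    return linked, unmatched


# Toy clinical table (placeholder values, not real TCGA data).
toy_clinical = (
    "bcr_patient_barcode\tnote\n"
    "tcga-or-a5jd\ttoy row A\n"
    "TCGA-XX-0001\ttoy row B\n"
)
rows = list(csv.DictReader(io.StringIO(toy_clinical), delimiter="\t"))

entity_ids = [
    "TCGA-OR-A5JD-01A-11R-A29S-07",  # Entity ID from the file linked above
    "TCGA-XX-9999-01A-11R-A29S-07",  # made-up ID with no clinical row
]
linked, unmatched = link_files_to_clinical(entity_ids, rows)
print(sorted(linked), unmatched)
```

Upper-casing and stripping both sides before comparing is deliberate: if the clinical file stores barcodes in a different case or with stray whitespace, an exact text search would miss them. The `unmatched` list makes it easy to see which files would be dropped.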
The reason I am trying to understand this is that I have found tutorials like this one: http://combine-australia.github.io/RNAseq-R/06-rnaseq-day1.html where, apart from the gene counts data that I have, a sample info table is also used (very useful for more descriptive plots, like the MDS plot in the Quality Control > Multidimensional scaling plots part). I want to do something similar with the TCGA data.
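
As a starting point for such a sample info table, part of it can be derived from the Entity ID itself, since a TCGA aliquot barcode encodes the patient, sample type, vial, portion, analyte, plate, and center. The sketch below is only an illustration: the function name `parse_barcode`, the output field names, and the abbreviated `SAMPLE_TYPES` lookup (only a few of the TCGA sample type codes) are my own choices, not part of any GDC tool.

```python
# A few TCGA sample type codes; the full code table is longer.
SAMPLE_TYPES = {
    "01": "Primary Solid Tumor",
    "02": "Recurrent Solid Tumor",
    "06": "Metastatic",
    "10": "Blood Derived Normal",
    "11": "Solid Tissue Normal",
}


def parse_barcode(barcode):
    """Split a 7-field TCGA aliquot barcode into labelled parts."""
    parts = barcode.strip().upper().split("-")
    if len(parts) != 7 or parts[0] != "TCGA":
        raise ValueError(f"not a full TCGA aliquot barcode: {barcode!r}")
    _, tss, participant, sample_vial, portion_analyte, plate, center = parts
    sample_code = sample_vial[:2]
    return {
        "barcode": "-".join(parts),
        "patient": "-".join(parts[:3]),
        "tissue_source_site": tss,
        "sample_code": sample_code,
        # Codes missing from the abbreviated lookup fall back to "other".
        "sample_type": SAMPLE_TYPES.get(sample_code, "other"),
        "vial": sample_vial[2:],
        "portion": portion_analyte[:2],
        "analyte": portion_analyte[2:],
        "plate": plate,
        "center": center,
    }


# One row of a sample info table, built from the Entity ID mentioned above.
sample_info = [parse_barcode("TCGA-OR-A5JD-01A-11R-A29S-07")]
for row in sample_info:
    print(row["patient"], row["sample_type"], row["analyte"])
```

The `patient` field is the key that would join each row to the clinical data, and `sample_type` is the kind of grouping variable that the tutorial's MDS plot colours by.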
P.S.: I am not a native English speaker, so I am sorry if I made some mistakes in my writing.
P.S.2: Any tutorial, guide, code example, or advice will be welcome and really appreciated. So, thanks for the help! :)
P.S.3: I am developing a new API for this in Python, so I cannot just use R tools such as Bioconductor, DESeq2...!
Thank you very much for the help! :)