Understanding TCGA Pancancer data from cBioPortal
2.3 years ago
James ▴ 30

Hello All,

I'm a machine learning graduate student (PhD) working on a bioinformatics project. I'm a bit of a bioinformatics newbie so sorry if this is a dumb or duplicate question.

I'm working with TCGA PanCancer data and from what I've seen cBioPortal seems to be the easiest to use.

Specifically, I am working with the invasive breast carcinoma data from here on cBioPortal. There are four files with RNA-seq data:

  • data_mrna_seq_v2_rsem.txt
  • data_mrna_seq_v2_rsem_zscores_ref_normal_samples.txt
  • data_mrna_seq_v2_rsem_zscores_ref_diploid_samples.txt
  • data_mrna_seq_v2_rsem_zscores_ref_all_samples.txt

My hope is to use the data from data_mrna_seq_v2_rsem.txt but I can't find the units used here. The metadata says this is "mRNA Expression, RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)" but I can't tell if this is TPM, log_2(normalized_count+1), or raw read counts.

I've seen some sources saying it's log_2(normalized_count +1) but some of the values in the data set in data_mrna_seq_v2_rsem.txt, such as 408.076, seem to be way too high to be log transformed.

My question is: how do I find out what these units are, and is there a 'best' way to access TCGA data?

I'm a math-oriented guy and I'm the only person in my lab, so sorry if this question is a pain.

Cancer RNA-Seq cBioPortal • 2.5k views

I personally prefer getting pan-cancer TCGA data from https://xenabrowser.net/datapages/?dataset=TCGA-GTEx-TARGET-gene-exp-counts.deseq2-normalized.log2&host=https%3A%2F%2Ftoil.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443. It is clear what they did (they at least try to normalize between samples via DESeq2), and they have their code available somewhere, iirc.

But in any case, I'm pretty sure that file you listed is simply RSEM estimated raw read counts. It's definitely not log-transformed. If it were TPM, the sum of the values across all genes/transcripts for any given sample would always be 1 million.
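Both checks above are easy to run yourself. Here is a minimal sketch using only the Python standard library. It assumes the file is a tab-separated matrix with one row per gene, one column per sample, and two identifier columns named Hugo_Symbol and Entrez_Gene_Id; the 1% tolerance and the ceiling of 30 are arbitrary thresholds I picked for illustration, and the toy matrix is made up.

```python
import csv
import io

# Assumed identifier columns; adjust if your file's header differs.
ID_COLUMNS = ("Hugo_Symbol", "Entrez_Gene_Id")


def sample_stats(tsv_text, id_columns=ID_COLUMNS):
    """Return the per-sample total and maximum of a genes-by-samples matrix."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    samples = [c for c in reader.fieldnames if c not in id_columns]
    stats = {s: {"total": 0.0, "max": 0.0} for s in samples}
    for row in reader:
        for s in samples:
            raw = row[s]
            if raw is None or raw in ("", "NA"):
                continue  # skip missing values
            value = float(raw)
            stats[s]["total"] += value
            stats[s]["max"] = max(stats[s]["max"], value)
    return stats


def looks_like_tpm(stats, tolerance=0.01):
    """TPM columns should each sum to roughly one million."""
    return all(abs(v["total"] - 1e6) <= tolerance * 1e6 for v in stats.values())


def could_be_log2(stats, ceiling=30.0):
    """log2(count + 1) values stay small; a value in the hundreds rules it out."""
    return all(v["max"] <= ceiling for v in stats.values())


# Made-up toy matrix, only to show the calls.
toy = (
    "Hugo_Symbol\tEntrez_Gene_Id\tS1\tS2\n"
    "GENE_A\t1\t408.076\t10\n"
    "GENE_B\t2\t1200.5\tNA\n"
)
stats = sample_stats(toy)
print(looks_like_tpm(stats), could_be_log2(stats))
```

For the real file, read its contents with open(path).read() and pass that string to sample_stats. Neither check proves what the units are; they only rule out the TPM and log2 hypotheses.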

2.3 years ago
Ernest Bonat ▴ 30

Hello James,

I ran into the same issue when I looked at this dataset before. Here is the dataset I used to apply classification machine learning algorithms: gene expression cancer RNA-Seq Data Set. I have done many projects applying machine learning to genomics datasets. Let me know how I can help you with it.

22 months ago

I guess you already got the answer, but I recommend reading the meta_XXX.txt file in the same repository. For example, you can find a brief description of 'data_mrna_seq_v2_rsem_zscores_ref_normal_samples.txt' in 'meta_mrna_seq_v2_rsem_zscores_ref_normal_samples'. It contains the following:

  • cancer_study_identifier: coadread_tcga_pan_can_atlas_2018
  • genetic_alteration_type: MRNA_EXPRESSION
  • datatype: Z-SCORE
  • stable_id: rna_seq_v2_mrna_median_all_sample_ref_normal_Zscores
  • show_profile_in_analysis_tab: TRUE
  • profile_name: mRNA expression z-scores relative to normal samples (log RNA Seq V2 RSEM)
  • profile_description: Expression z-scores of tumor samples compared to the expression distribution of all log-transformed mRNA expression of adjacent normal samples in the cohort.
  • data_filename: data_mrna_seq_v2_rsem_zscores_ref_normal_samples.txt
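If you want to read these fields programmatically, something like the following should work. It is only a sketch that assumes each line of the meta file has the form "key: value", as in the listing above; parse_meta is a helper name I made up, not part of any cBioPortal tooling.

```python
def parse_meta(text):
    """Parse 'key: value' lines into a dict, splitting on the first colon only."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that have no colon at all
            meta[key.strip()] = value.strip()
    return meta


# A few of the fields quoted above, used as sample input.
example = """\
cancer_study_identifier: coadread_tcga_pan_can_atlas_2018
genetic_alteration_type: MRNA_EXPRESSION
datatype: Z-SCORE
data_filename: data_mrna_seq_v2_rsem_zscores_ref_normal_samples.txt
"""
meta = parse_meta(example)
print(meta["datatype"], meta["data_filename"])
```

Looping this over every meta_*.txt file in a study folder gives a quick table of which data file holds which datatype.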