Hello All,
I'm a machine learning graduate student (PhD) working on a bioinformatics project. I'm a bit of a bioinformatics newbie so sorry if this is a dumb or duplicate question.
I'm working with TCGA PanCancer data and from what I've seen cBioPortal seems to be the easiest to use.
Specifically, I am working with the invasive breast carcinoma data on cBioPortal. There are four files with RNA-seq data:
data_mrna_seq_v2_rsem.txt
data_mrna_seq_v2_rsem_zscores_ref_normal_samples.txt
data_mrna_seq_v2_rsem_zscores_ref_diploid_samples.txt
data_mrna_seq_v2_rsem_zscores_ref_all_samples.txt
My hope is to use the data from data_mrna_seq_v2_rsem.txt, but I can't find the units used here. The metadata says this is "mRNA Expression, RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)" but I can't tell if this is TPM, log_2(normalized_count+1), or raw read counts.
I've seen some sources saying it's log_2(normalized_count+1), but some of the values in data_mrna_seq_v2_rsem.txt, such as 408.076, seem to be way too high to be log-transformed.
My question is: how do I find out what these units are? And is there a 'best' way to access TCGA data?
I'm a math-oriented guy and I'm the only person in my lab, so sorry if this question is a pain.
I personally prefer getting pan-cancer TCGA data from https://xenabrowser.net/datapages/?dataset=TCGA-GTEx-TARGET-gene-exp-counts.deseq2-normalized.log2&host=https%3A%2F%2Ftoil.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443 . It is clear what they did (they at least try to normalize between samples via DESeq2), and they have their code available somewhere, iirc.
But in any case, I'm pretty sure that file you listed is simply RSEM estimated raw read counts. It's definitely not log-transformed. If it were TPM, the sum of the values across all genes/transcripts for any given sample would always be 1 million.
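If you want to check this yourself, here is a rough sketch in Python (pandas) of the sanity checks described above: per-sample sums for the TPM test, and the largest value for the log test (a log_2(x+1) value of 408 would imply x around 2^408, which is impossible for read counts). The function name summarize_units is made up for illustration, and I'm assuming the file is tab-separated with Hugo_Symbol and Entrez_Gene_Id columns followed by one column per sample, so adjust if your copy looks different.

```python
import numpy as np
import pandas as pd


def summarize_units(expr, id_cols=("Hugo_Symbol", "Entrez_Gene_Id")):
    """Return simple statistics that hint at the units of an expression matrix.

    `expr` is assumed to be genes x samples, plus optional identifier columns.
    The column names in `id_cols` are an assumption about the file layout.
    """
    # Drop gene identifier columns so only per-sample values remain.
    values = expr.drop(columns=[c for c in id_cols if c in expr.columns])
    # Coerce anything non-numeric (e.g. "NA" strings) to NaN.
    values = values.apply(pd.to_numeric, errors="coerce")

    # Sum over genes for each sample; NaNs are skipped.
    col_sums = values.sum(axis=0, skipna=True)

    return {
        "min_sum": float(col_sums.min()),
        "max_sum": float(col_sums.max()),
        # TPM columns should each sum to about 1e6 (1% tolerance here).
        "sums_look_like_tpm": bool(np.allclose(col_sums, 1e6, rtol=0.01)),
        # Values in the hundreds or more rule out a log2 scale.
        "max_value": float(values.max().max()),
        # Negative values would suggest z-scores or centered data.
        "has_negative": bool((values < 0).any().any()),
    }


# Usage on the real file (path is whatever you downloaded):
# expr = pd.read_csv("data_mrna_seq_v2_rsem.txt", sep="\t")
# print(summarize_units(expr))
```

If the per-sample sums differ from each other and are nowhere near 1 million, it isn't TPM; if max_value is in the hundreds or thousands, it isn't log-transformed. This only rules things out, it doesn't tell you exactly which normalization was applied.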