I tried but failed to find any "guideline" from the TCGA consortium on how to compare the RNA-seq data it generated. Could anyone point me in the right direction? Thanks.
The intent here is to compare the expression of gene X (or genes X1, X2, X3, etc.) across multiple sample sets (e.g. lung vs. breast vs. brain tumors), not to run a differential expression analysis within each sample set and then compare the results across sets. (I am also not sure whether these two intents call for different methods or units of RNA expression.)
As I understand it, we should probably apply TMM or an equivalent between-sample normalization method to TCGA RNA expression data before comparing expression across samples. However, I could not find any formal documentation or publication on this.
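To make concrete what I mean by a between-sample normalization step, here is a minimal sketch. It is not TMM itself: it uses the simpler median-of-ratios idea (the approach behind DESeq-style size factors), only because it fits in a few lines of plain Python. The function name, the toy counts, and the two-sample layout are all made up for illustration, and I am not claiming this is what TCGA does.

```python
import math
from statistics import median

def median_of_ratios_size_factors(counts):
    """counts: list of per-gene rows, each a list of raw read counts (one per sample).

    Returns one scaling ("size") factor per sample.
    """
    n_samples = len(counts[0])
    # Keep only genes with a nonzero count in every sample so the log is defined.
    expressed = [row for row in counts if all(c > 0 for c in row)]
    log_ratios = [[] for _ in range(n_samples)]
    for row in expressed:
        logs = [math.log(c) for c in row]
        # Log of the gene's geometric mean across samples (a pseudo-reference sample).
        log_geo_mean = sum(logs) / n_samples
        for j, value in enumerate(logs):
            log_ratios[j].append(value - log_geo_mean)
    # Each sample's size factor is its median ratio to the pseudo-reference.
    return [math.exp(median(r)) for r in log_ratios]

# Hypothetical toy data: rows are genes, columns are two samples.
# The second sample was "sequenced twice as deeply" as the first.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 5],  # dropped when estimating factors because of the zero
]
size_factors = median_of_ratios_size_factors(counts)
normalized = [[c / s for c, s in zip(row, size_factors)] for row in counts]
```

After dividing by the size factors, the three genes expressed in both samples have matching values, which is the property I would want before comparing gene X across sample sets.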
Efforts and background on my part
(a) I am not trained in statistics or bioinformatics.
(b) I've read the info in
https://haroldpimentel.wordpress.com/ and https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
(c) Although TCGA provides documentation on RNA-seq analysis (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#examples),
it stops at FPKM or FPKM-UQ for RNA expression normalization. (What is the value of knowing FPKM if you cannot compare it among samples?)
(d) Several TCGA publications I read did not provide details on RNA-seq "analyses" in their materials and methods. To my surprise, the ones with some details cite references from the microarray era and do not discuss FPKM/RPKM/TPM/TMM-related issues.
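For reference, the one unit conversion I am fairly sure about is FPKM to TPM: within a single sample, TPM is the FPKM value divided by the sum of all FPKM values in that sample, times one million. A minimal sketch follows, with a made-up function name and made-up FPKM values. As far as I understand, this only rescales within a sample and does not by itself correct for composition differences between tissues, which is what TMM-like methods target.

```python
def fpkm_to_tpm(fpkm):
    """Convert one sample's FPKM values (one per gene) to TPM."""
    total = sum(fpkm)
    # Rescale so that the values in this sample sum to one million.
    return [value / total * 1e6 for value in fpkm]

# Hypothetical FPKM values for three genes in a single sample.
sample_fpkm = [5.0, 15.0, 30.0]
sample_tpm = fpkm_to_tpm(sample_fpkm)
```

Every sample converted this way sums to the same total, so TPM values are at least on a common scale across samples, unlike raw FPKM.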
There is some additional discussion about TCGA normalization and that table specifically here: What batch correction was applied to pan-Cancer mRNA expression data?