Question

Did the TCGA consortium recommend any best-practice guidelines for cross-sample RNA-seq comparison?

4.6 years ago

CrazyB ▴ 280

I tried but failed to find any guidelines from the TCGA consortium on how to compare the RNA data it generated. Could anyone point me in the right direction? Thanks.

The intent here is to compare the expression of gene X (or genes X1, X2, X3, etc.) across multiple sample sets (e.g. lung vs. breast vs. brain tumors), not to perform differential gene expression analysis within each sample set and then compare the results across sample sets. (I am also not sure whether these two intents call for different methods or units of RNA expression.)

As I understand it, we should probably normalize the TCGA RNA expression data with TMM or an equivalent method before comparing expression across samples. However, I could not find any formal documentation or publication on this.

Efforts and background on my part

(a) I am not trained in statistics or bioinformatics.

(b) I've read the info in

https://haroldpimentel.wordpress.com/ and https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html

(c) Although TCGA provides documentation on RNA-seq analysis (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#examples), it stops at FPKM or FPKM-UQ for RNA expression normalization. (What is the value of knowing FPKM if you cannot compare it among samples?)

(d) Several TCGA publications I read did not provide details on the RNA-seq analyses in their materials and methods. To my surprise, the ones with some details cite references from the microarray era and do not discuss FPKM/RPKM/TPM/TMM-related issues.

RNA-Seq • updated 4.6 years ago by dsull ★ 7.2k • written 4.6 years ago by CrazyB ▴ 280

score 1 · Answer 1 · 2020-07-14

I don't have a definitive answer but can offer my personal perspective:

Unfortunately, TCGA doesn't provide best practices for analyzing the data, and I haven't seen it analyzed in any consistent way.

You can use the raw counts and do a between-sample sequencing-depth normalization (like what you described in your post). But even then, TCGA data is messy: there are batch effects, not to mention a ton of other covariates. You can try to adjust for some of these (e.g. after normalization, correct batch effects using ComBat), but that will take a large amount of effort.
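To make "between-sample normalization" a bit more concrete, here is a minimal sketch in Python. Note that it is not TMM (that lives in the R package edgeR); it uses the simpler median-of-ratios idea as a stand-in, and it does nothing about batch effects. The function names and the toy count matrix are made up for illustration, and I am assuming a genes x samples matrix of raw counts.

```python
import numpy as np


def median_of_ratios_size_factors(counts):
    """Per-sample size factors from a genes x samples raw-count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Use only genes with a nonzero count in every sample, so the logs are finite.
    expressed = (counts > 0).all(axis=1)
    if not expressed.any():
        raise ValueError("no gene has a nonzero count in every sample")
    log_counts = np.log(counts[expressed])
    # The log geometric mean of each gene across samples acts as a pseudo-reference.
    log_reference = log_counts.mean(axis=1, keepdims=True)
    # Size factor = median (over genes) of the sample-to-reference ratio.
    return np.exp(np.median(log_counts - log_reference, axis=0))


def normalize_counts(counts):
    """Divide each sample (column) by its size factor."""
    counts = np.asarray(counts, dtype=float)
    return counts / median_of_ratios_size_factors(counts)


# Toy data (hypothetical): sample 2 is sample 1 sequenced twice as deeply.
toy_counts = np.array([
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 0],  # unexpressed gene, ignored when estimating size factors
])
toy_normalized = normalize_counts(toy_counts)
print(toy_normalized)
```

After normalization the two toy columns are identical, which is the point: the difference between them was only sequencing depth. On real TCGA counts you would still have the batch and covariate problems on top of this.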

The TCGA pan-cancer project provides a full expression matrix covering all the cancers here: https://gdc.cancer.gov/about-data/publications/pancanatlas (see the file EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv). It is based on the Broad Institute's Firehose pipeline, which does normalization and some degree of batch effect correction. A lot of papers (including those in top journals) use the expression data directly from here. However, I cannot say that this pipeline is a "best practices" approach to RNA-seq analysis.
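If you go the pan-cancer matrix route, comparing gene X across tumor types could look roughly like the sketch below. Everything here is an assumption on my part rather than anything TCGA documents: I assume a genes x samples table, a separate sample-to-cancer-type mapping that you build yourself, and a log2(x + 1) transform. The function name, gene label, and toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def gene_by_group(expr, sample_to_group, gene):
    """Median log2(x + 1) expression and sample count for one gene, per group."""
    # Assumption: treat any negative adjusted values as zero before the log.
    values = np.log2(expr.loc[gene].astype(float).clip(lower=0) + 1.0)
    # Align the group labels to the sample order of the expression row.
    groups = sample_to_group.reindex(values.index)
    summary = values.groupby(groups).agg(["median", "count"])
    return summary.sort_values("median", ascending=False)


# For the real file you would load something like (path is hypothetical):
#   expr = pd.read_csv("path/to/matrix.tsv", sep="\t", index_col=0)
# Toy stand-in: rows are genes, columns are samples.
toy_expr = pd.DataFrame(
    {"S1": [3.0, 7.0], "S2": [15.0, 1.0], "S3": [0.0, 3.0], "S4": [1.0, 0.0]},
    index=["GENE_X", "GENE_Y"],
)
# Hypothetical mapping from sample ID to tumor type.
toy_groups = pd.Series({"S1": "LUAD", "S2": "LUAD", "S3": "BRCA", "S4": "BRCA"})

toy_summary = gene_by_group(toy_expr, toy_groups, "GENE_X")
print(toy_summary)
```

Keeping the sample count next to the median is deliberate, since tumor types in TCGA differ a lot in cohort size. A per-group summary like this is only as good as the normalization and batch correction behind the matrix.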

(As a biologist, I like doing well-controlled, unconfounded experiments on model organisms rather than trying to make sense of thousands of samples' worth of messy data.)