Question

TCGAbiolinks: which normalization before differential expression analysis (legacy=TRUE vs. legacy=FALSE)

3.2 years ago

erica.fary ▴ 20

Dear All,

I am following the TCGAbiolinks tutorial for conducting differential expression analysis on TCGA data ("TCGAanalyze: Analyze data from TCGA" section). I have 2 questions about it.

1) I don't understand the following: when dealing with legacy=TRUE data (platform = "Illumina HiSeq", file.type = "results"), the tutorial normalizes to correct for gene length (TCGAanalyze_Normalization with default parameters), but when dealing with legacy=FALSE data (workflow.type = "HTSeq - Counts"), it normalizes to correct for GC content (TCGAanalyze_Normalization with method = "gcContent"). What is the reason for this difference?

2) If I want to use the TCGAanalyze_DEA function with pipeline = "limma", should I use the same normalization methods as for pipeline = "edgeR"? If not, which one should I use for the legacy=FALSE and legacy=TRUE data, respectively?

I hope you can help. Thanks in advance!

Erica

TCGAbiolinks limma TCGA RNA-seq normalization • 1.6k views

updated 3.2 years ago by fracarb8 ★ 1.7k • written 3.2 years ago by erica.fary ▴ 20

score 0 · Answer 1 · 2021-09-03

If you look at the Query tab of the documentation, it says:

There are two available sources to download GDC data using TCGAbiolinks: 
- GDC Legacy Archive : provides access to an unmodified copy of data that was previously stored in CGHub and in 
  the TCGA Data Portal hosted by the TCGA Data Coordinating Center (DCC), in which uses as references GRCh37 (hg19) and GRCh36 (hg18). 
- GDC harmonized database: data available was harmonized against GRCh38 (hg38) using GDC Bioinformatics Pipelines 
  which provides methods to the standardization of biospecimen and clinical data.

That means that legacy refers to the data as it was provided to them, which is not harmonized (harmonized meaning, e.g., that everything is normalised and scaled to be comparable between projects). You need to check the documentation (or look in the portal) for the protocols they used, so that you know which data are raw and which have already been normalised or scaled.

The best approach would be to download the raw counts using GDCquery. Based on this post, it should be possible.
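
A minimal sketch of what that download could look like, assuming the harmonized (legacy=FALSE) data and the workflow.type value quoted in the question. The project "TCGA-BRCA" and the helper name fetch_raw_counts are only illustrative choices of mine, and the accepted argument values may differ between TCGAbiolinks versions, so check them against the version you have installed.

```r
# Sketch only: the helper name and the default project are hypothetical.
# The argument values mirror the ones quoted in the question
# (workflow.type = "HTSeq - Counts") and may need adjusting for
# other TCGAbiolinks releases.
fetch_raw_counts <- function(project = "TCGA-BRCA") {
  if (!requireNamespace("TCGAbiolinks", quietly = TRUE)) {
    stop("TCGAbiolinks is required (install it from Bioconductor).")
  }

  # Describe the files to fetch: gene-level raw counts from the
  # harmonized GDC database.
  query <- TCGAbiolinks::GDCquery(
    project       = project,
    data.category = "Transcriptome Profiling",
    data.type     = "Gene Expression Quantification",
    workflow.type = "HTSeq - Counts"
  )

  # Download the files, then load them into a SummarizedExperiment.
  TCGAbiolinks::GDCdownload(query)
  se <- TCGAbiolinks::GDCprepare(query)

  # Return the genes x samples count matrix (first assay).
  SummarizedExperiment::assay(se)
}

# Usage (needs network access and can take a while):
# counts <- fetch_raw_counts("TCGA-BRCA")
# dim(counts)
```

Starting from raw counts like this means you know exactly what has (and has not) been normalised before you pass the matrix on to TCGAanalyze_Normalization and TCGAanalyze_DEA.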