Question

Question regarding TCGA and TCGAbiolinks --> RNAseq data as HTseq-Counts, FPKM, FPKM-UQ

4.1 years ago

mario.red8976 ▴ 140

Hello everybody. I am currently analysing TCGA RNAseq data with the R/Bioconductor package "TCGAbiolinks". I have a simple question about the types of data I can download. For the Illumina RNAseq strategy there are three: 1) HTseq - Counts, 2) FPKM, 3) FPKM-UQ.

As I understand it, HTseq - Counts are the raw counts, which I can normalize with functions from this package or from other packages, while FPKM and FPKM-UQ are counts that have already been normalized with those two methods.
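To make the difference between the three data types concrete, here is a minimal base-R sketch of how FPKM and FPKM-UQ relate to raw counts. It follows the GDC definitions as I understand them, simplified: the real pipeline takes the library size and upper quartile over protein-coding genes only, whereas this toy example uses every gene in the matrix. The counts and gene lengths are made up for illustration.

```r
# Toy raw counts: 4 genes (rows) x 2 samples (columns); values are invented.
# Sample s2 is simply s1 sequenced twice as deeply.
counts <- matrix(c(100, 200,
                   400, 800,
                    50, 100,
                     0,   0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2")))

# Hypothetical gene lengths in base pairs (one per row of counts).
gene_length <- c(gene1 = 1000, gene2 = 2000, gene3 = 500, gene4 = 1500)

# Length-scaled counts; the vector recycles down the rows, one length per gene.
per_kb <- counts * 1e9 / gene_length

# FPKM: divide by each sample's total count (simplification: all genes).
lib_size <- colSums(counts)
fpkm <- sweep(per_kb, 2, lib_size, "/")

# FPKM-UQ: divide by each sample's 75th-percentile count instead.
uq <- apply(counts, 2, quantile, probs = 0.75)
fpkm_uq <- sweep(per_kb, 2, uq, "/")

print(round(fpkm))
print(round(fpkm_uq))
```

Because s2 is exactly s1 at double depth, both normalized matrices have identical columns, which is the point of these units: they remove depth and length effects, but they discard the integer counts that count-based models expect.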

My question is this: when I start from HTseq - Counts and do the normalization and filtering myself, I end up with only about a third of the genes I started with (roughly 56,000 at the start, 17,000 at the end). If I instead download FPKM or FPKM-UQ, I get already normalized data for the same 56,000 genes, which I only need to filter for low values, leaving roughly 40,000 genes. So, is it correct to download the already normalized data and only filter it, so that I keep more genes throughout the analysis? Or is it better in any case to start from the raw counts and normalize them myself, even though I lose a lot of genes? Or am I doing something wrong?

Here is some example code with HTseq - Counts:

query.exp <- GDCquery(project = TCGAprj,
                      legacy = FALSE,
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts",                  
                      barcode = brcds.exp.filt,
                      sample.type = c("Solid Tissue Normal", "Primary Tumor"))

# Download the data of the query:
GDCdownload(query.exp)

# Prepare data:
COAD.exp <- GDCprepare(query.exp, 
                       save = TRUE, 
                       summarizedExperiment = TRUE, 
                       save.filename = "COADexp.rda")
dim(COAD.exp)
[1] 56424   519

# 56424 genes


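# dataPrep is assumed to be the count matrix returned by TCGAanalyze_Preprocessing()
# on the prepared object; the cor.cut value below is an assumption.
dataPrep <- TCGAanalyze_Preprocessing(object = COAD.exp, cor.cut = 0.6)

# normalization for GC content
dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfoHT,
                                      method = "gcContent")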

dim(dataNorm)
[1] 23166   519

# 23166 genes


# quantile filter of genes
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)

dim(dataFilt)
[1] 17374   519

# 17374 genes
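The last step is easier to reason about with a small stand-alone example. The sketch below is a base-R approximation of what a quantile filter on gene means does; it is my reading of the idea, not the package's actual code. The matrix is simulated and `qnt_cut` is a hypothetical name mirroring `qnt.cut`.

```r
set.seed(1)

# Simulated normalized matrix: 1000 genes x 6 samples (invented values).
mat <- matrix(rexp(6000, rate = 0.01), nrow = 1000,
              dimnames = list(paste0("gene", 1:1000), paste0("s", 1:6)))

qnt_cut <- 0.25

# Mean expression per gene, and the value at the chosen quantile of those means.
gene_means <- rowMeans(mat)
threshold <- quantile(gene_means, probs = qnt_cut)

# Keep only genes whose mean lies above that threshold.
mat_filt <- mat[gene_means > threshold, , drop = FALSE]

print(dim(mat_filt))
print(nrow(mat_filt) / nrow(mat))
```

With a cut of 0.25 the filter removes the lowest quarter of genes by construction, whatever their absolute expression. That matches the numbers above: 17374 is about three quarters of 23166, so this step always drops 25% of whatever it is given.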

If needed, I can also paste some code for the FPKM analysis. Thank you in advance for your answers!

RNAseq FPKM-UQ FPKM TCGAbiolinks TCGA • 2.9k views

score 1 · Answer 1 · 2021-03-22

Between HTseq raw counts, FPKM, and FPKM-UQ, I would definitely obtain the HTseq raw counts. There is not much utility in using FPKM or FPKM-UQ other than for filtering out genes, i.e., using FPKM units should be reserved for QC purposes.

With the HTseq raw counts, you can either proceed with the TCGAbiolinks functions for normalisation, modifying their parameters if you feel that the filters are too harsh, or you can extract the raw counts and perform normalisation with a third-party package such as edgeR or DESeq2.
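As a rough illustration of what count-based normalisation does, here is a base-R sketch of median-of-ratios size factors, the idea behind DESeq2's size factor estimation. It is a simplified re-implementation on invented counts, not the package code; in real use one would call the package functions (for example `DESeq2::estimateSizeFactors` or `edgeR::calcNormFactors`) on the extracted count matrix.

```r
# Toy raw counts: 5 genes x 3 samples (invented).
# Sample s2 is s1 sequenced exactly twice as deeply.
counts <- matrix(c( 10,  20, 10,
                   100, 200, 90,
                    50, 100, 60,
                     0,   0,  5,
                    30,  60, 30),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:5), c("s1", "s2", "s3")))

# Log geometric mean per gene; any zero count makes it -Inf, excluding the gene.
log_geo_mean <- rowMeans(log(counts))

# Size factor per sample: median ratio of its counts to the per-gene
# geometric mean, taken over genes with a finite geometric mean.
size_factor <- apply(counts, 2, function(x) {
  usable <- is.finite(log_geo_mean) & x > 0
  exp(median(log(x[usable]) - log_geo_mean[usable]))
})

# Divide each sample (column) by its size factor.
norm_counts <- sweep(counts, 2, size_factor, "/")

print(size_factor)
print(round(norm_counts, 1))
```

The size factor for s2 comes out at twice that for s1, so the two samples agree after scaling. Unlike FPKM, this keeps the data on a count-like scale and needs no gene lengths.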

By the way, ~17000 genes sounds correct to me, i.e., if one is interested only in protein-coding genes that are expressed above background noise.
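A small sketch of that kind of selection, in base R on invented data: keep genes annotated as protein-coding and expressed above a minimum count in enough samples. The annotation table, the column name `gene_type`, and both thresholds are assumptions for illustration; the real annotation would come from the prepared object or from a resource such as Ensembl.

```r
# Toy raw counts: 4 genes x 2 samples (invented).
counts <- matrix(c(500, 450,
                   300, 280,
                     2,   0,
                   120,  90),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2")))

# Hypothetical annotation table with one row per gene.
annot <- data.frame(
  gene_id   = paste0("gene", 1:4),
  gene_type = c("protein_coding", "lncRNA", "protein_coding", "protein_coding"),
  stringsAsFactors = FALSE
)

# Protein-coding flag, aligned to the row order of the count matrix.
is_pc <- annot$gene_type[match(rownames(counts), annot$gene_id)] == "protein_coding"

# "Above background": at least 10 reads in at least 2 samples (assumed thresholds).
min_count   <- 10
min_samples <- 2
is_expressed <- rowSums(counts >= min_count) >= min_samples

kept <- counts[is_pc & is_expressed, , drop = FALSE]
print(rownames(kept))
```

Here gene2 is dropped for not being protein-coding and gene3 for being essentially unexpressed, leaving gene1 and gene4. Applied to a real 56,000-gene table, the same two criteria remove a large share of genes.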

Kevin