Question regarding TCGA and TCGAbiolinks --> RNAseq data as HTseq-Counts, FPKM, FPKM-UQ
3.8 years ago
mario.red8976 ▴ 130

Hello everybody. I am currently running some analyses on TCGA RNA-seq data using the R/Bioconductor package "TCGAbiolinks". I have a simple question about the types of data that I can download. For the Illumina RNA-seq strategy there are three: 1) HTSeq - Counts, 2) FPKM, 3) FPKM-UQ.

As I understand it, HTSeq - Counts are the raw counts, which I can normalize with functions from this package or from other packages, while FPKM and FPKM-UQ are counts that have already been normalized with those two methods.

My question is this. When I start from HTSeq - Counts and do the normalization and filtering myself, I end up with only about 1/3 of the genes I started with (roughly 56,000 at the start, 17,000 at the end). If I instead download FPKM or FPKM-UQ, I get already normalized data for the same 56,000 genes, which I only need to filter for low values, leaving roughly 40,000 genes. So: is it correct to download the already normalized data and only apply the filtering, in order to keep more genes throughout the analysis? Or is it in any case better to start from the raw counts and normalize them myself, at the cost of losing a lot of genes? Or am I doing something wrong?

Here is some example code with HTSeq - Counts:

# TCGAprj (project ID) and brcds.exp.filt (sample barcodes) are defined earlier in my script
query.exp <- GDCquery(project = TCGAprj,
                      legacy = FALSE,
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts",
                      barcode = brcds.exp.filt,
                      sample.type = c("Solid Tissue Normal", "Primary Tumor"))

# Download the data of the query:
GDCdownload(query.exp)

# Prepare data:
COAD.exp <- GDCprepare(query.exp, 
                       save = TRUE, 
                       summarizedExperiment = TRUE, 
                       save.filename = "COADexp.rda")
dim(COAD.exp)
# [1] 56424   519   (56424 genes)


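# normalization for GC content
# NOTE: dataPrep is not defined in this snippet; presumably it is the count
# matrix derived from COAD.exp in a step that is not shown here
dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfoHT,
                                      method = "gcContent")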

dim(dataNorm)
# [1] 23166   519   (23166 genes)


# quantile filter of genes
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile",
                                  qnt.cut = 0.25)

dim(dataFilt)
# [1] 17374   519   (17374 genes)

If needed, I can also paste some code for the FPKM analysis. Thank you in advance for your answers!

3.8 years ago

Between HTSeq raw counts, FPKM, and FPKM-UQ, I would definitely obtain the HTSeq raw counts. There is not much utility in FPKM or FPKM-UQ other than for filtering out genes, i.e., FPKM units should be reserved for QC purposes.
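For context, here is a minimal base-R sketch of how the two FPKM variants relate to raw counts. It assumes the formulas as I understand them from the GDC documentation (FPKM divides by the total protein-coding read count, FPKM-UQ by the sample's 75th-percentile count) and, for simplicity, treats every gene in a toy matrix as protein-coding. The matrix, the gene lengths and all variable names are made up for illustration.

```r
# Toy raw count matrix: 4 genes x 2 samples (made-up numbers).
# Sample s2 is simply s1 sequenced at twice the depth.
counts <- matrix(c(10, 200, 0, 50,
                   20, 400, 0, 100),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2")))

# Hypothetical gene lengths in bp, in the same order as the rows of counts
gene_length <- c(gene1 = 1000, gene2 = 2000, gene3 = 500, gene4 = 1000)

# FPKM: divide each count by the sample's total count and by gene length
# (assumption: every gene here counts as protein-coding)
lib_size <- colSums(counts)
fpkm <- sweep(counts, 2, lib_size, "/") / gene_length * 1e9

# FPKM-UQ: same idea, but scale by the sample's 75th-percentile count
upper_quartile <- apply(counts, 2, quantile, probs = 0.75)
fpkm_uq <- sweep(counts, 2, upper_quartile, "/") / gene_length * 1e9

# Both normalised matrices are identical across the two samples,
# although the raw counts differ by a factor of two
all.equal(fpkm[, "s1"], fpkm[, "s2"])
all.equal(fpkm_uq[, "s1"], fpkm_uq[, "s2"])
```

Doubling a sample's depth leaves both values unchanged. That is the point of the normalisation, but it also means the original count scale is gone, which is why count-based tools such as edgeR or DESeq2 expect the raw counts instead.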

With the HTSeq raw counts, you can either proceed with the TCGAbiolinks normalisation functions, adjusting their parameters if you feel that the filters are too harsh, or you can extract the raw counts and normalise them with a third-party package such as edgeR or DESeq2.
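To give a feel for what such a package does with raw counts, here is a minimal base-R sketch of the median-of-ratios idea behind DESeq2's size factors. It is not DESeq2's actual code (in practice one would call the package's own functions), and the toy matrix and variable names are made up for illustration.

```r
# Toy raw count matrix: 5 genes x 2 samples (made-up numbers).
# Apart from gene3, s2 has exactly twice the counts of s1.
counts <- matrix(c(10, 200, 0, 50, 30,
                   20, 400, 5, 100, 60),
                 nrow = 5,
                 dimnames = list(paste0("gene", 1:5), c("s1", "s2")))

# Per-gene log geometric mean across samples; genes with a zero count
# in any sample get -Inf and are excluded from the estimate
log_counts <- log(counts)
log_geo_mean <- rowMeans(log_counts)
usable <- is.finite(log_geo_mean)

# Size factor per sample: median ratio of its counts to the geometric means
size_factors <- apply(log_counts[usable, , drop = FALSE], 2, function(x) {
  exp(median(x - log_geo_mean[usable]))
})

# Normalised counts: raw counts divided by the sample's size factor
norm_counts <- sweep(counts, 2, size_factors, "/")

size_factors
norm_counts
```

Note that no gene is removed by this step: all five rows of the toy matrix are still present in `norm_counts`, and only the estimation of the size factors skips the gene with a zero count.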

By the way, ~17,000 genes sounds correct to me, i.e., if one is interested only in protein-coding genes that are expressed above background noise.
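The last drop in the question, from 23,166 to 17,374 genes, is also what a 0.25 quantile cut would be expected to produce on its own, since 23,166 × 0.75 ≈ 17,374. Below is a minimal base-R sketch of such a filter, assuming it removes genes whose mean across samples is at or below the chosen quantile of all gene means (my reading of the "quantile" method; check the package source to confirm). The helper `quantile_filter` and the simulated matrix are hypothetical.

```r
set.seed(1)

# Simulated expression matrix: 1000 genes x 6 samples (made-up values)
expr <- matrix(rexp(6000, rate = 0.01),
               nrow = 1000,
               dimnames = list(paste0("gene", 1:1000), paste0("s", 1:6)))

# Hypothetical helper: keep genes whose mean expression is above the
# qnt_cut quantile of all gene means
quantile_filter <- function(mat, qnt_cut = 0.25) {
  gene_means <- rowMeans(mat)
  threshold <- quantile(gene_means, probs = qnt_cut)
  mat[gene_means > threshold, , drop = FALSE]
}

# A 0.25 cut removes the lowest quarter of genes; a 0.10 cut is milder
nrow(quantile_filter(expr, qnt_cut = 0.25))
nrow(quantile_filter(expr, qnt_cut = 0.10))
```

Under this assumption the fraction of genes removed is set by the cut-off itself rather than by the data, so lowering `qnt.cut` is the most direct way to make the filter less harsh.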

Kevin

Thank you for your kind answer Kevin! I will try different parameters/functions for normalization and see what I get from raw counts. Have a nice day! :-)
