Hello everybody. I am currently running some analyses on TCGA RNA-seq data using the R/Bioconductor package "TCGAbiolinks", and I have a simple question about the types of data I can download. For the Illumina RNA-seq strategy there are three: 1) HTSeq - Counts, 2) FPKM, 3) FPKM-UQ.
Now, HTSeq - Counts should be the raw counts, which I can normalize with functions from this package or from other packages, while FPKM and FPKM-UQ should be counts that are already normalized with those two methods.
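To make the difference between the three data types concrete, here is a minimal base R sketch of how FPKM and FPKM-UQ values can be derived from raw counts. The count matrix, the gene lengths and the function names `fpkm` and `fpkm_uq` are made up for illustration, and the formulas follow my understanding of the GDC definitions (scaling by the total count or by the 75th percentile of counts per sample), so treat it as an approximation rather than the exact pipeline.

```r
# Toy raw counts: 4 genes (rows) x 2 samples (columns); all values are made up.
counts <- matrix(c(10, 0, 200, 40,
                   20, 5, 100, 80),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2")))

# Hypothetical gene lengths in base pairs.
gene_length <- c(gene1 = 1000, gene2 = 500, gene3 = 2000, gene4 = 4000)

# FPKM: counts scaled by gene length and by the total count of each sample.
# (Assumption: every toy gene counts toward the per-sample total.)
fpkm <- function(counts, len) {
  t(t(counts * 1e9 / len) / colSums(counts))
}

# FPKM-UQ: same scaling, but the per-sample factor is the 75th percentile
# of the counts instead of their sum.
fpkm_uq <- function(counts, len) {
  uq <- apply(counts, 2, quantile, probs = 0.75)
  t(t(counts * 1e9 / len) / uq)
}

fpkm_vals <- fpkm(counts, gene_length)
fpkm_uq_vals <- fpkm_uq(counts, gene_length)

# Both results have one row per input gene: this kind of scaling never drops genes.
dim(fpkm_vals)
dim(fpkm_uq_vals)
```

Because both transformations only rescale each value, the output always has as many genes as the input, which is why the downloaded FPKM tables still contain all of the original genes.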
My question is this: when I start the analysis with HTSeq - Counts and do the normalization/filtering myself, at the end of all the steps I am left with only about 1/3 of the genes I had at the beginning (roughly 56,000 at the start, 17,000 at the end). Conversely, if I download the FPKM or FPKM-UQ data, I get already normalized values for the same 56,000 genes, which I only need to filter for low counts etc., ending up with roughly 40,000 genes. So: is it correct to download the already normalized data and proceed with just the filtering, so as to keep more genes throughout the analysis? Or is it in any case better to start from the raw counts and normalize them myself (but losing a lot of genes)? Or am I doing something wrong?
Here is some example code with HTSeq - Counts:
query.exp <- GDCquery(project = TCGAprj,
                      legacy = FALSE,
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts",
                      barcode = brcds.exp.filt,
                      sample.type = c("Solid Tissue Normal", "Primary Tumor"))
# Download the data of the query:
GDCdownload(query.exp)
# Prepare data:
COAD.exp <- GDCprepare(query.exp,
                       save = TRUE,
                       summarizedExperiment = TRUE,
                       save.filename = "COADexp.rda")
dim(COAD.exp)
[1] 56424 519
# 56424 genes
# normalization for GC content
# (dataPrep is the count matrix obtained from COAD.exp in a preprocessing step not shown here)
dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfoHT,
                                      method = "gcContent")
dim(dataNorm)
[1] 23166 519
# 23166 genes
# quantile filter of genes
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile",
                                  qnt.cut = 0.25)
dim(dataFilt)
[1] 17374 519
# 17374 genes
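To illustrate where genes can disappear in the two steps above, here is a small base R sketch on toy data. It rests on two assumptions about the package internals that I have not verified line by line: that the normalization only keeps genes that are also present in the annotation table (`geneInfoHT` in my code), and that the quantile filter keeps genes whose mean across samples lies above the chosen quantile. The objects `gene_info` and `keep_above_quantile` are hypothetical stand-ins, not TCGAbiolinks functions.

```r
# Toy count matrix: 100 genes x 6 samples with distinct row means (made-up values).
counts <- outer(1:100, 1:6)
dimnames(counts) <- list(paste0("gene", 1:100), paste0("s", 1:6))

# Hypothetical annotation table covering only 40 of the 100 genes,
# standing in for an annotation object such as geneInfoHT.
gene_info <- data.frame(gcContent = seq(0.3, 0.6, length.out = 40),
                        row.names = paste0("gene", seq(2, 80, by = 2)))

# Step 1 (assumed behaviour): only genes found in both the matrix
# and the annotation table survive.
common <- intersect(rownames(counts), rownames(gene_info))
counts_annot <- counts[common, , drop = FALSE]

# Step 2 (assumed behaviour): keep genes whose mean is above the qnt.cut quantile.
keep_above_quantile <- function(mat, qnt.cut = 0.25) {
  gene_means <- rowMeans(mat)
  mat[gene_means > quantile(gene_means, qnt.cut), , drop = FALSE]
}
counts_filt <- keep_above_quantile(counts_annot, qnt.cut = 0.25)

nrow(counts)        # 100 genes at the start
nrow(counts_annot)  # 40 genes after the annotation intersect
nrow(counts_filt)   # 30 genes after the quantile filter
```

In my real data the pattern looks similar: 17374 is about 75% of 23166, which is what a 0.25 quantile cut would give, so most of the loss seems to happen in the normalization step rather than in the filtering.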
If needed, I can also paste some code for the FPKM analysis. Thank you in advance for your answers!
Thank you for your kind answer Kevin! I will try different parameters/functions for normalization and see what I get from raw counts. Have a nice day! :-)