Hello all, and thank you in advance for your input.
I was reading a publication that used TCGA data from the TCGA consortium, and I would like to merge into that dataset RNA-seq data from patients (n = 5) with another type of cancer. My dataset has featureCounts counts, while the other has RSEM estimates, as in TCGA. My questions are:
- I assume that using a different quantification program will not affect the data very dramatically. However, publications disagree on this point, so I am unsure.
- Regarding normalisation, I assume that before merging my own dataset I would need to do something like this:
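library(DGEobj.utils)  # CRAN package that provides convertCounts()
featureCounts_table <- read.table(paste0(dir.in, "geneCounts.txt"), header = TRUE, sep = "\t", skip = 1, row.names = "Geneid")
gene.length <- featureCounts_table$Length
# assumption: the sample count columns follow the five annotation columns (Chr, Start, End, Strand, Length)
my_countMatrix <- as.matrix(featureCounts_table[, -(1:5)])
convertCounts(my_countMatrix + 1, "TPM", gene.length, log = TRUE)  # log = TRUE returns log2 values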
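To make the conversion above concrete, here is a minimal base R sketch of what a counts-to-TPM step does, without any package. The toy counts, the gene lengths, and the helper name `counts_to_tpm` are made up for illustration; this is not the `convertCounts` implementation, and it adds the pseudocount after the conversion rather than before.

```r
# Toy counts: 3 genes x 2 samples (made-up numbers, for illustration only)
counts <- matrix(c(10, 20, 30,
                   5, 5, 10),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))
gene_length_bp <- c(geneA = 1000, geneB = 2000, geneC = 3000)

# Hypothetical helper: counts (genes x samples) to TPM
counts_to_tpm <- function(counts, gene_length_bp) {
  # Reads per kilobase: divide each gene's row by its length in kb
  rpk <- counts / (gene_length_bp / 1000)
  # Rescale each sample (column) so that it sums to one million
  sweep(rpk, 2, colSums(rpk), "/") * 1e6
}

tpm <- counts_to_tpm(counts, gene_length_bp)
log_tpm <- log2(tpm + 1)  # log2(x + 1) scale
colSums(tpm)              # every sample sums to 1e6 by construction
```

Because TPM is rescaled within each sample, it can be computed for the five new samples without touching the TCGA matrix.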
- How does TCGA Xena handle batch effects? Is that relevant if I only use the data with the GSVA package (single-sample GSEA approach)?
This is how I am getting the data from the Xena TCGA hub:
library(UCSCXenaTools)  # XenaGenerate(), XenaFilter(), XenaQuery(), XenaDownload()
xe <- XenaGenerate(subset = XenaHostNames == "tcgaHub")
xe %>% XenaFilter(filterDatasets = "clinical") -> xe_clinical
xe %>% XenaFilter(filterDatasets = "HiSeqV2_PANCAN$") -> xe_rna_pancan
xe_clinical.query <- XenaQuery(xe_clinical)
xe_clinical.download <- XenaDownload(xe_clinical.query, destdir = "UCSC_Xena/TCGA/Clinical", trans_slash = TRUE, force = TRUE)
xe_rna_pancan.query <- XenaQuery(xe_rna_pancan)
xe_rna_pancan.download <- XenaDownload(xe_rna_pancan.query, destdir = "UCSC_Xena/TCGA/RNAseq_Pancan", trans_slash = TRUE)
As a note, I found the following statement: "For the PANCAN gene expression dataset, we combined all the data from all the TCGA cohorts. This data is mean normalized across the entire cohort in the visualization (https://genome-cancer.ucsc.edu/proj/site/help/#Normalize_columns). Each individual cohort was Level_3 Data (file names: *.rsem.genes.normalized_results), all of which were downloaded from TCGA DCC and then were log2(x+1) transformed."
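As a small illustration of the two steps named in that statement, the base R sketch below applies log2(x+1) and then mean-centers each gene. It assumes that "mean normalized" means subtracting each gene's mean across all samples, which is my reading of the statement and not something it spells out. The input values are made up.

```r
# Toy normalized expression values: 2 genes x 3 samples (made-up numbers)
norm_values <- matrix(c(0, 3, 15,
                        7, 7, 7),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

log_expr <- log2(norm_values + 1)          # log2(x + 1) transform
centered <- log_expr - rowMeans(log_expr)  # subtract each gene's mean across samples
centered
```

Note that the centering depends on which samples are in the cohort, so under this reading the centered values would change if new samples were added before centering.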
You can analyze your data independently and then do a meta-analysis using both (TCGA and yours). At that point you can look at pathway overlap, etc. As long as your metadata is consistent, you can draw some useful conclusions.
Thank you for your answer. Just to clarify: you recommend looping over the gene expression matrix sample by sample, passing each one to GSVA independently, and then merging the GSVA gene set scores to start the meta-analysis?
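The workflow being discussed (score each dataset on its own, then combine the gene set scores) could be sketched in base R as follows. The function `score_sets` is a hypothetical stand-in that averages within-sample gene ranks; it is not the GSVA or ssGSEA algorithm and only marks the place where a real scorer would be called. The expression matrices and gene sets are simulated.

```r
# Hypothetical stand-in for a single-sample scorer such as GSVA/ssGSEA:
# mean within-sample rank of the set's genes, scaled to (0, 1].
score_sets <- function(expr, gene_sets) {
  ranks <- apply(expr, 2, rank) / nrow(expr)  # rank genes within each sample
  do.call(rbind, lapply(gene_sets, function(genes) {
    colMeans(ranks[intersect(genes, rownames(expr)), , drop = FALSE])
  }))
}

# Simulated data: 6 genes, 4 "TCGA" samples and 5 "own" samples
set.seed(1)
genes <- paste0("g", 1:6)
tcga_expr <- matrix(rnorm(6 * 4), nrow = 6,
                    dimnames = list(genes, paste0("TCGA_", 1:4)))
own_expr <- matrix(rnorm(6 * 5), nrow = 6,
                   dimnames = list(genes, paste0("own_", 1:5)))
gene_sets <- list(setA = c("g1", "g2", "g3"), setB = c("g4", "g5", "g6"))

# Score each dataset independently
tcga_scores <- score_sets(tcga_expr, gene_sets)
own_scores <- score_sets(own_expr, gene_sets)

# Merge on gene set name and keep a cohort label for downstream comparison
shared <- intersect(rownames(tcga_scores), rownames(own_scores))
merged_scores <- cbind(tcga_scores[shared, , drop = FALSE],
                       own_scores[shared, , drop = FALSE])
cohort <- rep(c("TCGA", "own"), times = c(ncol(tcga_scores), ncol(own_scores)))
```

The merged matrix has gene sets as rows and all samples as columns, with `cohort` recording where each column came from.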