Merging my own RNA-seq data with TCGA UCSC Xena data sets, and proper normalisation
29 days ago
theodore ▴ 90

Hello all, and thank you in advance for your input.

I was reading this publication where they used TCGA data from the TCGA consortium, and I would like to merge into that data set RNA-seq data from patients (n=5) with another type of cancer. One data set has featureCounts counts and the other one RSEM values, as in the TCGA. My questions are:

  1. I assume that using a different counting program will not affect the data very dramatically. There are publications that claim one thing or the other, so I am unsure.

  2. Regarding normalisation, I assume that before merging my own data set I would need to do something like this:

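    library(DGEobj.utils)  # CRAN package that provides convertCounts()
    featureCounts_table <- read.table(paste0(dir.in, "geneCounts.txt"), header = TRUE, sep = "\t", skip = 1, row.names = "Geneid")
    gene.length <- featureCounts_table$Length
    # drop the annotation columns (Chr, Start, End, Strand, Length) and keep the per-sample counts
    my_countMatrix <- as.matrix(featureCounts_table[, -(1:5)])
    my_tpm <- convertCounts(my_countMatrix + 1, "TPM", gene.length, log = TRUE)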

  3. How does TCGA Xena handle batch effects? Is it relevant if I only use the data for the GSVA package (single-sample GSEA approach)?
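
For reference, the TPM conversion in question 2 can also be written out in base R, which makes each step explicit. This is only a minimal sketch on a made-up 3-gene, 2-sample matrix; `counts`, `len` and `counts_to_tpm` are hypothetical names, and the pseudocount is added after scaling rather than to the raw counts.

```r
# Toy featureCounts-style input: 3 genes x 2 samples (hypothetical values)
counts <- matrix(c(10, 20, 30,
                   0, 50, 50),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))
len <- c(geneA = 1000, geneB = 2000, geneC = 500)  # gene lengths in bp

counts_to_tpm <- function(counts, len) {
  rpk <- counts / (len / 1000)      # reads per kilobase (each row divided by its gene length)
  t(t(rpk) / colSums(rpk)) * 1e6    # rescale every sample so it sums to one million
}

tpm <- counts_to_tpm(counts, len)
log_tpm <- log2(tpm + 1)            # pseudocount added after scaling
```

Every column of `tpm` sums to one million, so values are comparable between samples only as relative abundances within this gene set.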

This is how I am getting the data from TCGA Xena:

    library(UCSCXenaTools)  # XenaGenerate(), XenaFilter(), XenaQuery(), XenaDownload()
    library(magrittr)       # provides the %>% pipe
    # restrict the Xena data hub list to the TCGA hub
    xe <- XenaGenerate(subset = XenaHostNames == "tcgaHub")
    xe %>% XenaFilter(filterDatasets = "clinical") -> xe_clinical
    xe %>% XenaFilter(filterDatasets = "HiSeqV2_PANCAN$") -> xe_rna_pancan
    # clinical tables
    xe_clinical.query <- XenaQuery(xe_clinical)
    xe_clinical.download <- XenaDownload(xe_clinical.query,  destdir = "UCSC_Xena/TCGA/Clinical", trans_slash = TRUE, force = TRUE)
    # pan-cancer normalised RNA-seq matrices
    xe_rna_pancan.query <- XenaQuery(xe_rna_pancan)
    xe_rna_pancan.download <- XenaDownload(xe_rna_pancan.query,  destdir = "UCSC_Xena/TCGA/RNAseq_Pancan", trans_slash = TRUE)

As a note, I found the following statement: "For the PANCAN gene expression dataset, we combined all the data from all the TCGA cohorts. This data is mean normalized across the entire cohort in the visualization (https://genome-cancer.ucsc.edu/proj/site/help/#Normalize_columns). Each individual cohort was Level_3 Data (file names: *.rsem.genes.normalized_results), all of which were downloaded from TCGA DCC and then were log2(x+1) transformed."
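
Taken at face value, that description suggests the Xena values are log2(x+1) of normalised RSEM counts rather than log TPM. The sketch below shows one rough way to put a count matrix on a similar scale and to merge on shared genes. The upper-quartile scaling to 1000 is an assumption about how the `normalized_results` files were produced and should be checked against the TCGA pipeline documentation; `uq_log2` and all values are hypothetical.

```r
# Hypothetical helper: upper-quartile scaling followed by log2(x + 1).
# ASSUMPTION: the 75th percentile is taken over genes with non-zero counts
# and rescaled to 1000; verify this against the TCGA pipeline description.
uq_log2 <- function(counts, target = 1000) {
  uq <- apply(counts, 2, function(x) quantile(x[x > 0], 0.75))
  log2(t(t(counts) / uq) * target + 1)
}

# Toy matrices that share only some genes (made-up values)
mine <- matrix(c(0, 4, 8, 12, 16,
                 0, 2, 4, 6, 8), nrow = 5,
               dimnames = list(paste0("g", 1:5), c("p1", "p2")))
tcga_like <- matrix(c(9.1, 8.7, 10.2, 7.5), nrow = 2,
                    dimnames = list(c("g2", "g4"), c("TCGA-01", "TCGA-02")))

mine_log <- uq_log2(mine)

# Keep only genes present in both data sets before binding the columns
shared <- intersect(rownames(mine_log), rownames(tcga_like))
merged <- cbind(tcga_like[shared, , drop = FALSE],
                mine_log[shared, , drop = FALSE])
```

Matching the transformation does not remove batch effects between the two sources; it only puts the values in the same units.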

Xena TCGA normalisation RNA-seq • 601 views

You can analyze your data independently and then do a meta-analysis using both (TCGA and yours). At that point you can look at pathway overlap, etc. As long as your metadata is consistent, you can draw some useful conclusions.


Thank you for your answer. Just to clarify: do you recommend looping over the gene expression matrix per sample, passing each one to GSVA independently, and then merging the data/the GSVA gene set scores to start doing some meta-analysis?
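
A minimal sketch of that workflow: each data set is scored on its own and only the gene set scores are combined afterwards. A simple mean-rank score stands in for GSVA here (it is not the GSVA algorithm), and all object names and values are hypothetical.

```r
# Stand-in scorer (NOT GSVA): mean within-sample rank of the gene set members,
# scaled to (0, 1] so that scores do not depend on the expression units.
rank_score <- function(expr, gene_set) {
  genes <- intersect(gene_set, rownames(expr))
  ranks <- apply(expr, 2, rank) / nrow(expr)
  dimnames(ranks) <- dimnames(expr)
  colMeans(ranks[genes, , drop = FALSE])
}

set.seed(1)
genes <- paste0("g", 1:20)
tcga <- matrix(rnorm(60, mean = 8), nrow = 20,
               dimnames = list(genes, paste0("T", 1:3)))
mine <- matrix(rnorm(40, mean = 3), nrow = 20,
               dimnames = list(genes, paste0("P", 1:2)))
pathway <- c("g1", "g5", "g9")  # hypothetical gene set

# Score each data set independently, then stack the scores with a source label
scores <- rbind(
  data.frame(sample = colnames(tcga), source = "TCGA",
             score = rank_score(tcga, pathway)),
  data.frame(sample = colnames(mine), source = "own",
             score = rank_score(mine, pathway))
)
```

Because the ranks are computed within each sample, the two matrices never need to share a normalisation; only the resulting scores are compared.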

11 days ago
Zhenyu Zhang ★ 1.2k

For a careful analysis, how about you download the data from the GDC and run your samples through the GDC RNA-Seq pipeline? Then you can use a standard DESeq2/edgeR/limma analysis and control for covariates.
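
Before fitting such a model, it may help to check whether the covariates can be estimated at all. The sketch below uses base R only, with a hypothetical sample sheet. If every sample of the new cancer type also comes from the new batch, the batch term cannot be separated from the cancer type, whichever package fits the model.

```r
# Hypothetical sample sheet: 6 TCGA samples plus 5 own samples of another cancer type
meta <- data.frame(
  cancer = factor(c(rep("TCGA_type", 6), rep("other_type", 5))),
  batch  = factor(c(rep("TCGA", 6), rep("own", 5))),
  sex    = factor(c("F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F"))
)

# A design matrix is usable only if its columns are linearly independent
is_full_rank <- function(formula, data) {
  design <- model.matrix(formula, data)
  qr(design)$rank == ncol(design)
}

ok_without_batch <- is_full_rank(~ cancer + sex, meta)          # sex varies within both groups
ok_with_batch    <- is_full_rank(~ cancer + batch + sex, meta)  # batch duplicates cancer type
```

In this toy sheet `batch` and `cancer` label exactly the same split of samples, so the second design is rank deficient.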


The whole RNA-seq pipeline, from FASTQ to counts, for all those thousands of samples?


I don't know how many samples you have. TCGA samples are already called, so that you only need to call your own samples.


OK, that makes sense. I will try that. Thank you.
