Question

Merging my own RNA-seq data with TCGA UCSC Xena datasets, and proper normalisation

4 weeks ago

theodore ▴ 90

Hello all, and thank you in advance for your input.

I was reading this publication, where they used TCGA data from the TCGA consortium, and I would like to merge RNA-seq data from patients (n = 5) with another type of cancer into that dataset. One dataset has featureCounts counts and the other RSEM estimates, as in TCGA. My questions are:

I assume that using a different quantification program will not affect the data very dramatically. There are publications that argue either way, so I am unsure.
Regarding normalisation, I assume that before merging my own dataset I would need to do something like this:

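    library(DGEobj.utils)  # CRAN package that provides convertCounts()
    # featureCounts output: the first line is a comment, so skip it
    featureCounts_table <- read.table(paste0(dir.in, "geneCounts.txt"), header = TRUE, sep = "\t", skip = 1, row.names = "Geneid")
    gene.length <- featureCounts_table$Length
    my_countMatrix <- as.matrix(featureCounts_table[, -(1:5)])  # drop Chr, Start, End, Strand, Length
    log_tpm <- convertCounts(my_countMatrix + 1, unit = "TPM", geneLength = gene.length, log = TRUE)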
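For reference, the TPM conversion can be checked by hand. The following is a minimal base-R sketch of a log2(TPM + 1) transform, using a made-up toy count matrix and gene lengths (all values and the `counts_to_tpm` helper are hypothetical, not part of any package):

```r
# Toy counts: 3 genes x 2 samples (hypothetical values)
counts <- matrix(c(10, 20, 30,
                   0, 50, 50),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))
gene_length <- c(geneA = 1000, geneB = 2000, geneC = 500)  # in bp

counts_to_tpm <- function(counts, gene_length) {
  rate <- counts / (gene_length / 1000)   # reads per kilobase, gene by gene
  t(t(rate) / colSums(rate)) * 1e6        # scale each sample to sum to one million
}

tpm <- counts_to_tpm(counts, gene_length)
log_tpm <- log2(tpm + 1)                  # log2(x + 1), as in the transform quoted for Xena
```

Each column of `tpm` sums to one million, so values are comparable within a sample. That does not by itself make them comparable with values from a different quantification pipeline.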

How does TCGA Xena handle batch effects? Is that relevant if I only use the data with the GSVA package (a single-sample GSEA approach)?

This is how I am getting the data from the Xena TCGA hub:

    library(UCSCXenaTools)  # provides the Xena* functions and re-exports %>%

    # Select the TCGA hub, then filter for the clinical and pan-cancer expression datasets
    xe <- XenaGenerate(subset = XenaHostNames == "tcgaHub")
    xe_clinical <- xe %>% XenaFilter(filterDatasets = "clinical")
    xe_rna_pancan <- xe %>% XenaFilter(filterDatasets = "HiSeqV2_PANCAN$")

    # Query and download each selection
    xe_clinical.query <- XenaQuery(xe_clinical)
    xe_clinical.download <- XenaDownload(xe_clinical.query, destdir = "UCSC_Xena/TCGA/Clinical", trans_slash = TRUE, force = TRUE)
    xe_rna_pancan.query <- XenaQuery(xe_rna_pancan)
    xe_rna_pancan.download <- XenaDownload(xe_rna_pancan.query, destdir = "UCSC_Xena/TCGA/RNAseq_Pancan", trans_slash = TRUE)

As a note, I found the following statement: "For the PANCAN gene expression dataset, we combined all the data from all the TCGA cohorts. This data is mean normalized across the entire cohort in the visualization (https://genome-cancer.ucsc.edu/proj/site/help/#Normalize_columns). Each individual cohort was Level_3 Data (file names: *.rsem.genes.normalized_results), all of which were downloaded from TCGA DCC and then were log2(x+1) transformed."
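Mechanically, the merge would look something like the base-R sketch below. It assumes both matrices are already on a log2(x + 1) scale and use the same gene identifiers (featureCounts output often carries Ensembl IDs, which would first need mapping to the symbols used in the other matrix). The matrices `tcga` and `mine` and all values are hypothetical stand-ins:

```r
# Hypothetical stand-ins: genes in rows, samples in columns, both log2(x + 1)
tcga <- matrix(c(5.1, 0.0, 7.3, 2.2,
                 4.8, 0.4, 6.9, 2.5),
               nrow = 4,
               dimnames = list(c("TP53", "EGFR", "MYC", "KRAS"),
                               c("TCGA-01", "TCGA-02")))
mine <- matrix(c(3.3, 6.1, 1.0),
               nrow = 3,
               dimnames = list(c("MYC", "TP53", "BRCA1"), "P1"))

shared <- intersect(rownames(tcga), rownames(mine))   # genes present in both
merged <- cbind(tcga[shared, , drop = FALSE],
                mine[shared, , drop = FALSE])

# Record where each sample came from, for later batch checks
batch <- factor(c(rep("TCGA", ncol(tcga)), rep("own", ncol(mine))))
```

Indexing both matrices by `shared` aligns the rows by name, so the gene order in the two sources does not matter. Keeping `batch` alongside the merged matrix makes it possible to check afterwards whether samples separate by source rather than by biology.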

Xena TCGA normalisation RNA-seq • 618 views

ADD COMMENT • link 6 days ago by theodore ▴ 90

You can analyze your data independently and then do a meta-analysis using both (TCGA and yours). At that point you can look at pathway overlap, etc. As long as your metadata is consistent, you can draw some useful conclusions.
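As a rough illustration of the overlap step, assuming each analysis has produced a list of significant pathways (the pathway lists here are hypothetical, not real results):

```r
# Hypothetical significant pathways from two independent analyses
sig_tcga <- c("HALLMARK_HYPOXIA", "HALLMARK_E2F_TARGETS", "HALLMARK_MYC_TARGETS_V1")
sig_own  <- c("HALLMARK_E2F_TARGETS", "HALLMARK_MYC_TARGETS_V1", "HALLMARK_APOPTOSIS")

shared  <- intersect(sig_tcga, sig_own)                        # pathways found in both
jaccard <- length(shared) / length(union(sig_tcga, sig_own))   # overlap as a fraction of all pathways seen
```

Because each dataset is analysed on its own scale, this comparison does not require the expression values themselves to be comparable.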

ADD REPLY • link 4 weeks ago by Radu Tanasa ▴ 140

Thank you for your answer. Just to clarify: you recommend looping over the gene expression matrix per sample, passing each to GSVA independently, and then merging the resulting GSVA gene-set scores before starting a meta-analysis?
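The workflow described here (score each dataset separately, then combine the score matrices) can be sketched as follows. The scoring function is a deliberately simplified stand-in, not the GSVA algorithm: it takes the mean within-sample rank of the gene-set members. With real data, the call to the hypothetical `score_sets` helper would be replaced by GSVA. All data are toy values:

```r
# Simplified stand-in for a single-sample score (NOT the GSVA algorithm):
# mean within-sample rank of the gene-set members, scaled to (0, 1].
score_sets <- function(expr, gene_sets) {
  ranks <- apply(expr, 2, rank) / nrow(expr)     # rank genes within each sample
  rownames(ranks) <- rownames(expr)
  do.call(rbind, lapply(gene_sets, function(g) {
    colMeans(ranks[intersect(g, rownames(ranks)), , drop = FALSE])
  }))
}

genes <- paste0("g", 1:4)
expr_tcga <- matrix(c(1, 2, 3, 4,
                      4, 3, 2, 1),
                    nrow = 4, dimnames = list(genes, c("T1", "T2")))
expr_mine <- matrix(c(10, 0, 5, 7),
                    nrow = 4, dimnames = list(genes, "P1"))
gene_sets <- list(setA = c("g1", "g2"), setB = c("g3", "g4"))

# Score each dataset on its own, then combine the gene-set x sample matrices
scores_tcga <- score_sets(expr_tcga, gene_sets)
scores_mine <- score_sets(expr_mine, gene_sets)
all_scores  <- cbind(scores_tcga, scores_mine)
```

A purely within-sample score like this one does not depend on which other samples are in the matrix. Whether that holds for a given GSVA configuration should be checked against the GSVA documentation.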

ADD REPLY • link 4 weeks ago by theodore ▴ 90

score 0 · Answer 1 · 2024-11-12

13 days ago

Zhenyu Zhang ★ 1.2k

For a careful analysis, how about downloading the data from the GDC and running your own samples through the GDC RNA-Seq pipeline? Then you can use a standard DESeq2/edgeR/limma analysis and control for covariates.
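Controlling for covariates usually means putting them in the model formula or design matrix that DESeq2, edgeR and limma accept. A minimal base-R sketch, with a hypothetical sample sheet (the column names `source` and `group` are assumptions), of building such a design and checking that the effect of interest is not confounded with the data source:

```r
# Hypothetical sample sheet: 'source' is the batch, 'group' the biology of interest
meta <- data.frame(
  source = factor(c("TCGA", "TCGA", "TCGA", "own", "own", "own")),
  group  = factor(c("tumourA", "tumourB", "tumourA", "tumourA", "tumourB", "tumourB"))
)

design <- model.matrix(~ source + group, data = meta)   # covariate first, then the effect of interest

# The group effect is only estimable if the design has full column rank
is_full_rank <- qr(design)$rank == ncol(design)

# Confounded case: every 'own' sample is tumourB and every TCGA sample is tumourA
meta_bad <- transform(meta, group = factor(ifelse(source == "own", "tumourB", "tumourA")))
design_bad <- model.matrix(~ source + group, data = meta_bad)
is_confounded <- qr(design_bad)$rank < ncol(design_bad)
```

In the second case, source and group cannot be separated by any model, which is the situation that arises when all samples of one cancer type come from a single lab.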

ADD COMMENT • link 13 days ago by Zhenyu Zhang ★ 1.2k

The whole RNA-seq pipeline, from FASTQ to counts, for all those thousands of samples?

ADD REPLY • link 13 days ago by theodore ▴ 90

I don't know how many samples you have. The TCGA samples have already been processed, so you only need to process your own samples.