Question

Which data to choose for correlation examination

15 days ago

lyan125 ▴ 10

Hello, I want to run WGCNA or Spearman correlation across multiple groups in the TCGA Pan-Cancer data.

I am currently using the Xena browser to download data and cBioPortal to examine mutations. I am undecided between two datasets. The first is "tcga_RSEM_gene_tpm", which is already TPM-normalized and contains about 60k features, but is probably not batch-corrected, and correcting it myself would be difficult with so many features. The second is "EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena", which has already had batch effects removed, but contains far fewer features and would need additional normalization.

I appreciate any help you can provide.

correlation WGCNA TPM batch spearman • 315 views

score 0 · Answer 1 · 2024-12-11

15 days ago

Zhenyu Zhang ★ 1.2k

I would use the batch-corrected data. I assume the second dataset you mentioned is the same as EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv, available from the PanCanAtlas publication page (https://gdc.cancer.gov/about-data/publications/pancanatlas).

By the way, I find it odd that people now prefer to download data from Xena and cBioPortal. How will you cite your data source and methods if you find something worth publishing? Why not get the data from the original PanCan publications, where the full details of how the data were generated are available?
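For the Spearman part of the question, here is a minimal sketch of per-group gene-gene correlation. It assumes the chosen matrix has been loaded as a pandas DataFrame with genes as rows and samples as columns, and that a hypothetical `groups` Series maps each sample ID to a group label. The function name, the variance filter, and the `top_n` value are illustrative assumptions, not something taken from either dataset or from WGCNA.

```python
import numpy as np
import pandas as pd


def spearman_by_group(expr: pd.DataFrame, groups: pd.Series, top_n: int = 1000) -> dict:
    """Return one gene-by-gene Spearman correlation matrix per group.

    expr   : genes x samples expression matrix (assumed layout)
    groups : sample ID -> group label (hypothetical input)
    top_n  : keep only the top_n most variable genes to limit matrix size
    """
    # Keep the most variable genes so the correlation matrix stays manageable.
    variances = expr.var(axis=1)
    keep = variances.sort_values(ascending=False).index[:top_n]
    filtered = expr.loc[keep]

    result = {}
    for label, sample_ids in groups.groupby(groups).groups.items():
        # Restrict to samples of this group that are present in the matrix.
        cols = [s for s in sample_ids if s in filtered.columns]
        # DataFrame.corr correlates columns, so transpose to samples x genes.
        result[label] = filtered[cols].T.corr(method="spearman")
    return result


# Toy data only; values are random and carry no biological meaning.
rng = np.random.default_rng(0)
samples = [f"S{i}" for i in range(8)]
genes = [f"G{i}" for i in range(5)]
expr = pd.DataFrame(rng.random((5, 8)), index=genes, columns=samples)
groups = pd.Series(["A"] * 4 + ["B"] * 4, index=samples)

corr = spearman_by_group(expr, groups, top_n=3)
print(corr["A"].shape)  # (3, 3)
```

Filtering to a few thousand variable genes before correlating is one common way to keep a 60k-feature matrix tractable; the right cutoff depends on the data and is left open here.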

Reply

15 days ago

lyan125 ▴ 10

Thank you.

Yes, of course. It's just very convenient and fast, much faster than downloading from the publications or from GDC itself, and it's good to save some time. As long as I acknowledge how they produced the data, why shouldn't I use Xena or cBioPortal?

Reply

15 days ago

lyan125 ▴ 10

Just a small note about this approach. After running the analyses, I will need to intersect the results (the genes) with two additional data frames, both of which are TPM/FPKM-normalized. If I understand correctly, the batch-corrected dataset can't be converted to TPM/FPKM unless I start from raw counts and each transcript's length. Am I correct?
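On the normalization question: the standard TPM definition does require a length for every gene or transcript alongside the raw counts. Below is a minimal sketch of that calculation, assuming a hypothetical `counts` DataFrame (genes x samples) and a hypothetical `lengths_bp` Series of lengths in base pairs. Neither object comes from the files discussed in this thread, and whether the batch-corrected values can be mapped back to counts is not addressed here.

```python
import pandas as pd


def counts_to_tpm(counts: pd.DataFrame, lengths_bp: pd.Series) -> pd.DataFrame:
    """Convert raw counts (genes x samples) to TPM using per-gene lengths in bp."""
    # Align lengths to the count matrix rows; genes without a length are dropped.
    common = counts.index.intersection(lengths_bp.index)
    counts = counts.loc[common]
    lengths_kb = lengths_bp.loc[common] / 1000.0

    # Reads per kilobase: divide each gene's counts by its length in kb.
    rpk = counts.div(lengths_kb, axis=0)
    # Scale each sample so its values sum to one million.
    return rpk.div(rpk.sum(axis=0), axis=1) * 1e6


# Toy numbers for illustration only.
counts = pd.DataFrame(
    {"S1": [100, 200, 300], "S2": [50, 50, 400]},
    index=["G1", "G2", "G3"],
)
lengths_bp = pd.Series({"G1": 1000, "G2": 2000, "G3": 3000})

tpm = counts_to_tpm(counts, lengths_bp)
print(tpm.sum(axis=0))  # each sample sums to 1e6
```

Intersecting gene lists across data frames only needs matching identifiers, for example `tpm.index.intersection(other.index)` with a hypothetical second frame `other`; comparing the expression values themselves is where the normalization mismatch matters.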