Question

CPTAC, TCGA, RNA-seq, validation


2.3 years ago

Rob ▴ 170

Hi friends, I hope you are all doing well.

I want to validate my TCGA analysis with CPTAC transcriptomic data. I don't know why in the validation, my model classifies all patients as one phenotype (I expected two: with and without a feature).

Does anyone have an idea why this is happening? Also, do you know what form the CPTAC data is in? Is it z-scored? I transformed my TCGA data to z-scores to be consistent with the CPTAC data.
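One way to approach the "is it z-score?" question is to look at per-gene summary statistics of the downloaded matrix. The sketch below is only a heuristic and assumes a samples x genes matrix; the function name `looks_z_scored` and the tolerance are hypothetical choices for illustration. It only detects z-scores computed per gene across the samples in the same file, not z-scores computed against some other reference set.

```python
import numpy as np

def looks_z_scored(expr, tol=0.1):
    """Heuristic: True if every gene (column) has mean ~0 and SD ~1 across samples."""
    expr = np.asarray(expr, dtype=float)
    gene_means = expr.mean(axis=0)
    gene_sds = expr.std(axis=0, ddof=1)
    return bool(np.all(np.abs(gene_means) < tol)
                and np.all(np.abs(gene_sds - 1.0) < tol))

# Toy data standing in for real expression matrices (not CPTAC or TCGA values).
rng = np.random.default_rng(0)
raw_like = rng.lognormal(mean=3.0, sigma=1.0, size=(50, 4))
z_like = (raw_like - raw_like.mean(axis=0)) / raw_like.std(axis=0, ddof=1)

print(looks_z_scored(raw_like))  # untransformed-looking values
print(looks_z_scored(z_like))    # per-gene standardized values
```

Running the same check on both cohorts before applying the classifier would at least show whether the two matrices are on the same scale.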

CPTAC TCGA RNA-seq • 1.8k views

ADD COMMENT • link updated 2.3 years ago by Ernest Bonat ▴ 30 • written 2.3 years ago by Rob ▴ 170


Hello,

Could you explain more about which TCGA data you are using? Feel free to post 3-5 rows and indicate the features and label(s), so we can see how this could be set up as a binary classification machine learning project.

ADD REPLY • link 2.3 years ago by Ernest Bonat ▴ 30


Thanks @Ernest for responding. I used the HTSeq raw count data from TCGA. I normalized the counts and then calculated z-scores. Then I built models and applied the best model (classifier) to the CPTAC data to validate my work.

This is my TCGA data after normalization and converting to Z-score:

ADD REPLY • link 2.3 years ago by Rob ▴ 170


Thanks @Rob. I understand that data scaling (normalization) is the next step after the data split in a machine learning project workflow, but why do you need to calculate the z-score? Can you share the link where you downloaded the HTSeq raw count data of TCGA? Feel free to read the following blog post: Apply Machine Learning Algorithms for Genomics Data Classification.

ADD REPLY • link 2.3 years ago by Ernest Bonat ▴ 30

score 1 · Answer 1 · 2022-07-26


2.3 years ago

i.sudbery 20k

My understanding is that CPTAC is a proteomics project, and therefore the measurements will be proteomics data, whereas the TCGA data is RNA-seq (among other things), and therefore transcriptomics. I think it is not surprising that a model trained on transcriptomics data doesn't work when you apply it to proteomics data.

Transcript level is not perfectly correlated with protein level (far from it in some cases). In addition, RNA-seq data will quantify many genes that are not in the proteomics data (such as non-coding RNAs, or different splice isoforms which may produce the same or different peptides). Finally, each technique is subject to different biases.

ADD COMMENT • link 2.3 years ago by i.sudbery 20k


My bad, there appears to be transcriptome data in CPTAC as well.

ADD REPLY • link 2.3 years ago by i.sudbery 20k


Yes, you made good points. I have seen that some mRNA download sites also include a file with a normalized z-score dataset. I would like to know whether this is best practice.

ADD REPLY • link 2.3 years ago by Ernest Bonat ▴ 30


No, best practice would simply be to provide the raw, unchanged counts, because normalization of RNA-seq data is trivially simple starting from these counts via packages like edgeR or DESeq2 (it is really just a one-liner). The same goes for standardization or any simple transformation like log2. Providing these transformed values for download, often without any code, is just an annoying black box (all IMO).
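To illustrate how short the path from raw counts to standardized values can be, here is a minimal NumPy sketch. It is not what edgeR or DESeq2 do: it uses plain library-size scaling (counts per million) rather than TMM or median-of-ratios, and the pseudocount of 1 and the function names `log_cpm` and `zscore_genes` are arbitrary choices made for this example.

```python
import numpy as np

def log_cpm(counts, pseudocount=1.0):
    """Samples x genes raw counts -> log2(counts per million + pseudocount)."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=1, keepdims=True)  # total counts per sample
    cpm = counts / lib_sizes * 1e6
    return np.log2(cpm + pseudocount)

def zscore_genes(values):
    """Standardize each gene (column) to mean 0 and SD 1 across samples."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    sd = values.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # leave constant genes at 0 instead of dividing by zero
    return (values - mean) / sd

# Toy raw counts: 3 samples x 3 genes (made-up numbers).
counts = np.array([[10, 0, 90],
                   [200, 50, 750],
                   [30, 30, 40]])
logged = log_cpm(counts)
z = zscore_genes(logged)
```

Because every step starts from the raw counts, anyone can reproduce or swap out the transformation, which is the point being made about downloads that ship only transformed values.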

ADD REPLY • link 2.3 years ago by ATpoint 85k


Sure, but you will need to scale the X features in machine learning before fitting the models anyway...
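A small sketch of what scaling before fitting can look like when a separate validation cohort is involved: the mean and SD are estimated on the training cohort only and then reused for the validation cohort. The helper names `fit_scaler` and `apply_scaler` and the simulated data are hypothetical. The toy validation cohort is deliberately shifted to mimic a cohort or platform offset, which is one possible (not confirmed) reason a classifier could assign every validation sample to the same class.

```python
import numpy as np

def fit_scaler(train):
    """Estimate per-feature mean and SD on the training cohort only."""
    train = np.asarray(train, dtype=float)
    mean = train.mean(axis=0)
    sd = train.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # avoid division by zero for constant features
    return mean, sd

def apply_scaler(data, mean, sd):
    """Scale any cohort with the statistics learned from the training cohort."""
    return (np.asarray(data, dtype=float) - mean) / sd

rng = np.random.default_rng(1)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # simulated training cohort
valid = rng.normal(loc=8.0, scale=2.0, size=(40, 3))   # simulated, shifted cohort

mean, sd = fit_scaler(train)
train_z = apply_scaler(train, mean, sd)
valid_z = apply_scaler(valid, mean, sd)

print(train_z.mean(axis=0))  # centered on 0 by construction
print(valid_z.mean(axis=0))  # shifted away from 0 by the cohort offset
```

If the validation means sit far from 0 like this, the two cohorts are not on a comparable scale, and it is worth checking how each dataset was processed before trusting the predictions.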

ADD REPLY • link 2.3 years ago by Ernest Bonat ▴ 30