I have to use two datasets, one for training and the other for testing. My training dataset comes from TCGA and its workflow type is FPKM-UQ (HTSeq-FPKM and HTSeq-count are also available for the same RNA-seq data in the GDC portal), but my test RNA-seq dataset was normalized with the transcripts per million (TPM) method. To get reasonable results, should I convert the two datasets to a single normalization method, or can I train on the FPKM-UQ data and test my model on the TPM data?
I would appreciate it if anybody could share their comments with me.
Best Regards,
Mohammad
Dear Dr. Warden, I do not have access to the controlled-access TCGA data. Basically, I would like to solve my problem computationally, using the data that is available. For this reason, I want to know: when my test dataset is normalized as TPM, which TCGA workflow type is a good choice for the training dataset: FPKM, FPKM-UQ, or HTSeq-count?
I would appreciate it if you could share your comments with me.
Best Regards
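One way to put the two datasets on a common scale is to rescale the training values to TPM. The sketch below is a minimal illustration, not a validated pipeline. It assumes expression matrices laid out as genes x samples, and the function names `fpkm_to_tpm` and `counts_to_tpm` are hypothetical. Because FPKM and FPKM-UQ are each a per-sample constant times count/length, rescaling every sample so that it sums to one million should give TPM-like values over the genes included. The result is only comparable to the test set if both use the same gene set and gene length definitions, which is an assumption worth checking.

```python
import numpy as np

def fpkm_to_tpm(fpkm):
    """Rescale a genes x samples FPKM (or FPKM-UQ) matrix so each sample sums to 1e6."""
    fpkm = np.asarray(fpkm, dtype=float)
    col_sums = fpkm.sum(axis=0, keepdims=True)  # one total per sample
    return fpkm / col_sums * 1e6

def counts_to_tpm(counts, gene_lengths_bp):
    """Convert raw counts (genes x samples) to TPM given gene lengths in base pairs."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float)[:, None] / 1e3
    rate = counts / lengths_kb  # reads per kilobase
    return rate / rate.sum(axis=0, keepdims=True) * 1e6

# Toy data only (3 genes, 2 samples); not real TCGA values.
toy_fpkm = np.array([[10.0, 0.0],
                     [30.0, 5.0],
                     [60.0, 15.0]])
toy_tpm = fpkm_to_tpm(toy_fpkm)

toy_counts = np.array([[100.0, 50.0],
                       [200.0, 50.0]])
toy_lengths_bp = [1000, 2000]
toy_tpm_from_counts = counts_to_tpm(toy_counts, toy_lengths_bp)
```

Restricting both datasets to their shared genes before rescaling keeps the per-sample totals comparable between training and test data.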
Hi,
I can't really promise any method will be the best option ahead of time - I would expect some troubleshooting / benchmarking to be necessary for each project.
The best option that I can think of is to split the TCGA data into training and validation datasets (or, really, validation set #1 and validation set #2). You can test out different strategies on the "training" set (validation set #1) and try to resist the temptation to check the "validation" set (validation set #2) until you start to feel comfortable with an option. The real test comes from checking predictions in completely new samples (and you may decide to use something lower throughput and less expensive than RNA-Seq for that). However, that is currently the best guideline that I can think of.
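A minimal sketch of such a split, assuming the TCGA samples are identified by a list of IDs. The function name `split_samples`, the 30% hold-out fraction, and the fixed seed are illustrative choices, not recommendations:

```python
import numpy as np

def split_samples(sample_ids, holdout_fraction=0.3, seed=0):
    """Randomly split sample IDs into a working set and a held-out set."""
    rng = np.random.default_rng(seed)  # fixed seed so the split is reproducible
    shuffled = rng.permutation(np.array(sample_ids))
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    holdout = shuffled[:n_holdout].tolist()  # validation set #2: leave untouched
    working = shuffled[n_holdout:].tolist()  # validation set #1: try strategies here
    return working, holdout

# Hypothetical sample IDs for illustration only.
ids = [f"sample_{i:02d}" for i in range(10)]
working_ids, holdout_ids = split_samples(ids)
```

Saving the held-out IDs to a file once, and reusing that file, helps avoid re-drawing the split after looking at results.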
Best Wishes, Charles