Hi! This may be a silly question, but I have been unable to find an answer that works for me. I am working with TCGA data for prostate cancer. I downloaded the level 3 data, which consists of raw expression counts for 549 samples. 52 of them were paired (healthy/unhealthy tissue from the same patient). Using edgeR, I was able to construct a list of genes differentially expressed between healthy and unhealthy tissue, hopefully leaving out the "specific sample bias".
Using edgeR, I calculated log2(counts) for the top 1000 differentially expressed genes and used them as features to classify the 549 samples (70% for training, 30% for testing). I got pretty good results (around 90% accuracy on the test set). I used k-means, k-NN and random forest approaches.
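For readers who want to reproduce this kind of 70/30 split, here is a minimal Python sketch of a stratified split that keeps the healthy/unhealthy proportions similar in both parts. The function name `stratified_split`, the labels `"tumor"`/`"normal"` and the toy class sizes are all hypothetical; this is not the code used for the results above.

```python
import random

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), splitting each class separately.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (made-up sizes, not the real TCGA composition).
labels = ["tumor"] * 90 + ["normal"] * 10
train_idx, test_idx = stratified_split(labels)
```

With imbalanced classes, a stratified split makes it easier to tell whether a high accuracy reflects real discrimination or just the majority class.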
I wanted to see if these results were applicable to other datasets, so I downloaded expression data from ICGC (200 cancer samples) and GTEx (around another 200 healthy samples). To my surprise, both healthy and unhealthy samples are classified as healthy!
Does anybody have any clue about where my problem might be? I guess I am not normalizing the counts properly and that's why my ML methods don't work on other datasets. Thanks, everybody, in advance!
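To illustrate the normalization concern, here is a minimal Python sketch comparing plain log2(counts) with library-size-normalized log2 counts-per-million (CPM). It is a simplification and an assumption about what could be going wrong, not a diagnosis: the function name `log2_cpm` and the toy counts are hypothetical, and edgeR's own `cpm(..., log=TRUE)` handles the prior count differently (it scales it by library size).

```python
import math

def log2_cpm(counts, prior_count=1.0):
    """Convert one sample's raw gene counts to log2 counts-per-million.

    Simplified sketch: a fixed prior_count is added to the CPM value
    to avoid log2(0). This is not edgeR's exact formula.
    """
    lib_size = sum(counts)
    if lib_size <= 0:
        raise ValueError("sample has no reads")
    return [math.log2(c / lib_size * 1e6 + prior_count) for c in counts]

# Two toy samples with the same relative expression but a 10x
# difference in sequencing depth (made-up numbers).
shallow = [10, 90, 900]
deep = [100, 900, 9000]

# Plain log2(count + 1) differs between the two samples...
raw_shallow = [math.log2(c + 1) for c in shallow]
raw_deep = [math.log2(c + 1) for c in deep]

# ...while log2 CPM gives the same values for both.
cpm_shallow = log2_cpm(shallow)
cpm_deep = log2_cpm(deep)
```

If sequencing depth differs systematically between TCGA, ICGC and GTEx, features based on unnormalized log2(counts) would shift between datasets in a way that a classifier trained on TCGA alone cannot account for. Library-size normalization only addresses depth; other differences between the datasets' processing pipelines would need separate handling.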
Have you tried looking at the ML model too? What processing did you do for the ICGC and GTEx data?