Hi,
I'm currently working on a cancer subtype classifier project, and I'm having trouble getting the classifiers built from microarray and RNA-seq data to agree with each other. Whatever I try, the classifiers built from the two data sources disagree on about 20% of samples. (I know this because some samples were profiled on both platforms, and the two classifiers assign 20% of them to different subtypes, even though each classifier performs well on its own data source.)
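For reference, the disagreement figure can be computed as in the minimal sketch below. The sample IDs, subtype labels, and the function name `disagreement_rate` are hypothetical and only illustrate the comparison on samples shared by both platforms.

```python
def disagreement_rate(labels_a, labels_b):
    """Fraction of shared samples that two classifiers label differently.

    labels_a, labels_b: dicts mapping sample ID -> predicted subtype.
    Only samples present in both dicts are compared.
    """
    shared = sorted(set(labels_a) & set(labels_b))
    if not shared:
        raise ValueError("no common samples to compare")
    differing = [s for s in shared if labels_a[s] != labels_b[s]]
    return len(differing) / len(shared), differing


# Hypothetical predictions for five samples profiled on both platforms.
array_calls = {"S1": "A", "S2": "B", "S3": "C", "S4": "D", "S5": "A"}
rnaseq_calls = {"S1": "A", "S2": "B", "S3": "C", "S4": "D", "S5": "B"}

rate, differing = disagreement_rate(array_calls, rnaseq_calls)
print(rate, differing)
```

Listing the discordant sample IDs, not just the rate, makes it easier to check whether the disagreements concentrate in one subtype.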
I don't see a question here, just a statement of what you've done. I assume the question is either (A) why might the classifier give such discordant results when trained on the different data types or (B) how might you try to avoid this issue. In either case, please update your post so we know what your actual question is.
Thanks, I updated below.
Hi Devon Ryan,
thanks a lot.
Sorry for being ambiguous. You pointed out both questions exactly. I'm surprised by the discordance, and so far I can't find a way to resolve it.
Hi Irsan,
thanks for pointing that out. It is very helpful indeed.
I got the log2-transformed microarray data and the RNA-seq expression data from a public database. The correlation between the two datasets is around 0.72. Can the correlation between the two platforms really reach as high as 0.95? It seems likely that I cannot get the classifiers to agree because of this low correlation; am I right?
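The post does not say whether the 0.72 was computed per gene or per sample. As a rough sketch, assuming a per-gene Pearson correlation across samples measured on both platforms, it could be computed like this. The gene names, values, and helper names are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def per_gene_correlation(array_expr, rnaseq_expr):
    """Both inputs: dict gene -> values, with samples in the same order."""
    return {g: pearson(array_expr[g], rnaseq_expr[g])
            for g in array_expr if g in rnaseq_expr}


# Hypothetical log2 expression for four matched samples.
array_expr = {"GENE1": [1.0, 2.0, 3.0, 4.0], "GENE2": [1.0, 2.0, 3.0, 4.0]}
rnaseq_expr = {"GENE1": [2.0, 4.0, 6.0, 8.0], "GENE2": [0.0, 0.0, 0.0, 0.0]}

gene_r = per_gene_correlation(array_expr, rnaseq_expr)
print(gene_r)
```

Genes that are constant on one platform (for example, all zeros in RNA-seq) come back as NaN here and are worth excluding before summarising the correlations.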
Histogram plot of the RNA-seq data, log2(x+1) transformed. I have already filtered out non-expressed genes, but there is still a large peak at 0. This is because some genes are expressed in only one or two samples and are 0 in most samples.
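One possible way to reduce the peak at 0 is to require expression in a minimum number of samples before applying the log2(x+1) transform. The sketch below is only an illustration: the thresholds `min_samples` and `min_value`, the function names, and the example values are assumptions, not settings taken from the post.

```python
import math


def log2p1(values):
    """log2(x + 1) transform of a list of non-negative expression values."""
    return [math.log2(v + 1) for v in values]


def filter_sparse_genes(expr, min_samples=3, min_value=1.0):
    """Keep genes with at least `min_samples` samples at or above `min_value`.

    expr: dict gene -> raw (untransformed) expression values per sample.
    Returns the kept genes with log2(x + 1) transformed values.
    """
    kept = {}
    for gene, values in expr.items():
        n_expressed = sum(1 for v in values if v >= min_value)
        if n_expressed >= min_samples:
            kept[gene] = log2p1(values)
    return kept


# Hypothetical raw values: GENE_A is expressed in a single sample only.
raw = {
    "GENE_A": [0, 0, 0, 15, 0],
    "GENE_B": [3, 7, 1, 15, 31],
}
filtered = filter_sparse_genes(raw)
print(filtered)
```

A gene that is truly subtype-specific could also be expressed in only a few samples, so the cut-off is a trade-off and not a fixed rule.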
Histogram plot of the microarray data
Thanks again. The distribution of gene correlations actually has only one peak. Following your idea, I ranked the genes by correlation and chose only the highly correlated genes for classification, but I still cannot end up with a classifier with a low error rate.
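The ranking step described here might look like the following sketch, which takes precomputed per-gene cross-platform correlations and keeps the best ones. The correlation values, the cut-offs, and the function name `top_correlated_genes` are hypothetical.

```python
import math


def top_correlated_genes(gene_corr, n_top=None, min_r=None):
    """Rank genes by cross-platform correlation, highest first.

    gene_corr: dict gene -> correlation (NaN entries are dropped).
    n_top: keep at most this many genes; min_r: keep genes with r >= min_r.
    """
    ranked = sorted(
        ((g, r) for g, r in gene_corr.items() if not math.isnan(r)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if min_r is not None:
        ranked = [(g, r) for g, r in ranked if r >= min_r]
    if n_top is not None:
        ranked = ranked[:n_top]
    return [g for g, _ in ranked]


# Hypothetical per-gene correlations between the two platforms.
gene_corr = {"G1": 0.91, "G2": 0.35, "G3": 0.78, "G4": float("nan")}

best_two = top_correlated_genes(gene_corr, n_top=2)
above_cutoff = top_correlated_genes(gene_corr, min_r=0.3)
print(best_two, above_cutoff)
```

Selecting on correlation alone does not guarantee the retained genes separate the subtypes, which may be one reason the error rate stays high.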
ps: how I did the classification
Based on consensus clustering, I decided to classify the expression data into 4 groups. Using the group annotations from the clustering, I trained a classifier with PAM. Of course, far fewer genes (around 500), selected from the gene set used for clustering, are used to train the classifier.
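PAM is a nearest shrunken centroid classifier. The sketch below is a simplified pure-Python version of that idea, not the actual `pamr` implementation and not necessarily the settings used here. The threshold `delta`, the toy data, and the function names are assumptions for illustration, and it uses the `sqrt(1/n_k - 1/n)` scaling, which differs between published descriptions.

```python
import math
from statistics import median


def fit_shrunken_centroids(X, y, delta):
    """X: list of samples (lists of gene values); y: class labels; delta: threshold."""
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    overall = [sum(row[i] for row in X) / n for i in range(p)]
    members = {k: [row for row, lab in zip(X, y) if lab == k] for k in classes}
    centroids = {k: [sum(r[i] for r in rows) / len(rows) for i in range(p)]
                 for k, rows in members.items()}
    # Pooled within-class standard deviation per gene, plus a median offset s0.
    s = []
    for i in range(p):
        ss = sum((r[i] - centroids[k][i]) ** 2
                 for k, rows in members.items() for r in rows)
        s.append(math.sqrt(ss / (n - len(classes))))
    s0 = median(s)
    scale = [si + s0 for si in s]
    shrunken = {}
    for k, rows in members.items():
        m_k = math.sqrt(1.0 / len(rows) - 1.0 / n)
        shrunken[k] = []
        for i in range(p):
            d = (centroids[k][i] - overall[i]) / (m_k * scale[i])
            # Soft-threshold: shrink the standardized difference towards zero.
            d_shrunk = math.copysign(max(abs(d) - delta, 0.0), d)
            shrunken[k].append(overall[i] + m_k * scale[i] * d_shrunk)
    priors = {k: len(rows) / n for k, rows in members.items()}
    return {"centroids": shrunken, "scale": scale, "priors": priors}


def predict(model, x):
    """Assign x to the class with the smallest standardized distance score."""
    def score(k):
        c = model["centroids"][k]
        dist = sum(((xi - ci) / sc) ** 2
                   for xi, ci, sc in zip(x, c, model["scale"]))
        return dist - 2.0 * math.log(model["priors"][k])
    return min(model["centroids"], key=score)


# Toy data: gene 0 separates the two groups, gene 1 is uninformative.
X = [[5.0, 1.0], [6.0, 1.2], [5.5, 0.8], [1.0, 1.1], [0.5, 0.9], [1.5, 1.0]]
y = ["A", "A", "A", "B", "B", "B"]
model = fit_shrunken_centroids(X, y, delta=1.0)
print(predict(model, [5.2, 1.0]), predict(model, [0.8, 1.0]))
```

Genes whose shrunken centroids are identical in every class, like gene 1 in the toy data, no longer influence the prediction, which is how the threshold performs gene selection.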