Hi,
I would like to cluster or run PCA on microarray samples across two different platforms. I am afraid that clustering on the genes common to both platforms would be influenced more by the platform (different probes measuring different sequences of the transcripts, and on different scales) than by the treatment effect. As the upregulated processes (enriched GO terms, pathways) are generally more consistent between platforms, I would like to cluster based on GO terms.
Suppose cells are treated with compound A, B, C or D (each in several replicates). Comparing each treatment to the untreated control yields lists of differentially regulated genes. From these I determine the enriched GO terms (say for upregulated genes): GO.A, GO.B, GO.C and GO.D. This would be measured on platform 1. Then I would have cells treated with compound E, compare them to the untreated control etc. to get GO.E. This experiment would be on platform 2. I would like to know how similar the effect of treatment E is to A, B, C and D.
One solution that comes to my mind is to first find the GO terms that are present on both platforms. Then compute GO.A, GO.B, GO.C, GO.D and GO.E. The GO terms not significantly changed (upregulated) would get a p-value of 1, so I would have p-values for all of the common GO terms. Then I would do, for example, PCA on the p-values (I think they should be scaled first) and look at the distances among the samples.
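A minimal sketch of the matrix described above, in plain Python. The GO IDs, p-values and the list of common terms are made up for illustration, and the -log10 transform is an assumption used here as one possible way to scale the p-values, not something fixed by the approach. A term that is not significant for a treatment gets p = 1, which becomes a score of 0. PCA or clustering could then be run on the same score vectors; here only the distances from E to A-D are computed.

```python
import math

# Hypothetical enrichment p-values per treatment (all values are made up).
go_pvalues = {
    "A": {"GO:0006915": 1e-6, "GO:0006954": 1e-3},
    "B": {"GO:0006915": 1e-5, "GO:0008283": 1e-4},
    "C": {"GO:0008283": 1e-6, "GO:0007049": 1e-5},
    "D": {"GO:0007049": 1e-4, "GO:0006954": 1e-2},
    "E": {"GO:0006915": 1e-4, "GO:0006954": 1e-4},  # measured on platform 2
}

# GO terms testable on both platforms (assumed to be known in advance).
common_terms = ["GO:0006915", "GO:0006954", "GO:0008283", "GO:0007049"]


def score_vector(pvals, terms):
    """Return -log10(p) per term; terms missing from pvals get p = 1 (score 0)."""
    return [-math.log10(pvals.get(term, 1.0)) for term in terms]


def euclidean(u, v):
    """Euclidean distance between two equally long score vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# One score vector per treatment, all over the same ordered list of terms.
scores = {name: score_vector(p, common_terms) for name, p in go_pvalues.items()}

# Distance from treatment E to each platform-1 treatment.
distances = {name: euclidean(scores["E"], scores[name]) for name in "ABCD"}
closest = min(distances, key=distances.get)

print(distances)
print("closest to E:", closest)
```

Setting non-significant terms to p = 1 discards any sub-threshold signal, so keeping the actual p-value for every tested term, where available, may be preferable.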
Does this make sense? Is there a better way?
Any suggestions appreciated!
Vojta
It's an interesting approach. However, I think the variables used for PCA should in principle be independent of each other. GO terms, on the other hand, are structured as a tree, and I am not sure whether this would break the principle of independence.
Yes, to me that is one of the concerns: whether it breaks the independence assumption. But then again, is it viable to look at DEGs from 1x1 comparisons and then at their GO terms? If the data are cross-platform, the ideal would be to do cross-platform normalization first and then find DEGs for the 4x4 samples. That gives more statistically sound DEGs, on which GO analysis can be performed and then represented semantically.
Thank you for your insight. Do you think using enriched pathways instead of GO would fix this? Or do you have in mind another way to compare samples based on GO where the tree structure would not be a problem?
It all depends on what you want to categorize as pathways. In GO enrichment, Biological Process (BP) terms are closely associated with specific pathways, and even Molecular Function (MF) terms translate into pathways. So in a way you are trying to see how enriched your genes are for specific MF or BP terms, and if pathways that support your hypothesis are enriched in any of the BP or MF categories, then bingo, that will help you restrict your gene list. Usually when I refer to pathways I mean pathways in KEGG, Ingenuity or Reactome, but those are more like downstream biological answers that correspond to a specific design. I guess you are looking for a preliminary approach, so go ahead with GO terms and either do a PCA on them or a correlation plot to see which terms are closely associated. However, if you are going for PCA, shouldn't it be done on the enrichment scores rather than on the p-values? You could select the significant GO terms by p-value, keep their enrichment scores, use a Venn diagram to find the GO terms common to all samples, and then make a heatmap, PCA or correlation plot of those enrichment scores to understand how far apart the samples are.
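As a rough illustration of the correlation idea, the sketch below keeps only the GO terms scored in every treatment (the centre of the Venn diagram) and computes the Pearson correlation of enrichment scores between E and the other treatments. The scores and GO IDs are made up, and "enrichment score" is assumed here to be something like a fold enrichment; the choice of Pearson over a rank correlation is also just an assumption.

```python
import math

# Hypothetical enrichment scores per treatment (all values are made up).
enrichment = {
    "A": {"GO:0006915": 3.0, "GO:0006954": 2.0, "GO:0008283": 1.0},
    "B": {"GO:0006915": 1.0, "GO:0006954": 2.0, "GO:0008283": 3.0},
    "E": {"GO:0006915": 6.0, "GO:0006954": 4.0, "GO:0008283": 2.0,
          "GO:0007049": 5.0},
}

# Terms that have a score in every treatment, in a fixed order.
shared = sorted(set.intersection(*(set(s) for s in enrichment.values())))


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Score vectors restricted to the shared terms.
vectors = {name: [s[t] for t in shared] for name, s in enrichment.items()}

# Correlation of E with each of the other treatments.
corr_with_E = {
    name: pearson(vectors["E"], vectors[name])
    for name in vectors
    if name != "E"
}
print(shared)
print(corr_with_E)
```

Because correlation ignores overall scale, a treatment whose scores are a constant multiple of another's still correlates perfectly, which may help when the two platforms differ in dynamic range.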
Some links that could be informative:
Thanks for the links. However, they focus mainly on reducing the datasets to common, expressed genes. This still retains some bias; I read that one should verify that the probes target the same transcript region. Generally, though, the processes upregulated/downregulated on different platforms agree better than the individual genes do (Li et al. 2009).
I may do the GO analysis anyway and compare that manually (see side by side which lists are similar), but I thought there would be some better approach :)
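The side-by-side comparison of lists could also be given a number, for example with a Jaccard index on the sets of significant GO terms. This is only a sketch with made-up term sets, and the Jaccard index is an assumption here, one of several possible overlap measures.

```python
# Hypothetical sets of significantly enriched GO terms (made up).
significant_terms = {
    "A": {"GO:0006915", "GO:0006954"},
    "B": {"GO:0008283"},
    "E": {"GO:0006915", "GO:0006954", "GO:0007049"},
}


def jaccard(s, t):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


# Overlap of E's list with each of the other lists.
overlap_with_E = {
    name: jaccard(significant_terms["E"], terms)
    for name, terms in significant_terms.items()
    if name != "E"
}
print(overlap_with_E)
```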