Hi there, I have a problem with my hierarchical clustering and would appreciate any help. Let me start from the first step. To identify differentially expressed genes in several microarray studies (each study consists of 3 individual datasets; in total I have 15 datasets), I used the limma package from Bioconductor in R. I then kept the genes with an adjusted P-value below 0.05 and, from those, extracted a set of genes involved in, for example, the cell cycle. Finally, I used this gene set, with expression given as log fold change, for hierarchical clustering. From what I have read, Euclidean distance with complete linkage is the best choice for log-transformed data like mine. The problem is that when I cluster the 15 datasets, datasets from the same study group together in one cluster. How can I correct this misleading picture? Would it be possible to use a single control for all the treatment data from the different studies in R? Or should I take another approach?
Many thanks in advance
Can you show the design matrix, and in particular whether and how you checked for, or compensated for, potential batch effects?
Here is the image of the hierarchical clustering.
I think my mistake is that I did not consider the batch effect. I normalized each study separately and then clustered them together. How can I compensate for the batch effect, and in what way? Would it be a good idea to normalize all datasets together? I don't know how that would be possible, though. Any suggestions?
Please edit this post and see the changes I've made to see how to add images properly.
Images should be added using the image button, not the link button. You'll need the direct link to the image, not the link to the page hosting the image.
If you normalize each study separately, this result is entirely normal and expected: the datasets are scaled only within their own study, not relative to each other. If you use z-scores, you at least have to normalize all datasets together, leaving aside the question of whether comparing values from different studies makes sense at all given the batch effect.
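To make the point concrete, here is a minimal toy sketch (not the poster's data or pipeline). It assumes four made-up datasets from two hypothetical studies, where study B carries a constant offset of +3 on every gene. With raw values, the nearest neighbour by Euclidean distance is always the dataset from the same study. After subtracting each study's per-gene mean, a very crude stand-in for batch correction, the datasets with the same response pattern become nearest neighbours. In R, limma's `removeBatchEffect` or sva's `ComBat` are the usual tools for this kind of adjustment; the Python below only illustrates the idea, and all names and numbers in it are invented.

```python
import math

# Hypothetical toy data: log fold changes for 4 genes in 4 datasets.
# A1/A2 belong to study A, B1/B2 to study B.
study_of = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
logfc = {
    "A1": [1.0, -1.0, 0.5, 0.0],
    "A2": [-1.0, 1.0, 0.0, 0.5],
    "B1": [4.0, 2.0, 3.5, 3.0],  # same pattern as A1, shifted by +3 (study offset)
    "B2": [2.0, 4.0, 3.0, 3.5],  # same pattern as A2, shifted by +3 (study offset)
}

def euclidean(u, v):
    """Euclidean distance between two equally long expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest(data, name):
    """Name of the dataset closest to `name` by Euclidean distance."""
    others = [n for n in data if n != name]
    return min(others, key=lambda n: euclidean(data[name], data[n]))

def center_by_study(data, study_of):
    """Subtract, gene by gene, each study's mean from its own datasets."""
    centered = {}
    for study in set(study_of.values()):
        members = [n for n in data if study_of[n] == study]
        n_genes = len(data[members[0]])
        means = [sum(data[m][g] for m in members) / len(members)
                 for g in range(n_genes)]
        for m in members:
            centered[m] = [x - mu for x, mu in zip(data[m], means)]
    return centered

# Nearest neighbours on the raw values: driven by the study offset.
before = {n: nearest(logfc, n) for n in logfc}

# Nearest neighbours after per-study centering: driven by the response pattern.
after_data = center_by_study(logfc, study_of)
after = {n: nearest(after_data, n) for n in logfc}

print("before centering:", before)
print("after centering: ", after)
```

Note that this kind of centering also removes any real biological difference that happens to coincide with the study, so it is only safe when treatments are not completely confounded with studies.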
I know, but I want to normalize them so as to compensate for the batch effect and find which datasets are closest to CM, regardless of which study each dataset belongs to. Is there any way to do this?