Here's a simple approach, related to what Ben and Sean have said, but there are some things you'll need to clarify. Measurements on 300 samples likely represent data from several experiments, so you'll have to be explicit about what your data actually represents. Affymetrix technology measures transcript levels per sample. However, most experiments are designed to assess changes in transcript levels between conditions, so the changes are relative and absolute abundance is not known. For assessing co-expression, it matters whether your data represents absolute expression levels across 300 conditions or relative levels across 300 conditions, because the distance measures you would use to define similarity imply different things in each case. Which of the two does your data table represent?
Affymetrix abundance data is often turned into relative abundance data by taking the ratio of experiment over control for each gene. This is very useful for thinking about the biology, but the absolute abundance information is lost. Either way, whether you have 300 ratios of gene expression or 300 intensity measurements (i.e. abundance), you have a profile of gene expression. So the next thing to be clear about is how similarity of profiles is quantified, and what that implies about co-expression.
If you have absolute expression data, and you define a co-expressed gene as the one with the closest abundance profile, then you would use Euclidean distance as the similarity measure (this is the default distance in R's heatmap function). However, there's no hard and fast rule that two genes which are co-expressed across a plethora of biological stimuli are each expressed at the same concentration within the cell. Perhaps similarity of profile shape, regardless of absolute abundance, is sufficient, in which case correlation would be a good measure.
If the distinction isn't clear, look up what each distance actually measures, and consider three genes with the following profiles. g1: 125,400,800,1200; g2: 125,400,800,1200; g3: 425,700,1100,1500. You can see that g1 and g2 have identical profiles, and by both Euclidean and correlation distance measures they are identical. g3 is g1 shifted up by 300 in every sample, so by Pearson correlation g1, g2, and g3 are all identical, whereas by Euclidean distance g3 is different from g1 and g2. When it comes to ratios of expression, absolute abundance is out the window, but you can still assess similarity of profiles. From a biological perspective, similarity of profile is often considered co-expression, but you should think about how the measures above score similarity when the profiles are composed of gene expression ratios.
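
To make the three-gene comparison concrete, here is a minimal sketch in plain Python that computes both measures for the profiles above. The helper names (`euclidean`, `pearson`, `correlation_distance`) are mine, and defining correlation distance as 1 minus the Pearson coefficient is an assumption; it is one common convention, not the only one.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_distance(a, b):
    """Assumed convention: 1 - r, so perfectly correlated profiles score 0."""
    return 1.0 - pearson(a, b)

# The three profiles from the text.
g1 = [125, 400, 800, 1200]
g2 = [125, 400, 800, 1200]
g3 = [425, 700, 1100, 1500]

for name, other in (("g2", g2), ("g3", g3)):
    print(
        f"g1 vs {name}: "
        f"euclidean={euclidean(g1, other):.1f}, "
        f"correlation distance={abs(round(correlation_distance(g1, other), 6))}"
    )
```

Under these definitions g1 and g3 are 600 apart by Euclidean distance (a shift of 300 in each of four samples) but have a correlation distance of zero, which is the contrast described above.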
The answer to your question may depend on certain particulars of your data, so be clear about them. Define your data set, define co-expression, define your purpose. But in my experience, for most biological problems, I would say try a number of clustering methods, see how they differ, and see what they offer you in terms of organizing your data. I've found that when I create toy data sets with known combinations of profiles, there is no one method or solution that can pull them all out and re-organize them perfectly. Depending on your level of expertise, an easy package that lets you experiment with many methods and visualize the results, while requiring hardly any prior knowledge, is MeV (MultiExperiment Viewer).
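
As an illustration of the toy-data idea, the sketch below clusters five made-up profiles twice, once with Euclidean distance and once with correlation distance, and compares the groupings. Everything here is an assumption for demonstration: the gene names and values are hypothetical, average-linkage agglomerative clustering is just one of many methods you could try, and the dependency-free implementation is not what MeV or R does internally.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 - Pearson r (assumed convention)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

def average_linkage(profiles, dist, k):
    """Merge the two closest clusters (by mean pairwise distance) until k remain."""
    clusters = [[name] for name in profiles]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(dist(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe: j > i, so index i is unaffected
    return sorted(sorted(c) for c in clusters)

# Hypothetical toy data: "a" genes rise, "b" genes fall;
# a3 and b2 sit at a much higher abundance than the rest.
toy = {
    "a1": [100, 200, 400, 800],
    "a2": [110, 210, 390, 820],
    "a3": [1100, 1200, 1400, 1800],
    "b1": [800, 400, 200, 100],
    "b2": [1800, 1400, 1200, 1100],
}

by_abundance = average_linkage(toy, euclidean, k=2)
by_shape = average_linkage(toy, correlation_distance, k=2)
print("Euclidean:  ", by_abundance)
print("Correlation:", by_shape)
```

With this toy set, Euclidean distance groups genes by expression level (the falling gene b1 ends up with the low-abundance rising genes), while correlation distance groups them by direction of change. Neither grouping is wrong; they answer different definitions of co-expression.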
I have a similar query. My basic idea is to identify transcription factor binding sites (TFBS) upstream of the keratin gene. I want to do de novo discovery based on a search for over-represented sequences in the regulatory regions of the keratin gene. Therefore, my first idea is to search for genes co-expressed with my target gene (i.e., keratin). I have 15 RNA-seq gene expression data sets generated from developing feather cells. What strategy can I use to sort out the co-expressed genes?
Thanks
Dr. M. Joyraj Bhattacharjee
Your post is not an answer to the original question. Instead, you should ask a new question.