Question

Microarray Expression Data For Network Analysis

4

13.2 years ago

pixie@bioinfo ★ 1.5k

I have a couple of microarray datasets from different experiments, collected from ArrayExpress and GEO, for a particular disease.

I have analyzed them, built a coexpression matrix for each dataset (Pearson correlation coefficient >= 0.8), and combined all these matrices into a single coexpression network using Cytoscape.

The problem is that, even for the same disease, there are hardly any genes in common across all the datasets, probably because of the different platforms/populations/conditions used.

The net result is a network made up of isolated components. In this case I am not able to validate the statistical parameters of the network (topological coefficients, node degree distribution, etc.).

Is there a way around the problem?
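The procedure described in the question can be sketched roughly as follows. This is only a minimal illustration in plain Python, not the poster's actual pipeline: the gene names, the toy expression values, and the helper functions `coexpression_edges` and `connected_components` are all assumptions made for the example. It thresholds the Pearson correlation at 0.8 in each dataset separately, takes the union of the resulting edges, and shows how disconnected components appear when the datasets share few co-expressed genes.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def coexpression_edges(expr, cutoff=0.8):
    """Return gene pairs whose Pearson correlation is >= cutoff.

    expr maps a gene name to its expression values across the samples
    of ONE dataset (same sample order for every gene).
    """
    edges = set()
    for g1, g2 in combinations(sorted(expr), 2):
        if pearson(expr[g1], expr[g2]) >= cutoff:
            edges.add((g1, g2))
    return edges


def connected_components(edges):
    """Group the nodes of an undirected edge set into connected components."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy data (hypothetical): two datasets that share only gene G1.
dataset_a = {"G1": [1, 2, 3, 4, 5], "G2": [2, 4, 6, 8, 10.5], "G3": [5, 1, 4, 2, 3]}
dataset_b = {"G4": [1, 3, 2, 5, 4], "G5": [1.1, 3.2, 2.1, 5.3, 4.0], "G1": [3, 1, 2, 5, 4]}

# One thresholded edge set per dataset, then the union as the combined network.
combined = coexpression_edges(dataset_a) | coexpression_edges(dataset_b)
components = connected_components(combined)
```

In this toy case the shared gene G1 is strongly correlated with a partner only in the first dataset, so the combined network falls apart into two unconnected pairs, which mirrors the situation described above.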

network microarray • 4.4k views

updated 13.2 years ago by David Quigley 11k • written 13.2 years ago by pixie@bioinfo ★ 1.5k

1

I think you have stumbled upon a common problem with microarray (MA) analysis: weak reproducibility of disease-related gene lists between experiments that use different platforms and different designs. This has been reported in meta-studies, e.g. for MA analysis of cancer. Apart from lowering the correlation threshold, I don't believe much can be done about this. However, I would be surprised if the different MA platforms did not even share reporters for a large common subset of human genes, even though the measurements might not correlate. Could you perhaps focus on a single platform only?

13.2 years ago by Michael 55k

1

@olbzn: I used a software package called Cladist. You give it the expression values across all samples of the datasets as input (separately for up- and down-regulated genes). You can set the parameters as you see fit. Have a look at the tutorial; it's pretty simple.

12.9 years ago by pixie@bioinfo ★ 1.5k

0

The problem I am working on is disease-specific (diabetes) and tissue-specific (skeletal tissue). With these restrictions, we have about 5 datasets. If I imposed a restriction on the platform as well, I would have hardly anything left. I was, however, looking at coexpression databases to find lists of coexpressed genes, but I am not sure whether this procedure is right...

13.2 years ago by pixie@bioinfo ★ 1.5k

0

@Sanchari I found your post while looking for a way to build a coexpression matrix. I have two independent datasets and would like to generate a coexpression matrix for each (probably Pearson > 0.9 to start). It seems you have done just that. Would you mind letting me know exactly how you did it? Thanks!

12.9 years ago by Olbzn ▴ 180

score 2 · Answer 1 · 2011-08-29

Now, despite my somewhat pessimistic comment, I had some ideas about what could be helpful for the analysis.

First of all, your findings are not totally surprising to me. Assume that in any such experiment a large number of genes are not affected by the treatment or experimental conditions in question. Therefore, any correlation present in a single condition (say, condition 1) might be masked by random fluctuation in conditions 2-5. On the other hand, given the noise level of MA experiments, correlation could appear by chance, so you need a stringent cut-off. It is already interesting that you found some correlation structure, and it is worth looking at these cliques, even though isolated nodes remain. So, here are my 50ct; they are just ideas.

Use updated or optimized annotation (chip definition) files, as in ffcccc's answer. I like the idea, but it might only work for Affy chips and similar platforms where you have the definition files.
The previous point implies re-running the analysis on the raw data, including normalization and summarization.
Try a more robust correlation coefficient, e.g. Kendall or Spearman rank correlation (though this might yield even worse scores).
Play with the correlation cutoff. How far do you have to lower it to reduce the number of isolated graphs? Maybe it only needs to be a little below 0.8.
Do you see highly connected cliques?
For isolated graphs, you can compute centroid/medoid vectors and compare them to the centroid vectors of other sub-graphs to see what is typical for their correlation pattern. Try to establish a link to other sub-graphs by means of the centroid/medoid vectors rather than the individual expression patterns.
Work only with those genes that are called significant in each study, or that are significant after e.g. applying limma to all 5 datasets combined.
Apply other methods, e.g. GSEA, GO analysis, cluster analysis, etc.
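Two of the ideas above, switching to a rank correlation and lowering the cutoff step by step, can be tried together with something like the sketch below. It is only an illustration in plain Python under stated assumptions: the toy expression values, the gene names, and the helpers `ranks`, `spearman`, `count_components`, and `sweep` are made up for this example. Isolated genes are counted as components of size one.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def count_components(genes, edges):
    """Number of connected components (union-find); isolated genes count too."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(g) for g in genes})


def sweep(expr, cutoffs):
    """Map each cutoff to the number of components of the thresholded network."""
    genes = sorted(expr)
    corr = {(a, b): spearman(expr[a], expr[b]) for a, b in combinations(genes, 2)}
    return {
        c: count_components(genes, [pair for pair, r in corr.items() if r >= c])
        for c in cutoffs
    }


# Toy data (hypothetical values).
toy = {
    "G1": [1, 2, 3, 4, 5],
    "G2": [1, 2, 3, 5, 4],
    "G3": [2, 1, 3, 5, 4],
    "G4": [5, 4, 3, 2, 1],
}
component_counts = sweep(toy, [0.95, 0.85, 0.75])
```

Plotting the component count against the cutoff for the real data would show whether the network only becomes connected at cutoffs that are too low to defend, or whether a small reduction below 0.8 is already enough.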

Hope this helps, someone might come up with better ideas in the light of the biological question you are addressing.

score 1 · Answer 2 · 2011-08-29

Try varying the correlation cut-off. You don't actually say why you picked 0.8, so I assume it was arbitrary. When I do this analysis, I typically set a correlation threshold at a 5% genome-wide error rate (GWER) using a permutation approach (see Churchill and Doerge, Genetics 1994). This value will vary with the size of your dataset; with larger datasets (e.g. N>80) it will usually be much lower than 0.8. If your datasets are too small (e.g. N<30), this approach is unlikely to work. At lower thresholds you are more likely to see overlap due to common biological functions. Just remember that whatever cut-off you choose will have to be justified to your reader.
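A permutation threshold of this kind could be computed along the following lines. This is a hedged sketch of one possible adaptation, not necessarily the exact procedure the answer has in mind: each gene's values are shuffled across samples independently to destroy real gene-gene correlation, the largest absolute pairwise correlation is recorded per permutation, and the (1 - alpha) quantile of those maxima is used as the cutoff. The function names, the use of absolute correlation, and the simulated noise data are all assumptions.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def permutation_threshold(expr, n_perm=100, alpha=0.05, seed=0):
    """Correlation cutoff controlling the dataset-wide error rate at alpha.

    For each permutation, every gene's values are shuffled independently,
    and the maximum |r| over all gene pairs is recorded. The returned
    value is the (1 - alpha) quantile of these null maxima.
    """
    rng = random.Random(seed)
    genes = sorted(expr)
    null_maxima = []
    for _ in range(n_perm):
        shuffled = {g: rng.sample(expr[g], len(expr[g])) for g in genes}
        null_maxima.append(
            max(abs(pearson(shuffled[a], shuffled[b])) for a, b in combinations(genes, 2))
        )
    null_maxima.sort()
    index = min(len(null_maxima) - 1, int((1 - alpha) * len(null_maxima)))
    return null_maxima[index]


def make_noise(n_genes, n_samples, seed):
    """Simulated expression data with no real correlation (illustration only)."""
    rng = random.Random(seed)
    return {
        f"G{i}": [rng.gauss(0, 1) for _ in range(n_samples)] for i in range(n_genes)
    }


# Same number of genes, different numbers of samples.
threshold_small = permutation_threshold(make_noise(20, 10, seed=1))
threshold_large = permutation_threshold(make_noise(20, 80, seed=2))
```

Comparing `threshold_small` with `threshold_large` illustrates the point about sample size: with few samples, high correlations arise by chance, so the permutation cutoff ends up much higher than with many samples.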

Ram · Answer 3 · 2011-08-29

Bicciato et al. worked directly on chip definition files to get the best results when pooling data. I think the work can be extended to different platforms.
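Custom chip definition files are beyond a short example, but a much simpler stand-in for the same goal, putting different platforms on a common gene-level footing, is sketched below. It is not the method of Bicciato et al.; it merely maps platform-specific probe IDs to gene identifiers, averages probes that hit the same gene, and intersects the gene sets. The probe IDs, gene names, mapping tables, and the `collapse_to_genes` helper are all hypothetical.

```python
from statistics import mean


def collapse_to_genes(probe_expr, probe_to_gene):
    """Average all probes that map to the same gene, sample by sample.

    probe_expr maps a probe ID to its expression values across samples;
    probe_to_gene maps a probe ID to a gene identifier. Probes without
    a mapping are dropped.
    """
    grouped = {}
    for probe, values in probe_expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        grouped.setdefault(gene, []).append(values)
    return {
        gene: [mean(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Hypothetical platform A: two probes for GENE1, one for GENE2, one unmapped.
platform_a = {
    "A_001": [1.0, 2.0, 3.0],
    "A_002": [3.0, 4.0, 5.0],
    "A_003": [7.0, 8.0, 9.0],
    "A_004": [0.5, 0.5, 0.5],
}
map_a = {"A_001": "GENE1", "A_002": "GENE1", "A_003": "GENE2"}

# Hypothetical platform B: one probe each for GENE1 and GENE3.
platform_b = {"B_10": [2.0, 2.5], "B_11": [6.0, 6.5]}
map_b = {"B_10": "GENE1", "B_11": "GENE3"}

genes_a = collapse_to_genes(platform_a, map_a)
genes_b = collapse_to_genes(platform_b, map_b)

# Genes measured on both platforms; coexpression can then be computed
# per dataset on this shared set only.
shared_genes = sorted(set(genes_a) & set(genes_b))
```

Restricting each dataset to `shared_genes` before building the coexpression matrices at least guarantees that missing overlap in the network is not simply due to genes being absent from one platform.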

Quote from ... Finding Gene Coexpression In Geo? :

...I was impressed by a presentation of S.Bicciato's work on GEO data and I suggest you to look at "Novel definition files for human GeneChips based on GeneAnnot"[PMID:18005434] and "Strategies for comparing gene expression profiles from different microarray platforms: application to a case-control experiment"[PMID:16624241] as starting point if you want to go this way. Otherwise you could turn to a meta-analysis approach, avoiding the bias of merging data, but I imagine this could open a new thread in the blog..

and also :

The RNAnet tool does this.

...maybe only Affymetrix HG133 data ? Please let us know your experience.