So I'm currently interested in playing with inferring gene regulatory networks from microarray data. I've downloaded a longitudinal study from GEO and I'm playing with various available BioC packages out there. I was wondering what kind of pre-processing people use/recommend for this kind of task. I've done lots of differential expression kind of things, but have no experience with this.
Is there a different set of protocols / methods for this? Which normalisation method would people recommend?
UPDATE:
So Will kind of answered this a while ago. I was wondering if anyone had any more suggestions: for example, in this paper (biomedcentral.com/1752-0509/1/37) they remove probes with a maximum intensity value < 5. All of these thresholds seem arbitrary to me and a bit rubbish. Any more comments / ideas gratefully received. I'm not sure of the proper etiquette to re-open a question.
Your comment on minimum gene expression cut-offs bears some further thought. Weakly expressed probes on any microarray platform are less reliable overall than strongly expressed probes. These probes tend to show much greater variance and are less likely, overall, to provide true signal. It is desirable to reduce the influence of probes that have a low prior probability of being informative, particularly when using a statistical framework that penalizes you for the number of tests you perform. A separate but interlinked issue is that many inference techniques (e.g. Bayesian network techniques) become incredibly expensive computationally as the number of nodes becomes non-trivial.
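To make the multiple-testing point concrete, here is a minimal sketch assuming a Bonferroni correction (one common penalty, picked purely for illustration; your framework may use something else such as FDR). The probe counts are made-up numbers, not taken from any particular array.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Hypothetical probe counts before and after dropping weakly expressed probes.
for n_probes in (50_000, 10_000):
    cutoff = bonferroni_threshold(0.05, n_probes)
    print(f"{n_probes} probes -> per-probe p-value cutoff {cutoff:.1e}")
```

Fewer probes tested means a less stringent per-probe cutoff, which is where the gain in power comes from.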
Although some normalization methods provided by the manufacturers provide a means of calling "present/absent", with RMA you do not get this sort of call. As a practical solution, I often look at the normalized expression level of negative controls for guidance as to what is likely to be an uninformative probe. These probes may represent genes that are truly expressed, genes you'd see with a more sensitive technique such as RT-PCR, but with the microarray you have to be pragmatic and rule them out because they mostly 1) contribute to false positives that cannot be replicated and 2) in bulk, greatly reduce your statistical power. I often use a simple heuristic that rules in probes where either mean expression across all samples is above X, or some number of samples is above a higher value Y. This avoids excluding probes where most samples are at background but a few samples have high expression, as these may be biologically very interesting. The outcome may be an arbitrary-looking threshold, but in a pragmatic field this is not a disqualifying feature so long as the threshold is sensible and the authors communicate how it was arrived at.
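The X / Y heuristic above could be sketched roughly as follows. This is only an illustration in plain Python: the function names, the parameter names (`mean_cutoff` for X, `high_cutoff` for Y, `min_high_samples` for "some number of samples") and the toy expression values are all made up here, and in practice the cutoffs would come from something like the negative-control levels on your own arrays.

```python
from statistics import mean


def keep_probe(values, mean_cutoff, high_cutoff, min_high_samples):
    """Rule a probe in if either condition holds.

    values: normalized (e.g. log2) expression of one probe across all samples.
    mean_cutoff: X, the cutoff on the mean across samples.
    high_cutoff: Y, a higher cutoff applied to individual samples.
    min_high_samples: how many samples must exceed Y.
    """
    if mean(values) > mean_cutoff:
        return True
    n_high = sum(1 for v in values if v > high_cutoff)
    return n_high >= min_high_samples


def filter_probes(expr, mean_cutoff, high_cutoff, min_high_samples):
    """Return only the probes that pass keep_probe()."""
    return {
        probe: values
        for probe, values in expr.items()
        if keep_probe(values, mean_cutoff, high_cutoff, min_high_samples)
    }


# Toy data (hypothetical): probe id -> expression across six samples.
expr = {
    "probe_a": [9.1, 8.7, 9.4, 9.0, 8.8, 9.2],   # expressed everywhere
    "probe_b": [3.1, 3.0, 3.2, 3.1, 10.5, 9.8],  # background, high in two samples
    "probe_c": [3.0, 3.2, 2.9, 3.1, 3.0, 3.3],   # background everywhere
}

kept = filter_probes(expr, mean_cutoff=6.0, high_cutoff=8.0, min_high_samples=2)
print(sorted(kept))
```

Note that `probe_b` fails the mean rule but is rescued by the second rule, which is the whole point of having two conditions rather than a single cutoff on the mean or the maximum.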
I agree with Will. There's no evidence that I'm aware of that the method of normalisation affects the ability to later do network inference. I guess it's not impossible that differences in the resulting expression distributions between normalisation methods could affect the power of the inference. Why don't you try it and see (and let us know here!)?
I've heard people say that different normalisation methods cause problems with the dynamic range of the expression values obtained. I've heard conflicting reports: that MAS5 is better than RMA, and vice versa.
Oh, and as an aside, true GRN inference is currently as good as impossible with networks of any reasonable size (with underlying models that actually bear any resemblance to real biological networks). Most extant work is with dozens of nodes at most. This is partly a parameterisation and/or computational power issue, as the frameworks for integrative dynamic belief networks exist. It is also a function of very sparse (e.g. Transfac/Jaspar) and noisy (e.g. ChIP-Seq) transcription factor binding data.
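One way to get a feel for the computational side, assuming the structure being learned is a directed acyclic graph as in a Bayesian network: simply count how many candidate structures exist for a given number of genes. The sketch below uses Robinson's recurrence for the number of labelled DAGs; it says nothing about any particular inference package, it just shows how fast the search space grows.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labelled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


# Print the size of the structure space for a few small networks.
for n_genes in (2, 3, 4, 5, 10, 20):
    print(n_genes, count_dags(n_genes))
```

Four nodes already give 543 possible structures and five give 29281, so exhaustive search is out of the question well before you reach "dozens of nodes", and heuristic search over that space needs a lot of data to be trustworthy.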