I have a fairly complex microarray experiment for which I think limma might be a good approach, but I’m having some difficulty figuring out how to define the problem in limma. I was hoping to get some advice on whether limma is actually appropriate or whether I should be looking at a different approach.
- The data consist of microarrays probing ~20,000 genes.
- The data come from 2 separate experiments that were batch corrected using ComBat.
- Each patient was measured repeatedly at different timepoints: before treatment, after treatment, and (for one of the batches only) during an additional, intermediate treatment step. This intermediate step is predicted to affect the after-treatment gene expression.
- Some samples are missing, so not every patient has all of the 2 or 3 timepoints that were originally measured.
- Patients belong to 1 of 2 classifications prior to treatment.
- Patients are classified as good or poor outcome (eventually we may assign multiple levels, but for now it is a binary classification).
So:
- Batches: A, B
- Timepoints: 1 (A & B), 2 (B), 3 (A & B)
- Class: C, D
- Outcome: G, P
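To make the layout concrete, here is a minimal sketch in plain Python (not limma code) that enumerates the groups that can actually occur given the factors above. The names `TIMEPOINTS_BY_BATCH` and `design_cells` are placeholders of my own, and the sketch assumes every class and outcome is present in both batches, which I have not verified sample by sample.

```python
from itertools import product

# Factor levels as listed above; timepoint 2 was only measured in batch B.
TIMEPOINTS_BY_BATCH = {"A": [1, 3], "B": [1, 2, 3]}
CLASSES = ["C", "D"]
OUTCOMES = ["G", "P"]


def design_cells():
    """Return every (batch, timepoint, class, outcome) combination that can occur."""
    cells = []
    for batch, timepoints in TIMEPOINTS_BY_BATCH.items():
        for timepoint, cls, outcome in product(timepoints, CLASSES, OUTCOMES):
            cells.append((batch, timepoint, cls, outcome))
    return cells


cells = design_cells()
print(len(cells), "possible groups")
for cell in cells:
    # Print a label such as "B.2.C.G" for each group.
    print(".".join(str(level) for level in cell))
```

This gives 8 groups for batch A and 12 for batch B, and it shows that there is no batch A group at timepoint 2, so anything involving timepoint 2 can only be estimated within batch B.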
In terms of the comparisons we’re most interested in:
- 1 vs 2, 1 vs 3 (maybe 2 vs 3)
- C vs D at each timepoint (how they differ from each other and change over time)
- A vs B @ timepoint 3 (as this defines the effect of the additional treatment at timepoint 2); timepoint 1 could serve as a control for the batch correction
- G vs P @ any timepoint, but 1 being the most useful for prediction
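As far as I understand it, each of these comparisons would come down to a set of weights over per-group means. The following is only a sketch of my current thinking in plain Python, not limma's API. The group names (`A.1`, `B.3`, and so on) and the `contrast` helper are hypothetical, class and outcome are ignored for brevity, and I have arbitrarily taken each difference as later minus earlier and B minus A.

```python
# Placeholder coefficient names: one mean per batch/timepoint group.
GROUPS = ["A.1", "A.3", "B.1", "B.2", "B.3"]


def contrast(weights):
    """Build a weight vector over GROUPS; groups not mentioned get weight 0."""
    unknown = set(weights) - set(GROUPS)
    if unknown:
        raise ValueError(f"unknown groups: {sorted(unknown)}")
    vector = [weights.get(group, 0) for group in GROUPS]
    if sum(vector) != 0:
        raise ValueError("contrast weights must sum to zero")
    return vector


CONTRASTS = {
    # Timepoint comparisons within batch B, the only batch with timepoint 2.
    "B: 2 vs 1": contrast({"B.2": 1, "B.1": -1}),
    "B: 3 vs 1": contrast({"B.3": 1, "B.1": -1}),
    "B: 3 vs 2": contrast({"B.3": 1, "B.2": -1}),
    # Batch difference at timepoint 3.
    "B vs A at 3": contrast({"B.3": 1, "A.3": -1}),
    # Same difference with timepoint 1 as the control:
    # (B.3 - B.1) - (A.3 - A.1)
    "B vs A at 3, relative to 1": contrast(
        {"B.3": 1, "B.1": -1, "A.3": -1, "A.1": 1}
    ),
}

for name, vector in CONTRASTS.items():
    print(name, dict(zip(GROUPS, vector)))
```

The last contrast is how I imagine using timepoint 1 as a control: any batch difference that is already present before treatment is subtracted from the difference seen at timepoint 3.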
I can also imagine other potential interactions that may be interesting:
- Class vs Outcome
- Batch vs Outcome
- Batch vs Class
I do understand that batch and effect of treatment at timepoint 2 are confounded, but I don’t think there’s anything I can do about this other than propose a second validation study for any significant differences we find.
A few specific questions:
- Should I define every possible contrast and interaction at the start in case any turn out to be of interest, or is it acceptable to do a broader exploratory pass and then add contrasts and interactions for those factors that show significance?
- Does doing this in limma correct for the multiple contrast/interaction comparisons (I understand I need to do a correction across the 20,000 genes, but I’m not clear on if/how to correct across contrasts/interactions)?
- Can limma handle repeated measures and missing values (in one case all of timepoint 2 is missing from one of the batches)?
If limma is not appropriate, can you suggest a more appropriate method?
I would not be offended by any help in constructing the matrices, but I don’t expect anyone to do the work for me. If I know that what I’m trying to do is feasible, then I can figure it out.
Thanks!