Hi All,
I am currently working on clustering microarray data to find tumor subtypes. My data come from multiple GEO studies and are all based on the Affymetrix U133 Plus 2.0 array. All samples have been log2-transformed and RMA normalized (on a per-study basis). For the analysis, I have come up with the following workflow:
1. Combine all arrays (tumors) into one file.
2. Define batch effects.
3. Remove batch effects using (a) pamr and (b) sva.
   Q: Is it ok to apply these batch correction procedures to log2 data? Or shall I delogarithmize the data beforehand?
4. Delogarithmize the data.
   Q: Do you think that it would be better not to delogarithmize the data before standardization?
5. Standardize the data using R (standardize rows, that is, genes).
6. Cluster all tumors using ConsensusCluster (use k-means with Euclidean distance and SOM).
7. Select genes whose expression profile differs between the classes found as a result of the clustering (genes that pass a t-test p-value of 0.000001).
   Q: Is it ok to use log2-transformed, RMA normalised and batch corrected data for the t-test (do not standardize)?
What flaws do you see in this workflow?
Best regards,
Marcin.
I'd recommend co-normalizing all arrays together rather than using the per-study normalization as is. At step (6) you'll need to decide which genes to use in the clustering. Using all genes on the HGU133Plus2 is generally not a good idea, as many probesets will contribute mainly noise. Instead, use a subset of genes that reflect biological distinctions, defined by (for example) a variance filter or an Absent/Present call filter. Euclidean distance on standardized log-scale data would be OK.
It is OK to apply batch corrections to log2-scale data. It would also be OK to run within-gene t-tests or the like on log2-transformed, RMA normalized and batch corrected data. If you have more than 2 putative tumor classes, I assume you would run (for example) an ANOVA, not a t-test, at step (7) to identify genes that differ among the putative subtypes.
If you are attempting to build a predictor of putative tumor subtypes, set aside a portion of your samples to serve as a test set, in order to evaluate your classifier.
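To make the suggestions above concrete, here is a minimal sketch of three of them: a variance filter, per-gene standardization on the log2 scale, and a one-way ANOVA F statistic per gene. It is written in plain Python only so that it is self-contained; in practice you would do this in R/Bioconductor. The probe names, expression values, cluster labels and function names (`variance_filter`, `standardize_rows`, `anova_f`) are all made up for illustration, and the F statistic is left unconverted to a p-value because that needs an F-distribution routine outside the standard library.

```python
from statistics import mean, stdev, variance

# Hypothetical toy data: probeset -> log2 expression across six tumors.
expr = {
    "probe_a": [5.0, 5.1, 4.9, 5.0, 5.1, 4.9],     # nearly flat, mostly noise
    "probe_b": [4.0, 4.2, 8.0, 8.1, 12.0, 11.9],   # differs between clusters
    "probe_c": [6.0, 9.0, 6.5, 9.5, 6.2, 9.1],     # variable, unrelated to clusters
}
# Hypothetical cluster labels, as would come out of the clustering step.
labels = ["A", "A", "B", "B", "C", "C"]


def variance_filter(data, keep):
    """Keep the `keep` probesets with the largest variance on the log2 scale."""
    ranked = sorted(data, key=lambda g: variance(data[g]), reverse=True)
    return {g: data[g] for g in ranked[:keep]}


def standardize_rows(data):
    """Center each gene to mean 0 and scale it to sample SD 1."""
    out = {}
    for gene, vals in data.items():
        m, s = mean(vals), stdev(vals)
        out[gene] = [(v - m) / s if s > 0 else 0.0 for v in vals]
    return out


def anova_f(values, groups):
    """One-way ANOVA F statistic for one gene across the given group labels."""
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    n, k = len(values), len(by_group)
    grand = mean(values)
    ss_between = sum(len(x) * (mean(x) - grand) ** 2 for x in by_group.values())
    ss_within = sum((v - mean(x)) ** 2 for x in by_group.values() for v in x)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


filtered = variance_filter(expr, keep=2)       # input to clustering, after standardizing
standardized = standardize_rows(filtered)      # what Euclidean distance would be run on
f_stats = {g: anova_f(v, labels) for g, v in filtered.items()}  # on log2 data, not standardized

for gene, f in sorted(f_stats.items(), key=lambda kv: kv[1], reverse=True):
    print(gene, round(f, 2))
```

Note that the filter and the ANOVA both operate on the log2 values, while only the input to the clustering is standardized.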
Thank you for the valuable insight, Ahill. I have one more question: would it also be OK to run these batch correction procedures (pamr, sva) on delogarithmized data?