I'm working on an analysis that includes 450k methylation data. There are so many probes that analysing the whole data set is becoming a problem in terms of time and memory. I'm sure that nearby methylation sites are highly correlated, so is there some kind of informative subset of the whole probe set I could use to reduce computational costs without losing too much information? I'm aware that it's possible to do this myself, for example by clustering correlated probes, but I was hoping it had been done already.
I have 389 samples. I agree that it's best to use as much information as possible, but I'm already running on a cluster and still having memory issues.