Probably a newbie question, but I want to know whether it makes sense to do batch effect correction (via SVA's ComBat) using only parts of an RNA-seq expression matrix. The problem is that the matrix I'm working with is too big to load into my R workspace as a whole, so I was thinking of partitioning it into gene-based subsets. I would simply split the file into chunks with a manageable number of rows and process them one by one.
Since every sample appears in every partition, I would be able to use the same adjustment variables matrix and batch ID vector for all partitions. I have never done batch effect correction before, so I don't know whether not using the whole matrix renders the method pointless. My guess is no: from what I understood of the method's paper (to be honest, I didn't get the fine details), the expression of one gene doesn't interact with the expression of another during the calculations, so it shouldn't be essential to process all genes in one go. However, I definitely need someone else's advice to be sure I'm not messing up here.
I'm not worried about downstream analysis, since I can do what I want to do by parts. My PC has 8 GB of RAM and the matrix is a bit over 11 GB. In fact, I've already done several things to the matrix by parts successfully, but since I don't fully understand this method I'm not sure whether I can make it work.
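For illustration only, the gene-wise partitioning described above could be sketched roughly as follows. This is Python rather than R, and the file layout (tab-separated, gene ID in the first column, one column per sample) and the name `iter_gene_chunks` are assumptions made for the example, not details of the actual data.

```python
import csv
import io
from itertools import islice

def iter_gene_chunks(handle, rows_per_chunk):
    """Yield (header, rows) pairs, each holding at most rows_per_chunk genes."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # sample IDs; the same header is reused for every chunk
    while True:
        rows = list(islice(reader, rows_per_chunk))
        if not rows:
            break
        yield header, rows

# Tiny in-memory stand-in for the real file (hypothetical layout:
# gene ID first, then one expression value per sample).
demo = io.StringIO(
    "gene\ts1\ts2\ts3\n"
    "g1\t5\t7\t6\n"
    "g2\t0\t1\t3\n"
    "g3\t9\t8\t9\n"
)

chunks = list(iter_gene_chunks(demo, rows_per_chunk=2))
for header, rows in chunks:
    # Every chunk carries all samples but only a subset of genes.
    print(header[1:], [row[0] for row in rows])
```

Because the header is repeated for each chunk, the same batch vector and covariate matrix line up with every partition, which is the property the question relies on.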
I see... that is a large dataset indeed.
Whether you are searching for surrogate variables via SVA or directly adjusting for a batch variable via ComBat, doing it in 'chunks' will not work as you hope, I believe. In the case of ComBat, it will introduce yet another extraneous / surrogate variable into your data, represented by the very 'chunks' that you used to break up the dataset, i.e., 'chunk' will become your new batch variable, while the original batch variable will have been corrected for within each individual chunk. This may actually be sufficient for what you want to do, depending on whatever it is you're doing...
That sounds reasonable. However, taking another look at the paper, I notice in step one of the algorithm (standardize data) that maybe the full matrix of genes is expected. (talking about this paper)
Yes, that relates to why I said that doing it your way will just introduce another [new] batch effect relating to the chunks that you have chosen. It's not ideal and will come under criticism if you try to publish.
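To make the dependence on the set of genes concrete, here is a deliberately simplified Python sketch. It is not sva's implementation: it standardizes each gene with a plain mean and standard deviation (no covariates, no model fit) and computes only the prior mean of the batch effect, assuming, as I read the ComBat paper, that the empirical Bayes priors are estimated by pooling the per-gene batch estimates across all genes supplied. The data and function names are made up for the example.

```python
from statistics import mean, pstdev

def standardize(row):
    """Center and scale one gene; uses only that gene's own values."""
    m, s = mean(row), pstdev(row)
    return [(x - m) / s for x in row]

def batch_mean(row, batch, label):
    """Per-gene estimate of the additive effect of one batch."""
    z = standardize(row)
    return mean(v for v, b in zip(z, batch) if b == label)

def prior_mean(matrix, batch, label):
    """Prior mean for one batch: pooled over every gene in the matrix given."""
    return mean(batch_mean(row, batch, label) for row in matrix)

batch = ["A", "A", "B", "B"]
matrix = [  # toy values: rows are genes, columns are samples
    [1.0, 2.0, 5.0, 6.0],
    [3.0, 3.5, 3.0, 4.5],
    [9.0, 8.0, 2.0, 1.0],
    [4.0, 6.0, 5.0, 7.0],
]
chunk_1, chunk_2 = matrix[:2], matrix[2:]

# Per-gene quantities do not care how the genes are partitioned.
per_gene_full = [batch_mean(row, batch, "A") for row in matrix]
per_gene_chunked = [batch_mean(row, batch, "A")
                    for chunk in (chunk_1, chunk_2) for row in chunk]
print(per_gene_full == per_gene_chunked)

# The pooled prior does: each chunk gets its own value.
full_prior = prior_mean(matrix, batch, "A")
chunk_priors = [prior_mean(chunk_1, batch, "A"), prior_mean(chunk_2, batch, "A")]
print(full_prior, chunk_priors)
```

In this sketch the gene-wise standardization is unaffected by chunking, but the pooled prior differs from chunk to chunk. Since the per-gene batch estimates are shrunk toward that prior, genes in different chunks would be adjusted toward different targets, which is one way a chunk-specific effect can creep in.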
I would honestly try to just use the entire matrix. Can you not use a local HPC or just rent an Amazon instance for a few hours?
Yeah, I will discuss options with my colleagues. I will have access to a PC with 16 GB of RAM in February, but that would mean delaying the work.
Hi Mike,
If you are planning to rent an Amazon instance to run R, here is a guide that helped me a lot: link
An Amazon instance with 16 GB of RAM is really cheap. You can find the price here: link