I have a gene-gene correlation matrix of about 20,000 genes by 20,000 genes. I am trying to generate a heatmap similar to how JMP can create plots like this:
Image from Sunil Archak (https://sscnars.icar.gov.in/Genetics/11-%20jmp_exp.pdf)
I want to see how the resulting data will look. Unfortunately, as I was typing this, RStudio aborted the session while it was trying to generate the heatmap. Is this too much for it to handle?
Any ideas?
Alternatively, I was advised to plot genes by principal components. I may be leaning towards this approach now... unless someone knows a solution to this. I do have a decent number of cores (16) and a decent amount of RAM (128 GB).
Any help would be appreciated.
Very Respectfully, Pratik
If you have access to a cluster, it might help to use that and generate a file output instead of interactive graphic output.
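As a minimal sketch of the "file output instead of interactive graphic output" idea, the snippet below writes a correlation heatmap straight to a PNG file using only base R. The small random matrix, the variable names, and the output path are assumptions for illustration, standing in for the real 20,000 x 20,000 matrix; it is not a tuned solution for data of that size.

```r
# Small random stand-in for the real gene-gene correlation matrix (hypothetical data).
set.seed(1)
n_genes <- 200
expr <- matrix(rnorm(n_genes * 30), nrow = n_genes)  # genes x samples
cor_mat <- cor(t(expr))                              # n_genes x n_genes correlations

# Hypothetical output path; replace with a real location on the server.
out_file <- file.path(tempdir(), "cor_heatmap.png")

# Open a PNG file device so nothing is drawn in an interactive window.
png(out_file, width = 2000, height = 2000, res = 150)

# useRaster = TRUE draws the matrix as a single bitmap rather than
# one rectangle per cell, which matters for very large matrices.
image(cor_mat,
      zlim = c(-1, 1),
      col = colorRampPalette(c("blue", "white", "red"))(101),
      axes = FALSE,
      useRaster = TRUE)

# Close the device so the file is actually written to disk.
dev.off()
```

Saved as a script, this can be run non-interactively with `Rscript`, so no graphics window or IDE is involved.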
Thank you for responding, sir. Let me try this...
Sir, do you know if there is a way to parallelize the process, though? Computing power is not the issue here; the issue is that R and the R packages I am using are not using the full computing power. Nearly the whole server I am using is idle... If there is some way to use all of the processors, this would speed up the process substantially for me.
Very Respectfully, Pratik
Unfortunately, no. If the package is not designed to use multiple cores, there's not much you can do about it.
Maybe try standard R on the command line? RStudio uses unnecessary extra resources.
pvclust can do clustering in a parallelised fashion, but it does not generate heatmaps.
If you want, start a new session on command-line and show the output of
sessionInfo()
Thank you for your response, sir. I understand now why @_r_am was suggesting to access a cluster for this. This is a huge job even for my personal server, I think? I'm asking the machine to plot a 20,000 genes x 20,000 genes matrix with colors associated with different levels of correlation. The matrix alone is 2.7 GB.
Do you think hierarchical clustering using pvclust and then using the generated clusters to plot a heatmap/dendrogram will speed up the computation, or will hierarchical clustering be yet another layer of data on top of the gene-gene correlations?
EDIT: I'm pretty sure this is what I wanted to do from the beginning; I was just not sure how. This was also the suggestion I got from a mentor: cluster first and then plot a dendrogram-heatmap. But I'm wondering how computationally tractable this will be.
I guess I could set it up either way, maybe? Genes by clusters if I want to, or genes by genes with cluster labeling? The latter would probably be more computationally heavy. (I think that these types of dendrogram-heatmaps only work on a square matrix, so the latter may be required for this?)
The next challenge, I think, would be figuring out how to take the cluster assignments from pvclust, group one of the axes of the 20,000 genes by 20,000 genes matrix by those cluster assignments, and then use a function like heatmap.2 (from the gplots package) to create a beautiful plot like this:
Image obtained from: https://doi.org/10.3389/fnhum.2015.00440
Please correct me if I'm wrong on this! Or if you have any pointers on this process, please.
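The grouping step described above (take cluster assignments, then order the gene axes by them) can be sketched in base R as follows. This is a sketch under assumptions, not the pvclust workflow itself: it swaps in plain `hclust` with 1 - correlation as the dissimilarity, and the toy data, the average linkage, and the number of clusters `k` are all hypothetical choices. As far as I know, a pvclust result also carries an ordinary `hclust` object in its `$hclust` component, so the same `cutree` step should apply there.

```r
# Toy stand-in for the real data (hypothetical): 100 genes x 20 samples.
set.seed(1)
expr <- matrix(rnorm(100 * 20), nrow = 100,
               dimnames = list(paste0("gene", 1:100), NULL))
cor_mat <- cor(t(expr))                 # gene-gene correlation matrix

# Use 1 - correlation as the dissimilarity between genes.
d <- as.dist(1 - cor_mat)
hc <- hclust(d, method = "average")     # linkage method is an assumption

# Cut the tree into k clusters: a named integer vector, gene -> cluster.
k <- 5                                  # hypothetical number of clusters
clusters <- cutree(hc, k = k)

# Reorder BOTH axes by the dendrogram order, so the matrix stays square
# (genes x genes) and genes from the same cluster sit next to each other.
ord <- hc$order
cor_sorted <- cor_mat[ord, ord]
cluster_sorted <- clusters[ord]         # cluster label for each reordered gene

table(clusters)                         # cluster sizes
```

The reordered matrix `cor_sorted` and the labels in `cluster_sorted` are what a heatmap function would need: the matrix for the colors and the labels for a side annotation bar.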
Here is my sessionInfo():
I know it's not a purely command-line server (using Ubuntu desktop), but it does have some decent computing power.
Thank you again @_r_am and @Kevin Blighe.
Very Respectfully, Pratik
See if you can do that with ComplexHeatmap. It's quite extensible and might allow for the clustering and dendrograms to be precomputed, or you should at least be able to precompute the row and column orders, which separates the time spent on clustering from the time spent on graphical rendering.
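A minimal sketch of that separation, assuming the ComplexHeatmap package from Bioconductor is installed: the clustering is computed once with base R `hclust`, and the resulting object is handed to `Heatmap()` through `cluster_rows` and `cluster_columns`, so drawing does not recompute it. The toy data, the linkage method, and the output path are assumptions for illustration.

```r
# Toy stand-in for the real correlation matrix (hypothetical data).
set.seed(1)
expr <- matrix(rnorm(100 * 20), nrow = 100)
cor_mat <- cor(t(expr))

# Step 1: clustering, computed once. The result could be stored with
# saveRDS() and reloaded later so that plotting never repeats this step.
hc <- hclust(as.dist(1 - cor_mat), method = "average")

# Step 2: rendering, using the precomputed tree for both axes.
if (requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  ht <- ComplexHeatmap::Heatmap(
    cor_mat,
    name = "correlation",
    cluster_rows = hc,           # reuse the precomputed hclust object
    cluster_columns = hc,        # same tree, since the matrix is symmetric
    show_row_names = FALSE,      # thousands of gene labels would be unreadable
    show_column_names = FALSE,
    use_raster = TRUE            # rasterize the heatmap body
  )

  # Hypothetical output path; write to a file rather than the screen.
  out_file <- file.path(tempdir(), "cor_heatmap_clustered.png")
  png(out_file, width = 2000, height = 2000, res = 150)
  ComplexHeatmap::draw(ht)
  dev.off()
}
```

With this split, the clustering step can be timed and tuned on its own, and the plot can be redrawn with different colors or sizes without clustering again.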
Thank you very much sir! I will try this!
Very Respectfully, Pratik