I'm not sure it'd make sense to tidy a matrix of that size for use with ggplot2. Below is how to read a simple file produced by computeMatrix into R (the example file I'm using can be found in the deepTools source code under deeptools/test/test_data/):
m = read.delim("computeMatrixOperations.mat.gz", skip=2, header=F)  # skip the header line(s)
m = as.matrix(m[,-c(1:6)])  # drop the region coordinate/metadata columns, keeping only the signal values
The resulting matrix is exactly what's plotted by plotHeatmap, though you can also use pheatmap() if you really want:
pheatmap(m, cluster_rows=F, cluster_cols=F) # This is a useless heatmap, I just use it for testing
Note that there are likely NAs in the data, which can cause a lot of things to have issues during clustering.
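If you do want to cluster, one simple work-around (just a sketch, assuming the m object from above) is to keep only the rows without any NAs:

m_complete = m[complete.cases(m), ]  # drop regions containing NAs
pheatmap(m_complete, cluster_rows=T, cluster_cols=F)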
One big caveat to all of this is that you don't see delineations by region groups and samples. That information is stored in the header:
> h = scan("computeMatrixOperations.mat.gz", n=1, sep="\n", what=character())
Read 1 item
> h
[1] "@{\"verbose\":true,\"scale\":1,\"skip zeros\":false,\"nan after end\":false,\"sort using\":\"mean\",\"unscaled 5 prime\":0,\"body\":1000,\"sample_labels\":[\"SRR648667.forward\",\"SRR648668.forward\",\"SRR648669.forward\",\"SRR648670.forward\",\"SRR648667.reverse\",\"SRR648668.reverse\",\"SRR648669.reverse\",\"SRR648670.reverse\"],\"downstream\":0,\"unscaled 3 prime\":0,\"group_labels\":[\"genes\"],\"bin size\":10,\"upstream\":0,\"group_boundaries\":[0,196],\"sample_boundaries\":[0,100,200,300,400,500,600,700,800],\"missing data as zero\":false,\"ref point\":null,\"min threshold\":null,\"sort regions\":\"no\",\"proc number\":20,\"bin avg type\":\"mean\",\"max threshold\":null}"
That's a JSON string, which I don't think base R has a function to parse. The important part is that sample_boundaries denotes the beginning and end of the columns belonging to each sample (the labels are in sample_labels), and group_boundaries/group_labels does the same for the rows of each group of regions. So you can subset the matrix in the normal way in R accordingly.
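If you'd rather extract those boundaries programmatically, here is a minimal sketch using the jsonlite package (not base R, so it needs to be installed separately); it assumes the h and m objects from above:

library(jsonlite)

hdr = fromJSON(sub("^@", "", h))  # strip the leading "@" and parse the JSON header

# Columns belonging to the first sample (the boundaries are 0-based, so shift by 1)
sb = hdr$sample_boundaries
first_sample = m[, (sb[1] + 1):sb[2]]
hdr$sample_labels[1]  # "SRR648667.forward"

# Rows belonging to the first group of regions
gb = hdr$group_boundaries
first_group = m[(gb[1] + 1):gb[2], ]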
For computing the equivalent of plotProfile's output, you can simply apply(m, 2, function(x) mean(x, na.rm=TRUE)) (be sure to handle NA values!).
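To go from that vector to one profile per sample, a rough sketch (assuming m and the hdr object from the jsonlite example above) could be:

profile = apply(m, 2, function(x) mean(x, na.rm=TRUE))  # per-bin mean across regions

# Split the columns back into one profile per sample and plot them
sb = hdr$sample_boundaries
per_sample = sapply(seq_along(hdr$sample_labels),
                    function(i) profile[(sb[i] + 1):sb[i + 1]])
matplot(per_sample, type="l", lty=1, col=seq_along(hdr$sample_labels),
        xlab="bin", ylab="mean signal")
legend("topright", legend=hdr$sample_labels, lty=1,
       col=seq_along(hdr$sample_labels), cex=0.7)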
I usually prefer creating plots in R, but it'd be a good bit of work just to reproduce a moderately complicated equivalent to plotHeatmap in R, so I would encourage you to take Friederike's advice and tweak the Python code in deepTools instead.
I think deepTools offers pretty nice options for generating very good heatmaps/profile plots. What do you think is missing?
Check this: Visualization of ChIP-seq data using Heatmaps (Updated: 06/10/16)
I agree that the plots are nice and an easy way to visualize the data, but I don't like the aesthetic for publication-quality figures. I think there are far more attractive plots being published using other packages, and I want to go that route (e.g., matplotlib, pheatmap, ggplot2, etc.).
Regarding heatmaps, I can assure you that it will require a lot of fiddling around to achieve similarly good-looking results in R (which is something you've realized already, hence the question, I assume). EDIT: I have never had enough patience so far to go further than what Devon has supplied, and I much prefer R for graphics.
The aspects I tend to want to change about the deepTools heatmaps mostly concern the labels. Those can more easily be changed by going the Python route (i.e., using the original source code and changing the aspects you feel need changing).
You are correct ;)
Anyone have a better solution?
@Kevin has some nice links here: How can I generate Heat Map with dendograms, and PCA analysis in "R Programming" for this type of data?
Have you looked through the links venu provided?
FYI, matplotlib is exactly what deepTools uses to create its plots.
Can you specify which aspects you don't find attractive enough?
Thanks so much for your help everyone! I'll definitely start by tweaking the Python code in deepTools and dig into the R code when I have time.