I would like to elaborate a bit on Woa's answer.
Let's imagine you have the following dataset:
> set.seed(2)
> d = data.frame("B1"=rnorm(100),"B2"=rnorm(100), "B3"=rnorm(100), "B4"=rnorm(100), "B5"=rnorm(100), "B6"=rnorm(100), "B7"=rnorm(100), "B8"=rnorm(100))
> d$id = row.names(d)
> head(d)
B1 B2 B3 B4 B5 B6 B7 B8 id
1 -0.89691455 1.0744594 0.2979836 -0.3181198 -0.2140756 -0.4597894 -1.1150718 1.23874433 1
2 0.18484918 0.2605978 -1.0195522 -0.3154903 -2.7218162 0.6179261 -0.1142184 0.23189621 2
3 1.58784533 -0.3142720 2.8708974 0.8843223 -1.0142618 -0.7204224 -0.8946214 -0.31443788 3
4 -1.13037567 -0.7496301 0.2187100 -1.8854213 -0.8291451 -0.5835119 -0.6540889 1.49970370 4
5 -0.08025176 -0.8621983 -0.9665543 0.7321793 0.8577089 0.2163245 1.1787163 0.06957437 5
6 0.13242028 2.0480403 0.3838382 0.7905447 -0.2385101 1.2449912 0.9515165 1.33403372 6
To plot a histogram of a column using ggplot2, you can use the qplot function:
> library(ggplot2)
> qplot(B1, data=d, geom='histogram')
To plot multiple histograms, you can add a geom_histogram for each property:
> qplot(B1, data=d, geom='histogram', fill=I('green')) + geom_histogram(aes(B2), data=d, fill='red')
Since it would be impractical to add a new geom_histogram for each column, you can melt the dataframe, transforming it to a long format:
> library(reshape2)
> d.long = melt(d, id.vars='id')
> head(d.long)
id variable value
1 1 B1 -0.89691455
2 2 B1 0.18484918
3 3 B1 1.58784533
4 4 B1 -1.13037567
5 5 B1 -0.08025176
6 6 B1 0.13242028
Note how the long format is structured. All the values are stored in the "value" column. The "variable" column keeps track of the original columns. Each data point is also identified by a unique id.
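As a minimal sketch of that structure, the same reshaping can be done by hand with base R's stack(), without any extra package. The names d.small and long.small are made up for this illustration, and only two columns and five rows are used to keep the output short:

```r
# Rebuild a small version of the wide data set (two columns only, for brevity).
set.seed(2)
d.small <- data.frame(B1 = rnorm(5), B2 = rnorm(5))
d.small$id <- row.names(d.small)

# stack() puts all values in one column ("values") and the name of the
# source column in another ("ind"); rename them to match the melted output.
stacked <- stack(d.small[c("B1", "B2")])
long.small <- data.frame(
  id       = rep(d.small$id, times = 2),
  variable = stacked$ind,
  value    = stacked$values
)

# One row per (id, variable) pair: 5 ids x 2 columns = 10 rows.
nrow(long.small)
table(long.small$variable)
```

Each id now appears once per original column, which is what lets a plotting function group the values by "variable".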
Transforming your dataset to a long format is an essential step for plotting multiple distributions together. ggplot2 and many other R functions, such as those used for ANOVA, expect your data to be in the long format. Now that you have a dataset in the long format, you can plot all the histograms in a single statement:
> qplot(value, fill=variable, data=d.long, geom='histogram')
If you look in the documentation for geom_histogram, you will see that there are many ways to arrange the histograms. For example, you can use position='dodge' to put all the values separately:
> qplot(value, fill=variable, data=d.long, position='dodge')
In my opinion, if there are many columns, it is better to use the density geom instead of the histogram, with a degree of transparency:
> qplot(value, fill=variable, data=d.long, geom='density', alpha=I(0.2))
If there are too many columns, one alternative is to plot some histograms on the negative y axis:
> qplot(value, fill=variable, data=subset(d.long, variable %in% c("B1", "B2", "B3", "B4")), position='dodge', geom='density', alpha=I(0.2)) + geom_density(aes(y=-..density..), alpha=0.2, data=subset(d.long, variable %in% c("B5", "B6", "B7", "B8")))
# histogram version:
> qplot(value, fill=variable, data=subset(d.long, variable %in% c("B1", "B2", "B3", "B4")), position='dodge', geom='histogram', alpha=I(0.2)) + geom_histogram(aes(y=-..count..), position='dodge', alpha=0.2, data=subset(d.long, variable %in% c("B5", "B6", "B7", "B8")))
Finally, another approach is to use faceting to plot each property in a different panel:
> qplot(value, fill=variable, facets=~variable, data=d.long)
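qplot() has been deprecated in recent ggplot2 releases, so here is a sketch of the same faceted plot written with ggplot() directly. It assumes ggplot2 is installed; the names wide and long are made up for this example, and the long format is built with base R's stack() so that nothing else is needed:

```r
library(ggplot2)

set.seed(2)
wide <- data.frame(B1 = rnorm(100), B2 = rnorm(100), B3 = rnorm(100))

# Long format built with base R's stack(); columns renamed to match the text.
long <- setNames(stack(wide), c("value", "variable"))

# One panel per original column, coloured by column.
p <- ggplot(long, aes(x = value, fill = variable)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable)
print(p)
```

Setting bins explicitly avoids relying on the default number of bins, which is rarely the best choice for a given data set.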
I prefer to use density plots. Try this:
This is not at all what I want to show. I don't need the distribution, but the actual numbers.
The title of the question mentions histograms, so I assumed you wanted the distribution.
Note that reading the actual numbers off a histogram can be misleading. These numbers depend heavily on the width of the histogram bins, and if the bins are too wide you risk merging two or more different distributions.
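A small sketch of that effect, using a made-up bimodal sample (the data and the two bin widths are assumptions chosen only for illustration):

```r
set.seed(1)
# Hypothetical sample with two well-separated groups, centred at 1 and 3.
x <- c(rnorm(200, mean = 1, sd = 0.4), rnorm(200, mean = 3, sd = 0.4))

# Same data, two bin widths; plot = FALSE returns the counts without drawing.
narrow <- hist(x, breaks = seq(-4, 8, by = 0.25), plot = FALSE)
wide   <- hist(x, breaks = seq(-4, 8, by = 4),    plot = FALSE)

narrow$counts  # many bins: the two groups show up as separate peaks
wide$counts    # three bins: almost everything falls into the middle one
```

The counts reported for a bin are only meaningful together with the breaks that produced them.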
Passing add=T in the next call to hist() will add the second histogram to the same plotting area. But so far this does not look like a good method of displaying the distribution. I'd consider either removing the 0 size inserts, using kernel density estimates, or transforming your data (or some combination of the three).
No, I can't, as the 0s are important for the results (it's a long story) :-)
You can still make a useful plot though, e.g. with a broken y-axis that jumps from ~50 to ~2100, unless the only point of the plot is to emphasise that there's a lot of 0s.
OK, I can do that. But what about combining the histograms together?
add=T in your subsequent call to hist()
With add=T it creates a stacked barplot. I would like to have the bars next to each other for each of the groups/data sets.
No, it overplots a second histogram on the same axes (the bars aren't stacked, just plotted on top of each other). It seems what you're really asking for is just a simple barplot (not histogram) with beside=TRUE
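A minimal sketch of that suggestion, assuming two made-up data sets a and b and an arbitrary bin width of 5000: compute the counts for each data set over the same breaks with hist(plot=FALSE), then pass the count matrix to barplot with beside=TRUE.

```r
set.seed(3)
# Hypothetical data sets standing in for the groups being compared.
a <- rexp(200, rate = 1 / 3000)
b <- rexp(200, rate = 1 / 6000)

# Use the same breaks for both data sets so that the bins line up.
breaks <- seq(0, ceiling(max(a, b) / 5000) * 5000, by = 5000)
counts <- rbind(
  a = hist(a, breaks = breaks, plot = FALSE)$counts,
  b = hist(b, breaks = breaks, plot = FALSE)$counts
)
colnames(counts) <- head(breaks, -1)  # label each bin by its lower edge

# beside = TRUE draws one bar per data set within each bin, side by side.
barplot(counts, beside = TRUE, col = c("green", "red"),
        legend.text = rownames(counts),
        xlab = "bin lower edge", ylab = "count")
```

Because the rows of the matrix are the data sets, each keeps its own colour across all bins.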
Important or not, this plot in this shape doesn't say much. I suggest cutting the y-axis with ylim=c(0,100) and adding a text box to show the number of zero values.
Well, to be honest, it does! The idea behind it is to show that some data sets have very few hits in the bigger bins (15000 onwards). Other data sets show a lot more hits on the right-hand side. This is why I would like to plot them together, but keep the colors (or use completely different colors).
As it stands this is an R programming question. Please explain the relevance to a bioinformatics research problem.