Hi all,
I created these histograms from the same data in RStudio, and I am wondering why the ones generated by ggplot2 and base R look different.
This is my understanding so far:
Base R (hist()): The number of bins is determined automatically (Sturges' rule by default) unless specified otherwise using the breaks argument. Even then, a single number such as breaks = 30 is only a suggestion, because the breakpoints are rounded to "pretty" values. This can lead to different bin widths compared to ggplot2.
ggplot2: I explicitly set the number of bins (bins = 30), so the plotted range is split into 30 equal-width bins. In base R, unless you supply the exact breakpoints, the binning algorithm may create fewer or more bins depending on the range of the data.
Could someone please clarify this based on my code? Thanks in advance.
#### code for R base
#####plot (A)
hist(rsv_n_1_9hpi$polya_length,
#xlab = "HRSV_9_hpi",
cex.lab = 1.5,
cex.axis = 1.5,
cex.main = 1.5,
cex.sub = 1.5,
ylab="Count",
xlab="Poly(A) tail length")
#####Plot(B)
hist(rsv_n_1_9hpi$polya_length,
#xlab = "HRSV_9_hpi",
cex.lab = 1.5,
cex.axis = 1.5,
cex.main = 1.5,
cex.sub = 1.5,
ylab="Count",
xlab="Poly(A) tail length",
breaks = 30,
ylim = c(0, 1500))
### code for ggplot2
####plot (C)
library(ggplot2)
b <- ggplot(rsv_n_1_9hpi, aes(x = polya_length)) +
geom_histogram(bins = 30, fill = "blue",
color = "black", alpha = 0.7) +
#xlab("Poly(A) tail length") +
#ylab("Count") +
theme_light()+
# ylim(0,1500)+
labs(title = "HRSV (9_hpi, n=1)")+
scale_x_continuous(limits = c(0, 1200),breaks = c(0, 200, 400, 600,800,1000))+
theme(plot.title = element_text( size = 15),
axis.text = element_text(colour = "black", size=13),
axis.title.y = element_text(size = 13),
legend.text = element_text(size = 13),
strip.text.x = element_text(size = 13),
axis.title.x = element_text(size = 13))
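To see how the two binning behaviours differ, here is a minimal sketch rather than a definitive answer. The vector `x` is made-up, right-skewed data standing in for `rsv_n_1_9hpi$polya_length`, which is not available here. It compares the default `hist()` binning, `breaks = 30` as a single number, and an explicit vector of 31 breakpoints, which is one way to force exactly 30 equal-width bins. Passing the same vector to `geom_histogram(breaks = ...)` should then make both plots use identical bins.

```r
set.seed(1)
# Hypothetical stand-in for rsv_n_1_9hpi$polya_length (assumption, not the real data)
x <- rexp(2000, rate = 1 / 100)

# Default: hist() picks the number of bins itself (Sturges' rule)
h_default <- hist(x, plot = FALSE)

# A single number is only a suggestion; breakpoints are rounded to "pretty" values
h_suggest <- hist(x, breaks = 30, plot = FALSE)

# An explicit vector of 31 breakpoints gives exactly 30 equal-width bins
brks <- seq(min(x), max(x), length.out = 31)
h_exact <- hist(x, breaks = brks, plot = FALSE)

cat("bins, default:       ", length(h_default$counts), "\n")
cat("bins, breaks = 30:   ", length(h_suggest$counts), "\n")
cat("bins, explicit breaks:", length(h_exact$counts), "\n")

# Optional: reuse the same breakpoints in ggplot2 so both plots share bins
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(data.frame(x = x), ggplot2::aes(x = x)) +
    ggplot2::geom_histogram(breaks = brks, fill = "blue",
                            color = "black", alpha = 0.7)
}
```

With your own data, replace `x` with `rsv_n_1_9hpi$polya_length`. Note also that `scale_x_continuous(limits = ...)` drops values outside the limits, which can be another source of differences between the plots.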
Thanks. But could you please explain why the height of the bars in the plot changes when I adjust the number of bins?
Is this explanation correct? The height of the bars changes because, when you increase or decrease the number of bins, the same data gets distributed across more or fewer bars.
More bins: The data is divided into narrower intervals, so each bin tends to contain fewer data points, resulting in shorter bars. Fewer bins: The data is grouped into wider intervals, so each bin tends to contain more data points, resulting in taller bars.
If you took all the bars in each plot and stacked them on top of each other, you should get the total number of points in your dataset.
Here's a random set of 1000 numbers from a uniform distribution between 0 and 100. As you increase the number of bins, your average bar height decreases.
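A minimal sketch of that example follows. The seed and the particular bin counts (5, 10, 20, 50) are arbitrary choices for illustration. For each bin count it checks that the bar heights sum to the 1000 data points and reports the average bar height, which is simply 1000 divided by the number of bins.

```r
set.seed(42)
# 1000 random numbers from a uniform distribution between 0 and 100
x <- runif(1000, min = 0, max = 100)

# Bin counts to compare (arbitrary illustrative values)
bin_counts <- c(5, 10, 20, 50)

summary_table <- do.call(rbind, lapply(bin_counts, function(n) {
  # n equal-width bins spanning 0 to 100 need n + 1 breakpoints
  brks <- seq(0, 100, length.out = n + 1)
  h <- hist(x, breaks = brks, plot = FALSE)
  data.frame(bins = n,
             total = sum(h$counts),        # all bars stacked together
             mean_height = mean(h$counts)) # average bar height
}))

print(summary_table)
```

The `total` column stays at 1000 in every row, while `mean_height` falls as `bins` grows. Dropping `plot = FALSE` draws the histograms so you can see the bars getting shorter.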