I've found that the data structure required for plotting can sometimes be confusing for people first learning R. More specifically, the concept of long versus wide data. In order to illustrate this concept, I will show an example of making a simple box plot of expression from a toy RNA-seq count matrix.
Other R tutorials:
- data.frames in R: How vectorization can save you time and heartache
- Factors in R - How do they work, and how did I break my data?
- Pipes in R: An introduction and some advanced tips and tricks.
I first generate some example data to work with.
mat <- cbind(
t(replicate(4, rnorm(3, runif(1, 10, 500), runif(1, 1, 25)))),
t(replicate(4, rnorm(3, runif(1, 10, 500), runif(1, 1, 25))))
)
dimnames(mat) <- list(
sprintf("ENS%06d", seq(100, 103)),
c(sprintf("EV_%02d", seq_len(3)), sprintf("KD_%02d", seq_len(3)))
)
> mat
EV_01 EV_02 EV_03 KD_01 KD_02 KD_03
ENS000100 273.1218 274.6152 264.7982 216.8490 198.8390 220.9580
ENS000101 514.5558 456.3855 487.5547 342.2199 329.2495 339.9095
ENS000102 332.7703 342.6765 338.4658 291.3258 291.4720 299.3414
ENS000103 476.9946 479.9996 515.0222 200.8888 150.8901 147.1827
When thinking about long versus wide data, I like to think about how many "variables" my data has. In this count matrix we have genes (row names), samples (column names), and expression (matrix elements). To plot this data with ggplot2, each one of these variables should be their own column, or in long format. Right now the samples variable (and accompanying expression data) is spread among multiple columns, making this data wide format.
There are multiple ways to convert this data from wide to long format. First though, we should turn this count matrix into a data.frame, making sure that the gene names are their own variable column (instead of rownames). I provide base R, data.table, and tidyverse solutions, but will be using the tidyverse functions as display.
# base R
df <- as.data.frame(mat)
df$genes <- rownames(df)
rownames(df) <- NULL
# data.table
library("data.table")
DT <- as.data.table(mat, keep.rownames="genes")
# Tidyverse
library("dplyr")
library("tidyr")
df <- as_tibble(mat, rownames="genes")
> df
# A tibble: 4 x 7
genes EV_01 EV_02 EV_03 KD_01 KD_02 KD_03
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENS000100 273. 275. 265. 217. 199. 221.
2 ENS000101 515. 456. 488. 342. 329. 340.
3 ENS000102 333. 343. 338. 291. 291. 299.
4 ENS000103 477. 480. 515. 201. 151. 147.
Now that you have a properly formatted data.frame, you can change your data from wide to long format. For all methods described below, there will be a few important arguments. First, you need to specify the columns you want to collapse, in this case the columns with the expression values. Second, the column names (samples) will be placed into a single column, so you must name this column. Third, the elements of the collapsed columns (expression) will also be placed into a single column, so you must name that as well. I use grep to get all of the column names I want to collapse, but they can also be specified manually in a vector.
# base R
df <- reshape(
df, direction="long", idvar="genes", v.names="expression", timevar="sample",
varying=grep("EV|KD", colnames(df), value=TRUE),
times=grep("EV|KD", colnames(df), value=TRUE)
)
# data.table
DT <- melt(
DT, measure.vars=grep("EV|KD", colnames(DT), value=TRUE),
variable.name="sample", value.name="expression"
)
# Tidyverse
df <- pivot_longer(df, -genes, names_to="sample", values_to="expression")
> head(df)
# A tibble: 6 x 3
genes sample expression
<chr> <chr> <dbl>
1 ENS000100 EV_01 273.
2 ENS000100 EV_02 275.
3 ENS000100 EV_03 265.
4 ENS000100 KD_01 217.
5 ENS000100 KD_02 199.
6 ENS000100 KD_03 221.
You can now notice that our variables genes, samples, and expression are all represented in columns. This means our data is now in long format. Before we plot we are going to separate the sample type from the replicate number, so that the sample types can be plotted together. This can actually be done simultaneously with the pivoting functions with this example, but is separated here for clarity.
# base R
df <- cbind(df, data.frame(do.call('rbind', strsplit(as.character(df$sample),'_'))))
df$sample <- NULL
colnames(df)[colnames(df) %in% c("X1", "X2")] <- c("sample", "replicate")
# data.table
DT[, c("sample", "replicate") := tstrsplit(sample, "_")]
# Tidyverse
df <- separate(df, sample, into=c("sample", "replicate"), sep="_")
> head(df)
# A tibble: 6 x 4
genes sample replicate expression
<chr> <chr> <chr> <dbl>
1 ENS000100 EV 01 273.
2 ENS000100 EV 02 275.
3 ENS000100 EV 03 265.
4 ENS000100 KD 01 217.
5 ENS000100 KD 02 199.
6 ENS000100 KD 03 221.
Finally, we can use out newly formatted long data for plotting using ggplot.
library("ggplot2")
ggplot(df, aes(x=genes, y=log2(expression), fill=sample)) +
geom_boxplot(position="dodge", width=0.25) +
theme(text=element_text(size=16), axis.text.x=element_text(angle=45, hjust=1))
For more on pivoting data see vignette("datatable-reshape", package="data.table")
for data.table, and vignette("pivot", package="tidyr")
for tidyverse.
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS/LAPACK: /geode2/home/u070/rpolicas/Carbonate/.conda/envs/bioconductor/lib/libopenblasp-r0.3.10.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.3.2 tidyr_1.1.1 dplyr_1.0.1 data.table_1.12.8
loaded via a namespace (and not attached):
[1] withr_2.2.0 crayon_1.3.4 grid_4.0.2 R6_2.4.1
[5] gtable_0.3.0 lifecycle_0.2.0 magrittr_1.5 scales_1.1.1
[9] pillar_1.4.6 rlang_0.4.7 vctrs_0.3.2 generics_0.0.2
[13] ellipsis_0.3.1 glue_1.4.1 munsell_0.5.0 purrr_0.3.4
[17] compiler_4.0.2 colorspace_1.4-1 pkgconfig_2.0.3 tidyselect_1.1.0
[21] tibble_3.0.3