Hello biostars,
I'm currently working on my master's thesis, and I got feedback on one of my plots suggesting that I show the relation between my groups in the boxplot using ggpubr. I've been following the tutorial on datanovia, but for some reason the function can't find one of my columns.
I've made a subset of my data here:
TSS_TE_subset <- data.frame(
  genome = factor(c("A1","A1","A1","A1","A1","D1","D1","D1","D1","D1","JR2","JR2","JR2","JR2","JR2"), levels=c("A1","JR2","D1")),
  distance = c(3299, 2999, 2117, 4228, 2565, 3260, 2515, 578, 1893, 612, 1333, 771, 2093, 1886, 192)
)
I want to compare my groups with a Wilcoxon test, using JR2 as the reference group:
TSS_TE_subset_stat <- wilcox_test(TSS_TE_subset, distance ~ genome, ref.group="JR2") %>% add_significance()
Which creates a table looking like this:
# A tibble: 2 x 9
  .y.      group1 group2    n1    n2 statistic     p p.adj p.adj.signif
  <chr>    <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl> <chr>
1 distance JR2    A1         5     5         0 0.008 0.016 *
2 distance JR2    D1         5     5         9 0.548 0.548 ns
Which looks very much like it is supposed to look, as far as I know. However, when I use my table with data and this statistics table to create a boxplot with the P-values, it doesn't work:
TSS_TE_subset_boxplot <- ggplot(TSS_TE_subset, aes(x=genome, y=distance, fill=genome)) +
  geom_boxplot() +
  theme_classic() +
  labs(
    x="Verticillium genome",
    y="TSS-TE distance",
    title="A upregulated"
  ) +
  scale_fill_manual(values=genome_colors) +
  stat_pvalue_manual(TSS_TE_subset_stat, label = "{p.adj}", tip.length = 0.01, y.position=5000) +
  geom_jitter(width=0.4, height=0, shape=".", alpha=0.4)
> TSS_TE_subset_boxplot
Error in FUN(X[[i]], ...) : object 'genome' not found
Now I can see it says it cannot find something called "genome," which is a column in my dataframe. I know it's there:
> glimpse(TSS_TE_subset)
Rows: 15
Columns: 2
$ genome   <fct> A1, A1, A1, A1, A1, D1, D1, D1, D1, D1, JR2, JR2, JR2, JR2, JR2
$ distance <dbl> 3299, 2999, 2117, 4228, 2565, 3260, 2515, 578, 1893, 612, 1333, 771, 2093, 1886, 192
When looking a bit into it, I found some people grouping the data before doing the statistics, but that created a very similar error even before making the plot:
TSS_TE_subset_stat <- dplyr::group_by(TSS_TE_subset, genome) %>% wilcox_test(distance ~ genome, ref.group="JR2") %>% add_significance()
Error in `mutate()`:
! Problem while computing `data = map(.data$data, .f, ...)`.
Caused by error in `stop_subscript()`:
! Can't extract columns that don't exist.
x Column `genome` doesn't exist.
Run `rlang::last_error()` to see where the error occurred.
Honestly, I'm at a loss here. There must be something wrong with my data, but I cannot figure out what. glimpse() and class() confirm that the dataframe and its columns are valid, and RStudio has no issues displaying the data. When I generate a boxplot without ggpubr, it works fine:
TSS_TE_subset_stat_2 <- kruskal.test(distance ~ genome, TSS_TE_subset)$p.value
TSS_TE_subset_boxplot_2 <- ggplot(TSS_TE_subset, aes(x=genome, y=distance, fill=genome)) +
  geom_boxplot() +
  theme_classic() +
  labs(
    x="Verticillium genome",
    y="TSS-TE distance",
    title="A upregulated",
    subtitle=paste0("Kruskal-Wallis P-value: ",as.character(signif(TSS_TE_subset_stat_2, digits=5)))
  ) +
  scale_fill_manual(values=genome_colors) +
  geom_jitter(width=0.4, height=0, shape=".", alpha=0.4)
But that doesn't show the individual relations to the reference, because the Kruskal-Wallis test compares all three groups at once rather than comparing each of the other two genomes to the reference strain individually.
Does anyone know what is going wrong and how to fix it, or to add P-values to a boxplot in a similar way? Many thanks in advance.
Also, in case that may be the culprit:
- R version 4.1.1
- RStudio version 1.4.1717
- Windows 10
Modified from here (http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/76-add-p-values-and-significance-levels-to-ggplots/):
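As a minimal sketch of one way to get the plot from the question working (not necessarily the code this answer originally contained): the likely cause of `object 'genome' not found` is that `fill=genome` sits in the top-level `ggplot()` call, so every layer inherits it, including the `stat_pvalue_manual()` layer, whose data is the statistics table and has no `genome` column. Mapping `fill` inside `geom_boxplot()` only should avoid that. The `genome_colors` palette below is an assumption, since it is not defined in the question.

```r
library(ggplot2)
library(ggpubr)   # stat_pvalue_manual()
library(rstatix)  # wilcox_test(), add_significance()

# Same subset as in the question
TSS_TE_subset <- data.frame(
  genome = factor(rep(c("A1", "D1", "JR2"), each = 5),
                  levels = c("A1", "JR2", "D1")),
  distance = c(3299, 2999, 2117, 4228, 2565, 3260, 2515, 578, 1893, 612,
               1333, 771, 2093, 1886, 192)
)

# Assumption: genome_colors is not shown in the question, so this palette is made up
genome_colors <- c(A1 = "#1b9e77", JR2 = "#d95f02", D1 = "#7570b3")

# Pairwise Wilcoxon tests against JR2, run on the ungrouped data
TSS_TE_subset_stat <- add_significance(
  wilcox_test(TSS_TE_subset, distance ~ genome, ref.group = "JR2")
)

# fill is mapped only in geom_boxplot(), so the p-value layer (which draws
# from TSS_TE_subset_stat) is never asked for a genome column
TSS_TE_subset_boxplot <- ggplot(TSS_TE_subset, aes(x = genome, y = distance)) +
  geom_boxplot(aes(fill = genome)) +
  geom_jitter(width = 0.4, height = 0, shape = ".", alpha = 0.4) +
  stat_pvalue_manual(TSS_TE_subset_stat, label = "{p.adj}",
                     tip.length = 0.01, y.position = 5000) +
  scale_fill_manual(values = genome_colors) +
  theme_classic() +
  labs(
    x = "Verticillium genome",
    y = "TSS-TE distance",
    title = "A upregulated"
  )

TSS_TE_subset_boxplot
```

The `group_by(genome)` attempt in the question probably fails for a related reason: grouping is meant for running the test separately within each level of some other variable, and the grouping column is not available inside each per-group chunk, so a formula that uses `genome` cannot find it. For a comparison between genomes, the ungrouped call above should be enough.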
That works great! I have adapted it into the figure that I want to create. Thanks for your help!