I have created some "grouped" boxplots in R showing the expression of a subset of 12 genes across 3 cluster groups of samples, based on the result of a previous clustering analysis. The gene expression values are VST-transformed HTSeq counts. The code used to create the included figure:
head(dat) # the data frame of the genes/features in the columns and the 1 categorical variable
NCOA1 CCT7 UGP2 ACADM
TCGA-3L-AA1B-01A-11R-A37K-07 10.329101 13.32549 11.29148 9.800935
TCGA-AU-6004-01A-11R-1723-07 10.793586 12.91526 11.15353 9.919037
TCGA-T9-A92H-01A-11R-A37K-07 10.198103 13.73892 12.14109 10.518959
TCGA-CK-5913-01A-11R-1653-07 10.704988 13.59675 11.73051 10.586667
TCGA-AD-6889-01A-11R-1928-07 10.284720 14.10074 11.52742 10.707753
TCGA-CM-5860-01A-01R-1653-07 9.863118 13.23791 11.66066 10.566241
GSTP1 CAT KIT CD44 BOP1
TCGA-3L-AA1B-01A-11R-A37K-07 13.47565 10.88197 9.476305 13.50063 11.60254
TCGA-AU-6004-01A-11R-1723-07 13.63630 11.90729 7.080705 13.95125 12.05972
TCGA-T9-A92H-01A-11R-A37K-07 14.59698 12.16112 9.445610 13.15624 12.31314
TCGA-CK-5913-01A-11R-1653-07 13.39063 11.65145 7.198912 14.24373 11.97289
TCGA-AD-6889-01A-11R-1928-07 14.21625 11.38295 6.053052 13.62892 11.68580
TCGA-CM-5860-01A-01R-1653-07 14.33711 11.63726 7.670905 13.61599 11.26274
UGDH ACADS NPM1 Cluster_Group
TCGA-3L-AA1B-01A-11R-A37K-07 10.67613 10.48779 14.69374 EC1
TCGA-AU-6004-01A-11R-1723-07 11.06180 11.12601 14.27851 EC3
TCGA-T9-A92H-01A-11R-A37K-07 10.18600 10.87468 14.64535 EC1
TCGA-CK-5913-01A-11R-1653-07 11.42366 11.13081 14.82264 EC3
TCGA-AD-6889-01A-11R-1928-07 11.99969 11.38377 14.70659 EC3
TCGA-CM-5860-01A-01R-1653-07 11.75133 10.47124 15.10752 EC3
library(reshape2) # provides melt()
library(ggplot2)
# reshape to long format: one row per sample/gene pair
df.m <- melt(dat, id.vars = "Cluster_Group")
p <- ggplot(data = df.m, aes(x = variable, y = value))
p <- p + geom_boxplot(aes(fill = Cluster_Group))
# overlay the individual samples, dodged to line up with their cluster's box
p <- p + geom_point(aes(group = Cluster_Group), position = position_dodge(width = 0.75))
p <- p + facet_wrap(~variable, scales = "free")
p <- p + xlab("gene symbol") + ylab("vst transformed counts") + ggtitle("Gene Expression Differences in 3 Patient Clusters")
p <- p + guides(fill = guide_legend(title = "Cluster_Membership"))
Here is the link to the created plot:
https://www.dropbox.com/s/nkh2qmth3szsrda/3clusters.ggplot2.12genes.survival.jpeg?dl=0
My main question is whether I can also include significance levels between the 3 groups in each boxplot for each gene, in order to illustrate any significant differences in mean expression in any of the pairwise group comparisons. My aim is to further select and prioritize some genes, based on a parallel survival analysis of these cluster groups, as all 12 genes were used.
Thank you in advance
If I understand the question correctly, you may want to explore the ggsignif package, which I believe extends ggplot2's functionality by adding significance bars as a single geom.

Dear aays, thank you for your suggestion. I will take a detailed look at this package.
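For anyone landing here later, a minimal sketch of how the ggsignif suggestion might be applied to this kind of plot. Several things are assumptions rather than taken from the original post: the toy data frame below is simulated and only stands in for `dat`, the middle cluster label "EC2" is guessed (only EC1 and EC3 appear in the printed rows), and the cluster is moved onto the x axis, because `geom_signif()` compares groups along the x axis rather than dodged boxes. A t-test is used since the question is about mean expression; the package default is a Wilcoxon test.

```r
library(reshape2)
library(ggplot2)
library(ggsignif)

# Simulated stand-in for `dat` (NOT the real TCGA values); "EC2" is an assumed label
set.seed(1)
n <- 60
dat <- data.frame(
  NCOA1 = rnorm(n, mean = 10.5, sd = 0.4),
  KIT = rnorm(n, mean = 8, sd = 1),
  Cluster_Group = factor(rep(c("EC1", "EC2", "EC3"), each = n / 3))
)

# Long format: one row per sample/gene pair
df.m <- melt(dat, id.vars = "Cluster_Group")

# All pairwise cluster comparisons: EC1-EC2, EC1-EC3, EC2-EC3
comparisons <- combn(levels(df.m$Cluster_Group), 2, simplify = FALSE)

# Cluster on the x axis, one facet per gene, so the test is run within each gene
p <- ggplot(df.m, aes(x = Cluster_Group, y = value)) +
  geom_boxplot(aes(fill = Cluster_Group)) +
  geom_signif(
    comparisons = comparisons,
    test = "t.test",          # compares means; the default is "wilcox.test"
    map_signif_level = TRUE,  # show stars instead of raw p-values
    step_increase = 0.1       # stack the brackets so they do not overlap
  ) +
  facet_wrap(~variable, scales = "free_y") +
  xlab("cluster") +
  ylab("vst transformed counts") +
  guides(fill = guide_legend(title = "Cluster_Membership"))

print(p)
```

As far as I know, the p-values drawn this way are not adjusted for multiple testing, so with 12 genes and 3 comparisons each it is probably safer to treat the brackets as a visual guide and do the actual gene prioritization with a proper adjusted test.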