I have several groups of RNA-seq data that I'm trying to compare with each other using ggplot2 in R. The data consist of several columns of RPKM values, each column a different group of samples, e.g. column 1: gene 1 RPKMs in normal; column 2: gene 1 RPKMs in tumor; etc.
For example, using a small excerpt of the data:
library(ggplot2)
library(reshape2)  # provides melt()
df = read.table(text="G1 G1.1 G1.2 G1.3 G2 G2.1 G2.2 G2.3
1 0 3 4 3 2 3 1
2 'NA' 5 5 5 2 1 2
2 'NA' 2 1 2 1 2 5", header=TRUE)
dfmelt<-melt(df)
ggplot(dfmelt, aes(variable, value, fill=variable)) +
geom_boxplot() +
theme(axis.text.x=element_text(angle=90))+
scale_x_discrete(labels=c('C1','C2','C3','C4','C5','C6','C7','C8'))+
scale_fill_manual(values=rep(c("red","green","blue","yellow"),2))+
stat_summary(fun = median, geom = "point", position = position_dodge(width = .9))+  # on ggplot2 older than 3.3.0 use fun.y instead of fun
scale_y_log10()
The problem occurs when I draw boxplots of the data in ggplot2 on a log10 y scale, which is necessary because of the data distribution. ggplot2 appears to simply drop the zero values, with the messages:
Removed x rows containing non-finite values (stat_boxplot)
Removed x rows containing missing values (stat_summary)
From what I've read, ggplot2 takes the log of 0, gets -Inf, and drops those values. Is this a concern in RNA-seq expression analysis? If so, how do I best handle it to get what I want without distorting the data?
Just add a small number (a pseudocount) to all values before taking the log, for example 1.
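A minimal sketch of the pseudocount idea, reusing the example data from the question. The names `pseudocount` and `shifted` are only illustrative, the value 1 is an assumption rather than a recommendation for every data set, and NA is written unquoted here for simplicity. Zeros become 1, so log10 gives 0 instead of -Inf and the rows are no longer dropped; genuine NA values are still left out.

```r
library(ggplot2)
library(reshape2)  # provides melt()

df <- read.table(text = "G1 G1.1 G1.2 G1.3 G2 G2.1 G2.2 G2.3
1 0 3 4 3 2 3 1
2 NA 5 5 5 2 1 2
2 NA 2 1 2 1 2 5", header = TRUE)

# Long format: one row per (column, value) pair
dfmelt <- melt(df)

# Hypothetical pseudocount; shifts every value so that zeros become finite on a log scale
pseudocount <- 1
dfmelt$shifted <- dfmelt$value + pseudocount

p <- ggplot(dfmelt, aes(variable, shifted, fill = variable)) +
  geom_boxplot(na.rm = TRUE) +               # silently skip the real NA entries
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_log10() +
  ylab("RPKM + 1 (log10 scale)")             # say on the axis that the data were shifted
print(p)
```

Keep in mind that the shift compresses differences between low values (0 and 1 RPKM end up as 1 and 2), so it is worth stating the pseudocount in the axis label or figure legend.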