Make a volcano plot from multiple data frames with the same genes but different fold changes
2.0 years ago
Chironex ▴ 50

My question is probably not new, but I haven't found a fully satisfactory answer. I have multiple data frames with the same columns. Each one represents a cluster, so the genes are the same, but they are expressed differently, with different fold changes and p-values. I would like to create a volcano plot that combines all these data frames and plots them together, coloring the genes by the cluster they belong to. The catch is that a gene will not be unique to one data frame; most genes will appear in several. Is it possible to do this?

r • 1.9k views
2.0 years ago
seidel 11k

It's quite possible to do what you want. The base plot functions in R make overplotting easy (adding sets of points to a plot layer by layer). If your data frames are in a list, you can use the apply family of functions. Here's a not-so-pretty example using mapply(), which can iterate through two things at once (a list of data frames and a vector of colors).

# create a list of data frames,
# each with uniquely skewed data
set.seed(1)  # make the simulated data reproducible
df_list <- lapply(1:5, function(i){
  x <- rnorm(50) + rnorm(1,0,2)
  y <- abs(rnorm(50))
  df <- data.frame(ex=x, sig=y)
  rownames(df) <- paste0("g",1:50)
  return(df)
})

# create a blank plot whose axis limits cover every data frame
xr <- range(sapply(df_list, function(df) range(df$ex)))
yr <- range(sapply(df_list, function(df) range(df$sig)))
plot(1,1, type="n", xlim=xr, ylim=yr, xlab="logFC", ylab="sig")

# create some colors
plotcolors <- rainbow(5)
# iterate through df_list and plot the points for each data frame
# (invisible() hides the list of NULLs that mapply() returns)
invisible(mapply(function(df,p){
  points(df$ex, df$sig, col=p, pch=19)
}, df_list, plotcolors))

The first section uses lapply() to create a list of 5 data frames, each with 2 columns containing 50 data points that vaguely resemble volcano plot data skewed in a given direction. In this case, each data frame has the same set of genes (but it doesn't matter what the gene names are). The second part uses mapply() to loop through the list of data frames and the vector of plot colors in parallel, drawing each data frame's points on the plot in its own color.

Of course, if you had a few data frames and wanted to simply plot them one by one (no loop), it's straightforward:

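# df1, df2, ... are your own data frames, each with columns logFC and significance
# plot all the data in grey first (here assumed to be df1 and df2 stacked together)
all_df <- rbind(df1, df2)
plot(all_df$logFC, all_df$significance, col="grey")
# overplot your first cluster: df1
points(df1$logFC, df1$significance, col="red")
# overplot your second cluster: df2
points(df2$logFC, df2$significance, col="blue")
# etc.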
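
With real results, the y axis of a volcano plot is usually -log10 of the p-value, and a legend helps tell the clusters apart. Below is a minimal sketch of the same overplotting idea. It assumes each data frame has columns named logFC and pvalue (hypothetical names, adjust to your data) and that the list is named by cluster; the data are simulated.

```r
# Simulated stand-in for real results; the column names logFC and pvalue are assumptions
set.seed(42)
clusters <- lapply(1:3, function(i) {
  data.frame(gene   = paste0("g", 1:50),
             logFC  = rnorm(50, mean = i - 2),
             pvalue = runif(50))
})
names(clusters) <- paste0("cluster", 1:3)

# Convert p-values to the usual volcano y axis
clusters <- lapply(clusters, function(df) {
  df$sig <- -log10(df$pvalue)
  df
})

# Axis limits that cover every cluster
xr <- range(unlist(lapply(clusters, `[[`, "logFC")))
yr <- range(unlist(lapply(clusters, `[[`, "sig")))

# One color per cluster, then overplot cluster by cluster
cols <- setNames(rainbow(length(clusters)), names(clusters))
plot(1, 1, type = "n", xlim = xr, ylim = yr,
     xlab = "logFC", ylab = "-log10(p-value)")
for (cl in names(clusters)) {
  points(clusters[[cl]]$logFC, clusters[[cl]]$sig, col = cols[[cl]], pch = 19)
}
legend("topright", legend = names(clusters), col = cols, pch = 19, bty = "n")
```

Naming the list and the color vector by cluster keeps the legend and the point colors in sync even if the order of the data frames changes.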

Hi, thank you for the answer, this is exactly what I was looking for. For a better visualization, is it possible to do this in ggplot?


Yes, ggplot is possible, but then you have the perpetual riddle of getting the data into the right arrangement for ggplot, as @MingTang mentions. Which is just a different kind of problem. My brain doesn't solve those naturally without going for a walk. But I'm sure it's easy for some people here.
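
One arrangement that tends to work is a single "long" data frame in which an extra column records the cluster each row came from; ggplot can then map that column to color. A minimal sketch, assuming the ggplot2 package is installed and reusing the simulated columns ex and sig from the answer above (the gene and cluster columns are names introduced here for illustration):

```r
# Simulated list of data frames, named by cluster (stand-in for real results)
set.seed(1)
df_list <- lapply(1:3, function(i) {
  data.frame(ex = rnorm(50) + i - 2, sig = abs(rnorm(50)),
             row.names = paste0("g", 1:50))
})
names(df_list) <- paste0("cluster", 1:3)

# Stack into one long data frame, recording gene and cluster as columns
long_df <- do.call(rbind, Map(function(df, cl) {
  data.frame(gene = rownames(df), cluster = cl, ex = df$ex, sig = df$sig)
}, df_list, names(df_list)))
rownames(long_df) <- NULL

# Draw the plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long_df, aes(x = ex, y = sig, colour = cluster)) +
    geom_point(alpha = 0.7) +
    labs(x = "logFC", y = "sig") +
    theme_bw()
  print(p)
}
```

Because the gene name lives in an ordinary column rather than in the row names, the same gene can appear once per cluster without any clash.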

2.0 years ago
Ming Tommy Tang ★ 4.5k

You will need to prefix the gene name with the cluster ID; you can then just concatenate all the data frames and plot as usual.
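
A minimal sketch of that idea in base R, assuming gene names are stored as row names and the list of data frames is named by cluster (the id, gene and cluster columns are hypothetical names, and the data are simulated). Prefixing keeps every row identifiable after concatenation even though each gene appears once per cluster:

```r
# Simulated list of data frames, named by cluster (stand-in for real results)
set.seed(1)
df_list <- lapply(1:3, function(i) {
  data.frame(logFC = rnorm(20) + i - 2, sig = abs(rnorm(20)),
             row.names = paste0("g", 1:20))
})
names(df_list) <- c("A", "B", "C")

# Prefix each gene name with its cluster id, then concatenate
combined <- do.call(rbind, lapply(names(df_list), function(cl) {
  df <- df_list[[cl]]
  data.frame(id = paste(cl, rownames(df), sep = "_"),
             gene = rownames(df), cluster = cl,
             logFC = df$logFC, sig = df$sig)
}))

# Plot as usual, coloring points by cluster
cl_factor <- factor(combined$cluster)
cols <- rainbow(nlevels(cl_factor))
plot(combined$logFC, combined$sig, col = cols[cl_factor], pch = 19,
     xlab = "logFC", ylab = "sig")
legend("topright", legend = levels(cl_factor), col = cols, pch = 19, bty = "n")

# The same gene can still be looked up across clusters
subset(combined, gene == "g7")
```

Keeping the unprefixed gene name in its own column alongside the prefixed id makes it easy to pull out one gene's fold change in every cluster.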
