Tutorial:Creating Interactive Volcano Plots with R and Plot.ly
0
41
Entering edit mode
8.2 years ago
steve ★ 3.5k

Experienced Bioinformaticians are probably familiar with the standard technique for creating volcano plots in R. Now there is a fun and interactive alternative available using the Plot.ly library. This code sample will demonstrate how to use this library to create an interactive plot.

To preview the full code + interactive plot, see my post here: https://stevekm.github.io/2016/09/24/Plot.ly-Volcano-Plot.html

I also have the same page listed in a Gist (minus the interactive plot output) here:

plotly R volcano-plot • 46k views
ADD COMMENT
2
Entering edit mode

I have the same problem the colnames(diff_df) looks like normal to me:

"external_gene_name" "Fold"               "FDR"                "group"

and head(diff_df) looks like this:

head(diff_df)
  external_gene_name     Fold      FDR                  group
1             CXCL12 3.471136 1.00e-20 Significant&FoldChange
2             ADORA1 3.175682 3.01e-32 Significant&FoldChange
3              KCNK5 2.953598 7.44e-14 Significant&FoldChange
4             MAMLD1 2.786176 6.65e-16 Significant&FoldChange
5              KCNQ5 2.777046 9.25e-14 Significant&FoldChange
6              RBBP8 2.758790 4.84e-45 Significant&FoldChange
ADD REPLY
0
Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY
0
Entering edit mode

The same problem here and the colnames(diff_df) looks the same as above in my case. I am still getting object not found error. Any suggestions?

ADD REPLY
0
Entering edit mode

I think you should ask a new question and add a link to this question from your new question. That would get you better results as the visibility would be better.

ADD REPLY
0
Entering edit mode

what exactly is the problem? Can you update the post to include the exact code you are running & the error message?

ADD REPLY
1
Entering edit mode

Hi and thank you for sharing this.

I usually use Trinity package that will draw Volcano plot for its DEG analysis (because I am new in R).

I will show you the related script and would you please kindly check them and see that is there any of the Trinity files that I can import them in R for drawing a volcano plot same as what you have provided?

Thank you in advance:

  $TRINITY_HOME/Analysis/DifferentialExpression/run_DE_analysis.pl \
   --matrix counts.matrix --method DESeq2 --samples_file samples_described.txt

The output from running the DE analysis will reside in the output directory you specified, and if not, a default directory name that includes the name of the method used (ie. DESeq2/ ).

In this output directory, you'll find the following files for each of the pairwise comparisons performed:

1- ${prefix}.sampleA_vs_sampleB.${method}.DE_results : the DE analysis results,\ including log fold change and statistical significance (see FDR column).

2- ${prefix}.sampleA_vs_sampleB.${method}.MA_n_Volcano.pdf : MA and Volcano plots \ features found DE at FDR <0.05 will be colored red. Plots are shown\ with large (top) or small (bottom) points only for choice of aesthetics.

3- ${prefix}.sampleA_vs_sampleB.${method}.Rscript : the R-script executed to perform the DE analysis.

ADD REPLY
1
Entering edit mode

The files you are referencing are located on your computer and not accessible in this way through the Internet. However if you check in my post I have shown the data format needed here:

# preview the dataset; data required for the plot
head(diff_df)

##   external_gene_name  Fold      FDR
## 1      RP11-431K24.1 -4.13 1.04e-05
## 2              UBE4B  2.42 8.71e-06
## 3             UBIAD1  4.27 5.50e-06
## 4             UBIAD1  1.89 2.16e-04
## 5             UBIAD1  2.74 8.59e-05
## 6             UBIAD1  3.42 1.39e-08

What this means is that the code I posted should work with any data that is held in a text based format (.txt, .csv, .tsv file, for example) with the following 3 columns: gene, fold change, significance (p value, etc.). If you have data in a file like this, you can try to load it into R using a function such as read.delim() as I used here (my preferred function for this), or one of the other options listed here. Once you have the data loaded in R, all you need to do is select the columns with the relevant information and pass them to the different aspects of the code shown and it should work.

ADD REPLY
1
Entering edit mode

Dear steve, Hi and Thank you

The Trinity files has some more columns but I can make a new file only with the 3 columns you have kindly provided.

by the way, instead of "gene" there would be "Trinity IDs" in that file, but I think I can change some of the most biologically important annotated trinity IDs with the genes that blast has showed.

e.g. there are about 500k transcripts in the Trinity assembly and only 2000 of them are DEG and I just need to show 5 of them in each of my two condition in the volcano plot. What do you think about this approach?

NOTE: I knew that the files in my computer are not reachable by you and I just mentioned those names because I guessed that may be you are familiar with Trinity output files. sorry and Thank you again.

ADD REPLY
1
Entering edit mode

You can use R to subset the data as needed (what I showed here), or you can even do it without subsetting the data at all by passing the references to the entries. The file you want is probably the DE_results file. I would try to open that file in something like Excel or TextEdit to verify its format so you can figure out how to load it into R. In particular, you will need to find out what the separator between columns is (the 'delimiter'); this might not be visible from Excel but should be visible from a raw editor such as TextEdit or Notepad or Sublime, and corresponds to the sep argument of read.delim() or read.table(). If it is comma separated, it would be sep=',', if it is tab separated it would be sep='\t'

ADD REPLY
2
Entering edit mode

Dear Steve, With your step-by-step description I could draw the volcano plot for my data and it was very awesome and interesting. Thank you.

ADD REPLY
1
Entering edit mode

Dear steve, Hi.

Is there any way to show some genes other than the "five most DEG" on this volcano plot?

Because some times the top 5 genes are the genes related to some stress reactions or even disease and the

interesting biological DEGs situate at lover level of importance (e.g I have about 500 Upregulated DEGs for my

"male" condition and the important genes locate at the level of 2,, 52, 145, 302 positins on the DEG-result.matrix)

I hope I could describe the situation clearly.

ADD REPLY
1
Entering edit mode

yes, the way this example is set up, all you need to do is create a subset of the original dataframe that holds the genes you would like to label and use that in place of the top_peaks dataframe that I used here. The R code needed to obtain these entries will depend on your original data set. For example, if you don't mind hard-coding it and you know that you want entries 2, 52, 145, and 302, you could just do something like label_peaks <- diff_df[c(2, 52, 145, 302),] and pass this to the labeling list code.

ADD REPLY
0
Entering edit mode

Hi Steve,

Do you have the Rscript for this showing the up-and-down regulated transcripts using volcano? I already have pre-filtered data. My significant data is FDR<1e-5. My upregulated is log2FC>0 while my down is log2FC<0.

ADD REPLY
0
Entering edit mode

Not sure I understand the question, since the original post documents all the R code used in this example. The transcript regulation values were produced using DiffBind using the ChIP-Seq workflow from this pipeline

ADD REPLY
0
Entering edit mode

This looks nice. I understand that you can't render the interactive plot directly here but could you upload it somewhere and link it in your post ? So we can get a better idea of how interactive the plot truly is.

ADD REPLY
1
Entering edit mode

I put the HTML output in my DropBox and added a link. I am looking into trying to put the plot up on their public site. EDIT: Link replaced by this page which includes the interactive version of the plot.

ADD REPLY
0
Entering edit mode

After logging on dropbox, I had access to the html file. Its cool ! Thanks.

ADD REPLY
1
Entering edit mode

oh yes I think you can just click the 'X' on the login window that Dropbox gives and still access the file.

ADD REPLY
0
Entering edit mode

I have my dropbox open but don't know how to get the link from your dropbox account.

Just a simple quick question: does it mean that if I click on a dot on the plot, the gene name would show up and if I move the cursor elsewhere and the gene name disappears? That would be very cool.

ADD REPLY
0
Entering edit mode

Hi,

Yes that's how it is

ADD REPLY
0
Entering edit mode

I took down the Dropbox link and instead hosted the plot online at the URL listed in the post. The gene name will appear as you mouse over the points.

ADD REPLY
0
Entering edit mode

That looks incredibly cool!

I have something similar:

d <- read.table("mydata.txt", header = TRUE);
attach(d);
d$Significant <- ifelse(FDR < 0.25, "FDR < 0.25", "Not Sig")

ggplot(d, aes(x = logFC, y = -log10(PValue))) +
  geom_point(aes(color = Significant)) +
  scale_color_manual(values = c("blue", "grey")) +
  theme_bw(base_size = 16) +
  geom_text_repel(
    data = subset(d, (abs(logFC)>=4) & (FDR < 0.25)),
    aes(label = GENE_ID),
    size = 5,
    box.padding = unit(0.35, "lines"),
    point.padding = unit(0.3, "lines")
  )

The good part of it is that the gene names does not squeeze into each other and they look nice on the plot. The bad part is that it cannot do the interactive display. Suppose I just want to dynamically show the gene names when "mouse over", what should I do to change the code to do so?

Thanks much!

ADD REPLY
0
Entering edit mode

Dear Steve, Hi

I am recently receiving this error in my R-studio, would you please help

  • layout(annotations = a) Error in plot_ly(data = diff_df, x = Fold, y = -log10(FDR), text = external_gene_name, : object 'Fold' not found > p Error: ScalesList was built with an incompatible version of ggproto. Please reinstall the package that provides this extension.
ADD REPLY
1
Entering edit mode

Yes, that's an incompatibility error, we had the same with ggrepel and ggplot2, the best solution (so far, advice appreciated) is to crawl under your desk and cry.

ADD REPLY
0
Entering edit mode

Hi @WouterDeCoster,

But it was working well before !

Thanks for crawl and cry solution,

Is it any way to draw an non-interactive volcano plot with the same R script (maybe by removing some of its lines) ?

Do you have any experiences about THIS.

ADD REPLY
1
Entering edit mode

But it was working well before

Unfortunately I find myself in that situation often as well. Usually, some change in other installed system packages are to blame. For R, its frequently a change in the loaded version of gcc or zlib, though this may not be the case here.

The code posted in the OP here was designed specifically for Plotly, so you will not be able to use the same code verbatim for a base R plot, but since its essentially just a scatter plot it shouldn't need too many tweaks; my fallback would be the standard volcano plot code described here. If you want your plot to be available both in interactive and non-interactive forms from the same code, I would go with ggplot2 instead (see here), but of course that will not help if you are having ggplot2 compatibility issues.

ADD REPLY
0
Entering edit mode

Hi Stephen, this is amazingly helpful and is really helping with getting my head around R language for plotting.

I have a few ,what may be very simple questions. And feel free to point me towards other resources I should be reading.

1) How can I simply add a list of names that I specify from my gene name column, but maintain all the arrow information etc. I understand there must be a simple way of populating list a, with selected gene names. But I am too new to this to know how to do that.

2) is there a simple way of adding dashed lines to the graph where the cut-offs are.

3) is there a simple way to split the axis. I am thinking the y axis in my case, but am sure it would be similar for the x.

Thanks again,

It's been a great help already.

BW

ADD REPLY
1
Entering edit mode
  1. Are you referring to the hard-coded black arrows on the plot? In this example, that is coming from the top_peaks data frame; you can substitute this for any df that has the gene name & values, just tweak the for loop shown to put the Fold Change, FDR, and Gene Name in the x, y, and text fields respectively in each entry in the list being built and pass it to the plot layout.

2 & 3. I honestly do not know off the top of my head, since writing this example I have moved almost all of my plotting to use ggplot2 instead. I find it to be advantageous because it makes much nicer 2-D plots, and can be imported directly into Plotly without modification. It looks like the Plotly docs have moved here, which might show you how to accomplish something like this using Plotly directly as in this example, or you could try building a similar plot in ggplot2 and import it to Plotly that way.

ADD REPLY
0
Entering edit mode

Great advice, just spent the morning honing my ggplot2 skills.

Works nicely, thanks

ADD REPLY
0
Entering edit mode

Hello Steve and thank you very much.

I am a very new user to R and I am following your script to create a volcano plot of some proteomic data that I have. All the commands are being accepted by the software but just at the very end, it pops up some errors. These are types of errors like the following one:

> # make the Plot.ly plot
> p <- plot_ly(data = diff_df, x = Fold, y = -log10(p.value), text = external_gene_name, mode = "markers", color = "group") %>% 
+   layout(title ="DiffBind Volcano Plot")
Error in plot_ly(data = diff_df, x = Fold, y = -log10(p.value), text = external_gene_name,  : 
  object 'Fold' not found

So what I did was to modify it a little bit by adding a few "" and I manage to get a plot but it doen't look like a volcano. After my modifications the code is like that:

p <- plot_ly(data = diff_df, x = "Fold", y = "FDR", text = "external_gene_name", mode = "markers", color = "group") %>% 
  layout(title ="Volcano Plot") %>%
  layout(annotations = a)
p

Could you please help me with that? Thank you very much in advance.

Sincerely, Yiannis

ADD REPLY
0
Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY
0
Entering edit mode

x = Fold is referring to the name of a column in the data frame. object 'Fold' not found could mean that column is not present in diff_df. You might want to check head(diff_df) and colnames(diff_df) to make sure it resembles what I showed in the example, and that all the columns being used in the plotting function are present. If it still does not look the way you expected, posting a screenshot here may help.

ADD REPLY
0
Entering edit mode

I am also facing same issue. Please help:

input_file <- "C:/Users/dverma2/Desktop/TCGA_plots_data/volcanoplot/res1.csv"

diff_df <- read.delim(file = input_file,header = TRUE,sep = ',')

colnames(diff_df)
[1] "external_gene_name" "baseMean"           "Fold"               "lfcSE"              "stat"              
[6] "pvalue"             "FDR"               

 diff_df <- diff_df[c("external_gene_name", "Fold", "FDR")]
 head(diff_df)
  external_gene_name     Fold      FDR
1    ENSG00000225972 6.816421 5.63e-84
2    ENSG00000262902 4.775611 8.30e-47
3    ENSG00000230916 4.650652 6.43e-40
4    ENSG00000232177 3.747488 1.11e-35
5    ENSG00000229604 4.151809 6.06e-27
6    ENSG00000253239 4.365511 2.05e-22

diff_df["group"] <- "NotSignificant"
diff_df[which(diff_df['FDR'] < 0.05 & abs(diff_df['Fold']) < 1.5 ),"group"] <- "Significant"
diff_df[which(diff_df['FDR'] > 0.05 & abs(diff_df['Fold']) > 1.5 ),"group"] <- "FoldChange"
 diff_df[which(diff_df['FDR'] < 0.05 & abs(diff_df['Fold']) > 1.5 ),"group"] <- "Significant&FoldChange"
top_peaks <- diff_df[with(diff_df, order(Fold, FDR)),][1:5,]
top_peaks <- rbind(top_peaks, diff_df[with(diff_df, order(-Fold, FDR)),][1:5,])
a <- list()
 for (i in seq_len(nrow(top_peaks))) {
+   m <- top_peaks[i, ]
+   a[[i]] <- list(
+     x = m[["Fold"]],
+     y = -log10(m[["FDR"]]),
+     text = m[["external_gene_name"]],
+     xref = "x",
+     yref = "y",
+     showarrow = TRUE,
+     arrowhead = 0.5,
+     ax = 20,
+     ay = -40
+   )
+ }

p <- plot_ly(data = diff_df, x = Fold, y = -log10(FDR), text = external_gene_name, mode = "markers", color = group) %>% 
+   layout(title ="Volcano Plot") %>%
+   layout(annotations = a)
Error in plot_ly(data = diff_df, x = Fold, y = -log10(FDR), text = external_gene_name,  : 
  object 'Fold' not found
ADD REPLY
0
Entering edit mode

The error implies that there is no column called 'Fold' in your input data object.

If you still have issues, please try EnhancedVolcano in R / Bioconductor.

ADD REPLY
0
Entering edit mode

I dont have P Values, than how to calculate that.

ADD REPLY
0
Entering edit mode

Don't do this. If you have a question, open a new post. If you wish to add a comment, use ADD COMMENT.

ADD REPLY
0
Entering edit mode

How to export(save as excel file) upregulated gene from volcano plot and downregulated gene separately.

ADD REPLY
2
Entering edit mode

In R, you would use the write.table function to save a copy of the dataframe (diff_df in this case) after filtering it for the up and down regulated genes (examples here).

ADD REPLY
0
Entering edit mode

Dear Dr.Kelly,

I have executed the above script with the Gfold values (similar to logfold2 change). since we are working with non replicated data, we don't have a p value. Thus, I would like to have +fold change on x axis and -fold change on y axis. I am able to get the graph but a regular scatter plot. I get the error message as below: No trace type specified: Based on info supplied, a 'scatter' trace seems appropriate. Read more about this trace type -> https://plot.ly/r/reference/#scatter.

Kindly help me to trouble shoot.

-Dr. Megha H.S

ADD REPLY
0
Entering edit mode

Is there a straightforward way to change the colors in the grouping?

ADD REPLY
0
Entering edit mode

I am trying to run the same script but getting one error at the last step.

>  Rscript Volcano_Plot_Ploty.R 
[1] "Gene"   "Fold"   "pvalue"
[1] 524   3
     Gene     Fold  pvalue
1   LUZP1 -7.57373 0.00850
2 FAM71F1 -6.31787 0.00865
3    MBD4 -5.81446 0.00870
4  HOXD13  4.85073 0.02340
5  DNAJC2 -2.75493 0.03140
6    CHD4 -2.49614 0.05700
Error in plot_ly(data = diff_df, x = Fold, y = -log10(pvalue), text = Gene,  : 
  object 'Fold' not found
Calls: %>% -> eval -> eval -> plot_ly
Execution halted

any suggestion is most welcome.

ADD REPLY
1
Entering edit mode

sounds like your object diff_df is missing column 'Fold'. I would step through it line by line in Rstudio and double check that nothing is missing.

ADD REPLY
0
Entering edit mode

During the run of script, I am printing the data which u can see i have added with error msg which came on terminal. So clearly i m seeing 524 rows and 3 columns are present . what could be error since tried executing line by line and still facing same issue.

ADD REPLY
0
Entering edit mode

This is what i am trying to run

suppressPackageStartupMessages(library("plotly"))
diff_df <- read.delim(file = "Input_list_HF_LF.txt",header = TRUE,sep = '\t')
diff_df["group"] <- "NotSignificant"
diff_df[which(diff_df['pvalue'] < 0.05 & abs(diff_df['Fold']) < 1.5 ),"group"] <- "Significant"
diff_df[which(diff_df['pvalue'] > 0.05 & abs(diff_df['Fold']) > 1.5 ),"group"] <- "FoldChange"
diff_df[which(diff_df['pvalue'] < 0.05 & abs(diff_df['Fold']) > 1.5 ),"group"] <- "Significant&FoldChange"
top_peaks <- diff_df[with(diff_df, order(Fold, pvalue)),][1:5,]
top_peaks <- rbind(top_peaks, diff_df[with(diff_df, order(-Fold, pvalue)),][1:5,])
a <- list()
for (i in seq_len(nrow(top_peaks))) {
  m <- top_peaks[i, ]
  a[[i]] <- list(
    x = m[["Fold"]],
    y = -log10(m[["pvalue"]]),
    text = m[["Gene"]],
    xref = "x",
    yref = "y",
    showarrow = TRUE,
    arrowhead = 0.5,
    ax = 20,
    ay = -40
  )
}

p <- plot_ly(data = diff_df, x = Fold, y = -log10(pvalue), text = Gene, mode = "markers", color = group) %>% 
  layout(title ="Volcano Plot") %>%
  layout(annotations = a)

My file looks like this (tab separated, formatted for display below)

Gene                Fold        pvalue
LUZP1               -7.57373    0.0085
FAM71F1             -6.31787    0.00865
MBD4                -5.81446    0.0087
HOXD13              4.85073     0.0234
DNAJC2              -2.75493    0.0314
CHD4                -2.49614    0.057
EEA1                -5.73976    0.0623
ENSBTAG00000043567  -4.13538    0.07295
TMEM220             -2.10145    0.0914
NHLH2               -1.7285     0.10305

Please help me out

I am still facing same error.

ADD REPLY
1
Entering edit mode

It works for me after little modification. I have put "" and for the object Fold and FDR, I have done something like this:

p <- plot_ly(data = diff_df, x = diff_df$Fold, y = -log10(diff_df$FDR), text = "external_gene_name", mode = "markers", color = "group")
ADD REPLY
0
Entering edit mode
p <- plot_ly(data = diff_df, x = diff_df$Fold, y = -log10(diff_df$FDR), text = diff_df$external_gene_name, mode = "markers", color = diff_df$group) %>% layout(title = "Volcano plot") %>%
  layout(annotations = a) 

p
ADD REPLY
0
Entering edit mode

Please tidy your code with the 101 010 button

ADD REPLY
0
Entering edit mode

Like Kevin said, please tidy up your code. Also, why did you add this as an answer?

ADD REPLY
0
Entering edit mode

How do I specify colors of specific groups?

Also I've tried to label upregulated and downregulated groups separately but it doesn't work correctly:

This works:

diff_df[which(diff_df['log10pvalue'] > 1.30103 & abs(diff_df['logFC']) > 0 ),"group"] <- "Upregulated"

But this does not change values to "Downregulated":

diff_df[which(diff_df['log10pvalue'] > 1.30103 & abs(diff_df['logFC']) < 0 ),"group"] <- "Downregulated"
ADD REPLY
0
Entering edit mode

What do you mean "doesn't work correctly"? Are there any entries that match your downregulated conditions?

ADD REPLY
0
Entering edit mode

With the first command, all entries passing the log10pvalue threshold are labeled as upregulated (regardless of logFC value). However, when I try the second command, the hundreds of entries with negative logFC values in my dataset do not change grouping to downregulated and all remain as "Upregulated."

diff_df[which(diff_df['log10pvalue'] > 1.30103 & abs(diff_df['logFC']) < 0 ),"group"] <- "Downregulated"

head(diff_df)

     ID      logFC   Pvalue    log10pvalue    group
1  Foxr2  1.5446199 1.41e-86    85.81721   Upregulated
2  Irs4   1.5465623 1.72e-69    69.76447   Upregulated
3  Gapdh -2.1089791 1.44e-58    57.84264   Upregulated
4  Grpr  -3.5870946 5.22e-47    46.26600   Upregulated
5  Alx3   0.9618514 2.12e-39    38.66154   Upregulated
ADD REPLY

Login before adding your answer.

Traffic: 1090 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6