Entering edit mode
8.2 years ago
steve
★
3.5k
Experienced Bioinformaticians are probably familiar with the standard technique for creating volcano plots in R. Now there is a fun and interactive alternative available using the Plot.ly library. This code sample will demonstrate how to use this library to create an interactive plot.
To preview the full code + interactive plot, see my post here: https://stevekm.github.io/2016/09/24/Plot.ly-Volcano-Plot.html
I also have the same page listed in a Gist (minus the interactive plot output) here:
I have the same problem the colnames(diff_df) looks like normal to me:
and head(diff_df) looks like this:
I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:
The same problem here and the colnames(diff_df) looks the same as above in my case. I am still getting object not found error. Any suggestions?
I think you should ask a new question and add a link to this question from your new question. That would get you better results as the visibility would be better.
what exactly is the problem? Can you update the post to include the exact code you are running & the error message?
Hi and thank you for sharing this.
I usually use Trinity package that will draw Volcano plot for its DEG analysis (because I am new in R).
I will show you the related script and would you please kindly check them and see that is there any of the Trinity files that I can import them in R for drawing a volcano plot same as what you have provided?
Thank you in advance:
The output from running the DE analysis will reside in the output directory you specified, and if not, a default directory name that includes the name of the method used (ie. DESeq2/ ).
In this output directory, you'll find the following files for each of the pairwise comparisons performed:
1- ${prefix}.sampleA_vs_sampleB.${method}.DE_results : the DE analysis results,\ including log fold change and statistical significance (see FDR column).
2- ${prefix}.sampleA_vs_sampleB.${method}.MA_n_Volcano.pdf : MA and Volcano plots \ features found DE at FDR <0.05 will be colored red. Plots are shown\ with large (top) or small (bottom) points only for choice of aesthetics.
3- ${prefix}.sampleA_vs_sampleB.${method}.Rscript : the R-script executed to perform the DE analysis.
The files you are referencing are located on your computer and not accessible in this way through the Internet. However if you check in my post I have shown the data format needed here:
What this means is that the code I posted should work with any data that is held in a text based format (.txt, .csv, .tsv file, for example) with the following 3 columns: gene, fold change, significance (p value, etc.). If you have data in a file like this, you can try to load it into R using a function such as
read.delim()
as I used here (my preferred function for this), or one of the other options listed here. Once you have the data loaded in R, all you need to do is select the columns with the relevant information and pass them to the different aspects of the code shown and it should work.Dear steve, Hi and Thank you
The Trinity files has some more columns but I can make a new file only with the 3 columns you have kindly provided.
by the way, instead of "gene" there would be "Trinity IDs" in that file, but I think I can change some of the most biologically important annotated trinity IDs with the genes that blast has showed.
e.g. there are about 500k transcripts in the Trinity assembly and only 2000 of them are DEG and I just need to show 5 of them in each of my two condition in the volcano plot. What do you think about this approach?
NOTE: I knew that the files in my computer are not reachable by you and I just mentioned those names because I guessed that may be you are familiar with Trinity output files. sorry and Thank you again.
You can use R to subset the data as needed (what I showed here), or you can even do it without subsetting the data at all by passing the references to the entries. The file you want is probably the
DE_results
file. I would try to open that file in something like Excel or TextEdit to verify its format so you can figure out how to load it into R. In particular, you will need to find out what the separator between columns is (the 'delimiter'); this might not be visible from Excel but should be visible from a raw editor such as TextEdit or Notepad or Sublime, and corresponds to thesep
argument ofread.delim()
orread.table()
. If it is comma separated, it would besep=','
, if it is tab separated it would besep='\t'
Dear Steve, With your step-by-step description I could draw the volcano plot for my data and it was very awesome and interesting. Thank you.
Dear steve, Hi.
Is there any way to show some genes other than the "five most DEG" on this volcano plot?
Because some times the top 5 genes are the genes related to some stress reactions or even disease and the
interesting biological DEGs situate at lover level of importance (e.g I have about 500 Upregulated DEGs for my
"male" condition and the important genes locate at the level of 2,, 52, 145, 302 positins on the DEG-result.matrix)
I hope I could describe the situation clearly.
yes, the way this example is set up, all you need to do is create a subset of the original dataframe that holds the genes you would like to label and use that in place of the
top_peaks
dataframe that I used here. The R code needed to obtain these entries will depend on your original data set. For example, if you don't mind hard-coding it and you know that you want entries 2, 52, 145, and 302, you could just do something likelabel_peaks <- diff_df[c(2, 52, 145, 302),]
and pass this to the labeling list code.Hi Steve,
Do you have the Rscript for this showing the up-and-down regulated transcripts using volcano? I already have pre-filtered data. My significant data is FDR<1e-5. My upregulated is log2FC>0 while my down is log2FC<0.
Not sure I understand the question, since the original post documents all the R code used in this example. The transcript regulation values were produced using DiffBind using the ChIP-Seq workflow from this pipeline
This looks nice. I understand that you can't render the interactive plot directly here but could you upload it somewhere and link it in your post ? So we can get a better idea of how interactive the plot truly is.
I put the HTML output in my DropBox and added a link. I am looking into trying to put the plot up on their public site. EDIT: Link replaced by this page which includes the interactive version of the plot.
After logging on dropbox, I had access to the html file. Its cool ! Thanks.
oh yes I think you can just click the 'X' on the login window that Dropbox gives and still access the file.
I have my dropbox open but don't know how to get the link from your dropbox account.
Just a simple quick question: does it mean that if I click on a dot on the plot, the gene name would show up and if I move the cursor elsewhere and the gene name disappears? That would be very cool.
Hi,
Yes that's how it is
I took down the Dropbox link and instead hosted the plot online at the URL listed in the post. The gene name will appear as you mouse over the points.
That looks incredibly cool!
I have something similar:
The good part of it is that the gene names does not squeeze into each other and they look nice on the plot. The bad part is that it cannot do the interactive display. Suppose I just want to dynamically show the gene names when "mouse over", what should I do to change the code to do so?
Thanks much!
Dear Steve, Hi
I am recently receiving this error in my R-studio, would you please help
Yes, that's an incompatibility error, we had the same with ggrepel and ggplot2, the best solution (so far, advice appreciated) is to crawl under your desk and cry.
Hi @WouterDeCoster,
But it was working well before !
Thanks for crawl and cry solution,
Is it any way to draw an non-interactive volcano plot with the same R script (maybe by removing some of its lines) ?
Do you have any experiences about THIS.
Unfortunately I find myself in that situation often as well. Usually, some change in other installed system packages are to blame. For R, its frequently a change in the loaded version of
gcc
orzlib
, though this may not be the case here.The code posted in the OP here was designed specifically for Plotly, so you will not be able to use the same code verbatim for a base R plot, but since its essentially just a scatter plot it shouldn't need too many tweaks; my fallback would be the standard volcano plot code described here. If you want your plot to be available both in interactive and non-interactive forms from the same code, I would go with
ggplot2
instead (see here), but of course that will not help if you are having ggplot2 compatibility issues.Hi Stephen, this is amazingly helpful and is really helping with getting my head around R language for plotting.
I have a few ,what may be very simple questions. And feel free to point me towards other resources I should be reading.
1) How can I simply add a list of names that I specify from my gene name column, but maintain all the arrow information etc. I understand there must be a simple way of populating list a, with selected gene names. But I am too new to this to know how to do that.
2) is there a simple way of adding dashed lines to the graph where the cut-offs are.
3) is there a simple way to split the axis. I am thinking the y axis in my case, but am sure it would be similar for the x.
Thanks again,
It's been a great help already.
BW
top_peaks
data frame; you can substitute this for any df that has the gene name & values, just tweak thefor
loop shown to put the Fold Change, FDR, and Gene Name in thex
,y
, andtext
fields respectively in each entry in the list being built and pass it to the plot layout.2 & 3. I honestly do not know off the top of my head, since writing this example I have moved almost all of my plotting to use
ggplot2
instead. I find it to be advantageous because it makes much nicer 2-D plots, and can be imported directly into Plotly without modification. It looks like the Plotly docs have moved here, which might show you how to accomplish something like this using Plotly directly as in this example, or you could try building a similar plot inggplot2
and import it to Plotly that way.Great advice, just spent the morning honing my ggplot2 skills.
Works nicely, thanks
Hello Steve and thank you very much.
I am a very new user to R and I am following your script to create a volcano plot of some proteomic data that I have. All the commands are being accepted by the software but just at the very end, it pops up some errors. These are types of errors like the following one:
So what I did was to modify it a little bit by adding a few "" and I manage to get a plot but it doen't look like a volcano. After my modifications the code is like that:
Could you please help me with that? Thank you very much in advance.
Sincerely, Yiannis
I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:
x = Fold
is referring to the name of a column in the data frame.object 'Fold' not found
could mean that column is not present indiff_df
. You might want to checkhead(diff_df)
andcolnames(diff_df)
to make sure it resembles what I showed in the example, and that all the columns being used in the plotting function are present. If it still does not look the way you expected, posting a screenshot here may help.I am also facing same issue. Please help:
The error implies that there is no column called 'Fold' in your input data object.
If you still have issues, please try EnhancedVolcano in R / Bioconductor.
I dont have P Values, than how to calculate that.
Don't do this. If you have a question, open a new post. If you wish to add a comment, use
ADD COMMENT
.How to export(save as excel file) upregulated gene from volcano plot and downregulated gene separately.
In R, you would use the
write.table
function to save a copy of the dataframe (diff_df
in this case) after filtering it for the up and down regulated genes (examples here).Dear Dr.Kelly,
I have executed the above script with the Gfold values (similar to logfold2 change). since we are working with non replicated data, we don't have a p value. Thus, I would like to have +fold change on x axis and -fold change on y axis. I am able to get the graph but a regular scatter plot. I get the error message as below: No trace type specified: Based on info supplied, a 'scatter' trace seems appropriate. Read more about this trace type -> https://plot.ly/r/reference/#scatter.
Kindly help me to trouble shoot.
-Dr. Megha H.S
Is there a straightforward way to change the colors in the grouping?
I am trying to run the same script but getting one error at the last step.
any suggestion is most welcome.
sounds like your object
diff_df
is missing column 'Fold'. I would step through it line by line in Rstudio and double check that nothing is missing.During the run of script, I am printing the data which u can see i have added with error msg which came on terminal. So clearly i m seeing 524 rows and 3 columns are present . what could be error since tried executing line by line and still facing same issue.
This is what i am trying to run
My file looks like this (tab separated, formatted for display below)
Please help me out
I am still facing same error.
It works for me after little modification. I have put "" and for the object Fold and FDR, I have done something like this:
Please tidy your code with the
101 010
buttonLike Kevin said, please tidy up your code. Also, why did you add this as an answer?
How do I specify colors of specific groups?
Also I've tried to label upregulated and downregulated groups separately but it doesn't work correctly:
This works:
But this does not change values to "Downregulated":
What do you mean "doesn't work correctly"? Are there any entries that match your downregulated conditions?
With the first command, all entries passing the log10pvalue threshold are labeled as upregulated (regardless of logFC value). However, when I try the second command, the hundreds of entries with negative logFC values in my dataset do not change grouping to downregulated and all remain as "Upregulated."