How to show both edgeR and deseq2 results in a single volcano plot; highliting overlaps
2
4
Entering edit mode
6.7 years ago
Farbod ★ 3.4k

Dear Biostars, Hi.

I there any way to show the edgeR and DESeq2 DEG results in one single volcano, AND showing their overlaps in a different colour (e.g. Non-significant= blue, significant = red, significant overlapped between both package= orange) ?

Note: This is the code I usually use for drawing my volcano plot in Rsudio but I only can feed one counts.matrix file to it, each time.

library(ggplot2)
data <- read.table("trans.counts.matrix.J_vs_M.edgeR.DE_results", header=TRUE)
##Identify the genes that have a FDR < 0.001
data$threshold = as.factor(data$FDR < 0.001)
##Construct the plot object
g <- ggplot(data=data, 
            aes(x=logFC, y =-log10(FDR), 
                colour=threshold)) +

  geom_point(alpha=0.4, size=3) +
  xlim(c(-14, 14)) +
  xlab("log2 fold change") + ylab("-log10 FDR") +
  theme_bw() +
  scale_color_manual(values = c("blue", "red")) +
  theme(legend.position="none")
g
RNA-Seq R DEG volcano plot • 5.8k views
ADD COMMENT
0
Entering edit mode

Please provide example data.

ADD REPLY
0
Entering edit mode

Dear zx8754. Hi and thank you for your help.

I did not get the point clearly about "example data". If you mean counts.matrix structure, it could be any counts, but the head of my "data" in the code above is as below"

> head (data)
         external_gene_name     logFC   logCPM      P.Value          FDR threshold
1 TRINITY_DN179827_c0_g2_i4 -13.87327 4.618556 3.038112e-30 1.878978e-24      TRUE
2 TRINITY_DN194528_c5_g3_i4  13.26971 4.016180 1.219369e-27 3.770709e-22      TRUE
3 TRINITY_DN192883_c9_g2_i2  11.22692 3.821780 3.201530e-26 6.600158e-21      TRUE
4 TRINITY_DN185936_c0_g1_i1  12.85547 3.603356 2.980189e-24 4.607886e-19      TRUE
5 TRINITY_DN180711_c0_g1_i1 -11.87279 2.627359 1.814995e-20 2.245037e-15      TRUE
6 TRINITY_DN197264_c6_g1_i6  10.32434 2.894909 2.161066e-19 2.227587e-14      TRUE
ADD REPLY
0
Entering edit mode

For a given gene, how will you illustrate the connection between its result in DESeq and it's (dependent) result in edgeR?

ADD REPLY
0
Entering edit mode

A gene A in edgeR and DESeq will get twice a logFC and a p-value, so you cannot plot these at the same time using a volcano plot.

ADD REPLY
2
Entering edit mode

yes you can, you just plot two different points for the same gene. Whether it's meaningful to do so is up for discussion

ADD REPLY
0
Entering edit mode

Dear russhh and Wouter, Hi. Here is the problem

I have done DEG analysis by both edgeR and DESeq2 and I checked the overlaps of both package in my both conditions (condition1 and condition2) using a venn diagram and I tried to annotate the common/overlap transcripts reported in both exact test and GLM method. I also checked the results using SARTools. So, now I have two volcano for edgeR and DESeq2 that have about 30% overlap of DEGs in condition1 and 50% overlap in condition2.

I was wondering if I can show all these in a single volcano plot using 3 colours.

ADD REPLY
0
Entering edit mode

sorry, is it two different contrasts each of which has been tested using both edgeR and DESeq2?

ADD REPLY
0
Entering edit mode

I have used the same data and same expressin counts resulted from Trinity for condition1 and condition2 for both edgeR and DESeq2 and focused on the overlaps base on this idea if multiple programs give you the same results, then you can be confident that those results do not depend on the particular assumptions that are made by the programs/statistic tests.

ADD REPLY
0
Entering edit mode

I think the two methods are a bit too similar for that to be appropriate; I'd agree if you were comparing voom and DESeq2. I'd agree with you if you'd subsampled either the counts or the exons and ran the two methods on the two partitions. Not sure if the overlayed volcano would be of value to help choose between the methods - I'd rather see two MA plots stacked next to each other

ADD REPLY
0
Entering edit mode

Good idea! I will check and compare MA plots, too. Aren't MA plots similar to Volcano plots, in a whole perspective?

ADD REPLY
0
Entering edit mode

Not really, it will show how your diffexes vary across different levels of average expression for the two different methods

ADD REPLY
0
Entering edit mode

Okay you can, but no, I don't consider it meaningful :)

When plotting, you need to think what the message is that you want to give. In this case, that's quite unclear.

ADD REPLY
0
Entering edit mode

If you are asking me,

by this approach you (I) can first change the way of visualising the results (converting two different volcano and a venn diagram to only one volcano)

and also make it clear for the readers that the core DEGs you have chosen in the most conservative approach are significantly differentially expressed in both two statistical packages.

ADD REPLY
5
Entering edit mode
6.7 years ago

Try this in basic plotting: (note: I created two dataframes with 100 genes and each dataframe shares 10 common genes with identical FDR and logFC). Common genes are colored in red . and labelled, rest in green and light green.

output: Rplot01

input:

set.seed(100)
# Create a dataframe by name edge
edge = data.frame(
  gene = paste0("gene",sample(100)),
  logFC = c(rnorm(80,0,1),rnorm(20,0,12)),
  logFDR = rnorm(100,mean=0.05, sd=0.02)
)
# Create a dataframe by name dsq
dsq =data.frame(
  gene = paste0("gene",sample(100)),
  logFC = c(rnorm(70,0,1),rnorm(30,0,12)),
  logFDR = rnorm(100,mean=0.05, sd=0.01)
)

set.seed(200)
## Select random 10 genes from edge and their FC and FDR values

cf=edge[abs(edge$logFC)>2,][sample(nrow(edge[abs(edge$logFC)>2,]),10),]
#View(cf)

## Replace same genes in dsq with values above so that edge and dsq share 10 genes with identical FC and FDR.
dsq[dsq$gene %in% cf$gene,]=cf

## sort the dataframes
edge.sorted=edge[with(edge,order(gene)),]
dsq.sorted=dsq[with(dsq,order(gene)),]

## plot first dataframe
plot(
  x = edge.sorted$logFC,
  y = edge.sorted$logFDR,
  col = "darkgreen",
  pch = 16,
  cex=2,
  xlab="Log(2) Fold Change",
  ylab="FDR",
  abline(v=c(-2,2),h=c(0,0.05), col="red", lty=3,lwd=3)
)
## plot second data frame over first plot
points(
  x = dsq.sorted$logFC,
  y = dsq.sorted$logFDR,
  col = "green",
  pch = 16,
  cex=2
)
## Highlight points of interest
points(
    x = edge.sorted$logFC[edge.sorted$logFC == dsq.sorted$logFC],
    y = edge.sorted$logFDR[edge.sorted$logFDR == dsq.sorted$logFDR],
  col = "red",
  pch = 16,
  cex=2
)
## Add labels to points of interest
text(
  x = edge.sorted$logFC[edge.sorted$logFC == dsq.sorted$logFC],
  y = edge.sorted$logFDR[edge.sorted$logFDR == dsq.sorted$logFDR],
  edge.sorted$gene[edge.sorted$logFC == dsq.sorted$logFC] ,
  cex = 2,
  pos=1,
  col = "red"
)

in ggplot same (recycled code from @russh for merged dataframe creation, dataframe is code from above post):

output: Rplot01 input:

# Load libraries
library(dplyr)
library(ggplot2)

# Merge data frames by gene name
dfm = merge(edge.sorted, dsq.sorted, by = "gene")

# Create a data frame for common genes by logFC and logFDR
dfm1 = dfm[dfm$logFC.x == dfm$logFC.y & dfm$logFDR.x == dfm$logFDR.y, ]
dfm1
head(dfm)
head(dfm1)
## plot
ggplot(dfm) +
  geom_point(
    data = dfm,
    aes(x = logFC.x, y = logFDR.x),
    color = "green",
    cex = 3
  ) +
  geom_point(
    data = dfm,
    aes(x = logFC.y, y = logFDR.y),
    color = "lightgreen",
    cex = 3
  ) +
  geom_point(
    data = dfm1,
    aes(x = logFC.x, y = logFDR.x),
    color = "blue",
    cex = 3
  ) +
  geom_text(
    data = dfm1,
    aes(x = logFC.x, y = logFDR.x, label = gene),
    hjust = 1,
    vjust = 2
  ) +
  theme_bw() +
  xlab("Log(2) fold change") +
  ylab("FDR") +
  geom_vline(
    xintercept = 2,
    col = "red",
    linetype = "dotted",
    size = 1
  ) +
  geom_vline(
    xintercept = -2,
    col = "red",
    linetype = "dotted",
    size = 1
  ) +
  geom_hline(
    yintercept = 0.05,
    col = "red",
    linetype = "dotted",
    size = 1
  )
ADD COMMENT
0
Entering edit mode

sure @ admin

ADD REPLY
1
Entering edit mode

Great coding as usual cpad

ADD REPLY
1
Entering edit mode

I don't quite understand the filtering. Surely it's extremely unlikely that the fold-changes and the FDRs match between the DESeq and the edgeR runs isn't it?

ADD REPLY
1
Entering edit mode

It is very unlikely that FDRs and FCs match between ER and DS2 exactly. However, above one is simulated data, not a true data set. However OP wants to over lay. Probably OP has a way to scale the results to comparable results (just my guess). (IMO)

ADD REPLY
0
Entering edit mode

Dear rushhh You are right,

the FDRs and FC are not the same numbers in two approaches, but the order of IDs are somehow the same. first DEG in cond1 in deseq2 is same as first DEG in cond1 in edgeR and so on.

is it helpful?

ADD REPLY
0
Entering edit mode

Thanks @Kevin

ADD REPLY
0
Entering edit mode

I moved your comment to an answer, could you please merge your last comment to this answer?

ADD REPLY
0
Entering edit mode

Dear cpad0112, Hi and thank you.

I will try your valuable code and ask some more question then, as R is always not so friendly to me.

ADD REPLY
2
Entering edit mode
6.7 years ago
russhh 5.7k

If it were me, I'd link the edgeR and DESeq results using coloured edges, rather than colour the points:

edg <- data.frame(
    gene_name = letters[1:3],
    logFC = rnorm(3),
    logFDR = c(-10, -3, -1),
    threshold = c(T, T, F)
    )

dsq <- data.frame(
    gene_name = letters[1:4],
    logFC = rnorm(4),
    logFDR = c(-8, -4, -2, -5),
    threshold = c(T, T, F, T)
    )

merge(edg, dsq, by = "gene_name") %>%
    filter(threshold.x & threshold.y) %>%
    ggplot() +
    geom_segment(aes(x = logFC.x, y = logFDR.x, xend = logFC.y, yend = logFDR.y, group = gene_name), col = "orange") +
    geom_point(data = edg, aes(x = logFC, y = logFDR), alpha = 0.1) +
    geom_point(data = dsq, aes(x = logFC, y = logFDR), alpha = 0.1)
ADD COMMENT
0
Entering edit mode

Thank you for your valuable script. I will try it and share the results/problem with you then.

ADD REPLY
0
Entering edit mode

I received this error: Error in eval(expr, envir, enclos) : object 'logFC.x' not found

ADD REPLY
0
Entering edit mode

@Farbod: I guess you need to load libraries dplyr and ggplot2. On the top of the code, add

library(ggplot2)
library(dplyr)
ADD REPLY
1
Entering edit mode

@Farbod The issue you're having is that merging your edgeR and DESeq2 results did not result in a dataframe with the same colmn names as in my example. This could have been remedied if you'd posted some reproducible example data (this is good practise on any Q&A site). You can reproduce my code if you make a subdataframe from each of your DESeq2 and edgeR results that contains the columns gene_name, logFC, logFDR and threshold; use dplyr::select and dplyr::rename to achieve this

ADD REPLY

Login before adding your answer.

Traffic: 3310 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6