Question

How to show both edgeR and deseq2 results in a single volcano plot; highliting overlaps

4

Entering edit mode

7.2 years ago

Farbod ★ 3.4k

Dear Biostars, Hi.

I there any way to show the edgeR and DESeq2 DEG results in one single volcano, AND showing their overlaps in a different colour (e.g. Non-significant= blue, significant = red, significant overlapped between both package= orange) ?

Note: This is the code I usually use for drawing my volcano plot in Rsudio but I only can feed one counts.matrix file to it, each time.

library(ggplot2)
data <- read.table("trans.counts.matrix.J_vs_M.edgeR.DE_results", header=TRUE)
##Identify the genes that have a FDR < 0.001
data$threshold = as.factor(data$FDR < 0.001)
##Construct the plot object
g <- ggplot(data=data, 
            aes(x=logFC, y =-log10(FDR), 
                colour=threshold)) +

  geom_point(alpha=0.4, size=3) +
  xlim(c(-14, 14)) +
  xlab("log2 fold change") + ylab("-log10 FDR") +
  theme_bw() +
  scale_color_manual(values = c("blue", "red")) +
  theme(legend.position="none")
g

RNA-Seq R DEG volcano plot • 6.8k views

ADD COMMENT • link updated 6.1 years ago by zx8754 12k • written 7.2 years ago by Farbod ★ 3.4k

0

Entering edit mode

Please provide example data.

ADD REPLY • link 7.2 years ago by zx8754 12k

0

Entering edit mode

Dear zx8754. Hi and thank you for your help.

I did not get the point clearly about "example data". If you mean counts.matrix structure, it could be any counts, but the head of my "data" in the code above is as below"

> head (data)
         external_gene_name     logFC   logCPM      P.Value          FDR threshold
1 TRINITY_DN179827_c0_g2_i4 -13.87327 4.618556 3.038112e-30 1.878978e-24      TRUE
2 TRINITY_DN194528_c5_g3_i4  13.26971 4.016180 1.219369e-27 3.770709e-22      TRUE
3 TRINITY_DN192883_c9_g2_i2  11.22692 3.821780 3.201530e-26 6.600158e-21      TRUE
4 TRINITY_DN185936_c0_g1_i1  12.85547 3.603356 2.980189e-24 4.607886e-19      TRUE
5 TRINITY_DN180711_c0_g1_i1 -11.87279 2.627359 1.814995e-20 2.245037e-15      TRUE
6 TRINITY_DN197264_c6_g1_i6  10.32434 2.894909 2.161066e-19 2.227587e-14      TRUE

ADD REPLY • link 7.2 years ago by Farbod ★ 3.4k

0

Entering edit mode

For a given gene, how will you illustrate the connection between its result in DESeq and it's (dependent) result in edgeR?

ADD REPLY • link 7.2 years ago by russhh 5.8k

0

Entering edit mode

A gene A in edgeR and DESeq will get twice a logFC and a p-value, so you cannot plot these at the same time using a volcano plot.

ADD REPLY • link 7.2 years ago by WouterDeCoster 48k

2

Entering edit mode

yes you can, you just plot two different points for the same gene. Whether it's meaningful to do so is up for discussion

ADD REPLY • link 7.2 years ago by russhh 5.8k

0

Entering edit mode

Dear russhh and Wouter, Hi. Here is the problem

I have done DEG analysis by both edgeR and DESeq2 and I checked the overlaps of both package in my both conditions (condition1 and condition2) using a venn diagram and I tried to annotate the common/overlap transcripts reported in both exact test and GLM method. I also checked the results using SARTools. So, now I have two volcano for edgeR and DESeq2 that have about 30% overlap of DEGs in condition1 and 50% overlap in condition2.

I was wondering if I can show all these in a single volcano plot using 3 colours.

ADD REPLY • link 7.2 years ago by Farbod ★ 3.4k

0

Entering edit mode

sorry, is it two different contrasts each of which has been tested using both edgeR and DESeq2?

ADD REPLY • link 7.2 years ago by russhh 5.8k

0

Entering edit mode

I have used the same data and same expressin counts resulted from Trinity for condition1 and condition2 for both edgeR and DESeq2 and focused on the overlaps base on this idea if multiple programs give you the same results, then you can be confident that those results do not depend on the particular assumptions that are made by the programs/statistic tests.

ADD REPLY • link 7.2 years ago by Farbod ★ 3.4k

0

Entering edit mode

I think the two methods are a bit too similar for that to be appropriate; I'd agree if you were comparing voom and DESeq2. I'd agree with you if you'd subsampled either the counts or the exons and ran the two methods on the two partitions. Not sure if the overlayed volcano would be of value to help choose between the methods - I'd rather see two MA plots stacked next to each other

ADD REPLY • link 7.2 years ago by russhh 5.8k

0

Entering edit mode

Good idea! I will check and compare MA plots, too. Aren't MA plots similar to Volcano plots, in a whole perspective?

ADD REPLY • link 7.2 years ago by Farbod ★ 3.4k

0

Entering edit mode

Not really, it will show how your diffexes vary across different levels of average expression for the two different methods

ADD REPLY • link 7.2 years ago by russhh 5.8k

0

Entering edit mode

Okay you can, but no, I don't consider it meaningful :)

When plotting, you need to think what the message is that you want to give. In this case, that's quite unclear.

ADD REPLY • link 7.2 years ago by WouterDeCoster 48k

0

Entering edit mode

If you are asking me,

by this approach you (I) can first change the way of visualising the results (converting two different volcano and a venn diagram to only one volcano)

and also make it clear for the readers that the core DEGs you have chosen in the most conservative approach are significantly differentially expressed in both two statistical packages.

ADD REPLY • link 7.2 years ago by Farbod ★ 3.4k

score 5 · Answer 1 · 2018-05-10

Try this in basic plotting: (note: I created two dataframes with 100 genes and each dataframe shares 10 common genes with identical FDR and logFC). Common genes are colored in red . and labelled, rest in green and light green.

output:

input:

set.seed(100)
# Create a dataframe by name edge
edge = data.frame(
  gene = paste0("gene",sample(100)),
  logFC = c(rnorm(80,0,1),rnorm(20,0,12)),
  logFDR = rnorm(100,mean=0.05, sd=0.02)
)
# Create a dataframe by name dsq
dsq =data.frame(
  gene = paste0("gene",sample(100)),
  logFC = c(rnorm(70,0,1),rnorm(30,0,12)),
  logFDR = rnorm(100,mean=0.05, sd=0.01)
)

set.seed(200)
## Select random 10 genes from edge and their FC and FDR values

cf=edge[abs(edge$logFC)>2,][sample(nrow(edge[abs(edge$logFC)>2,]),10),]
#View(cf)

## Replace same genes in dsq with values above so that edge and dsq share 10 genes with identical FC and FDR.
dsq[dsq$gene %in% cf$gene,]=cf

## sort the dataframes
edge.sorted=edge[with(edge,order(gene)),]
dsq.sorted=dsq[with(dsq,order(gene)),]

## plot first dataframe
plot(
  x = edge.sorted$logFC,
  y = edge.sorted$logFDR,
  col = "darkgreen",
  pch = 16,
  cex=2,
  xlab="Log(2) Fold Change",
  ylab="FDR",
  abline(v=c(-2,2),h=c(0,0.05), col="red", lty=3,lwd=3)
)
## plot second data frame over first plot
points(
  x = dsq.sorted$logFC,
  y = dsq.sorted$logFDR,
  col = "green",
  pch = 16,
  cex=2
)
## Highlight points of interest
points(
    x = edge.sorted$logFC[edge.sorted$logFC == dsq.sorted$logFC],
    y = edge.sorted$logFDR[edge.sorted$logFDR == dsq.sorted$logFDR],
  col = "red",
  pch = 16,
  cex=2
)
## Add labels to points of interest
text(
  x = edge.sorted$logFC[edge.sorted$logFC == dsq.sorted$logFC],
  y = edge.sorted$logFDR[edge.sorted$logFDR == dsq.sorted$logFDR],
  edge.sorted$gene[edge.sorted$logFC == dsq.sorted$logFC] ,
  cex = 2,
  pos=1,
  col = "red"
)

in ggplot same (recycled code from @russh for merged dataframe creation, dataframe is code from above post):

output: input:

# Load libraries
library(dplyr)
library(ggplot2)

# Merge data frames by gene name
dfm = merge(edge.sorted, dsq.sorted, by = "gene")

# Create a data frame for common genes by logFC and logFDR
dfm1 = dfm[dfm$logFC.x == dfm$logFC.y & dfm$logFDR.x == dfm$logFDR.y, ]
dfm1
head(dfm)
head(dfm1)
## plot
ggplot(dfm) +
  geom_point(
    data = dfm,
    aes(x = logFC.x, y = logFDR.x),
    color = "green",
    cex = 3
  ) +
  geom_point(
    data = dfm,
    aes(x = logFC.y, y = logFDR.y),
    color = "lightgreen",
    cex = 3
  ) +
  geom_point(
    data = dfm1,
    aes(x = logFC.x, y = logFDR.x),
    color = "blue",
    cex = 3
  ) +
  geom_text(
    data = dfm1,
    aes(x = logFC.x, y = logFDR.x, label = gene),
    hjust = 1,
    vjust = 2
  ) +
  theme_bw() +
  xlab("Log(2) fold change") +
  ylab("FDR") +
  geom_vline(
    xintercept = 2,
    col = "red",
    linetype = "dotted",
    size = 1
  ) +
  geom_vline(
    xintercept = -2,
    col = "red",
    linetype = "dotted",
    size = 1
  ) +
  geom_hline(
    yintercept = 0.05,
    col = "red",
    linetype = "dotted",
    size = 1
  )

score 2 · Answer 2 · 2018-05-09

2

Entering edit mode

7.2 years ago

russhh 5.8k

If it were me, I'd link the edgeR and DESeq results using coloured edges, rather than colour the points:

edg <- data.frame(
    gene_name = letters[1:3],
    logFC = rnorm(3),
    logFDR = c(-10, -3, -1),
    threshold = c(T, T, F)
    )

dsq <- data.frame(
    gene_name = letters[1:4],
    logFC = rnorm(4),
    logFDR = c(-8, -4, -2, -5),
    threshold = c(T, T, F, T)
    )

merge(edg, dsq, by = "gene_name") %>%
    filter(threshold.x & threshold.y) %>%
    ggplot() +
    geom_segment(aes(x = logFC.x, y = logFDR.x, xend = logFC.y, yend = logFDR.y, group = gene_name), col = "orange") +
    geom_point(data = edg, aes(x = logFC, y = logFDR), alpha = 0.1) +
    geom_point(data = dsq, aes(x = logFC, y = logFDR), alpha = 0.1)

ADD COMMENT • link 7.2 years ago by russhh 5.8k

0

Entering edit mode

Thank you for your valuable script. I will try it and share the results/problem with you then.

ADD REPLY • link 7.2 years ago by Farbod ★ 3.4k

0

Entering edit mode

I received this error: Error in eval(expr, envir, enclos) : object 'logFC.x' not found

ADD REPLY • link 7.2 years ago by Farbod ★ 3.4k

0

Entering edit mode

@Farbod: I guess you need to load libraries dplyr and ggplot2. On the top of the code, add

library(ggplot2)
library(dplyr)

ADD REPLY • link 7.2 years ago by cpad0112 21k

1

Entering edit mode

@Farbod The issue you're having is that merging your edgeR and DESeq2 results did not result in a dataframe with the same colmn names as in my example. This could have been remedied if you'd posted some reproducible example data (this is good practise on any Q&A site). You can reproduce my code if you make a subdataframe from each of your DESeq2 and edgeR results that contains the columns gene_name, logFC, logFDR and threshold; use dplyr::select and dplyr::rename to achieve this

ADD REPLY • link 7.2 years ago by russhh 5.8k