Try this with base plotting. (Note: I created two data frames of 100 genes each; they share 10 common genes with identical FDR and logFC.) The common genes are coloured red and labelled; the rest are dark green (edge) and green (dsq).
output:
input:
set.seed(100)
# Create a dataframe by name edge
edge = data.frame(
gene = paste0("gene",sample(100)),
logFC = c(rnorm(80,0,1),rnorm(20,0,12)),
logFDR = rnorm(100,mean=0.05, sd=0.02)
)
# Create a dataframe by name dsq
dsq =data.frame(
gene = paste0("gene",sample(100)),
logFC = c(rnorm(70,0,1),rnorm(30,0,12)),
logFDR = rnorm(100,mean=0.05, sd=0.01)
)
set.seed(200)
## Select 10 random genes with |logFC| > 2 from edge, together with their FC and FDR values
big=edge[abs(edge$logFC)>2,]
cf=big[sample(nrow(big),10),]
#View(cf)
## Replace same genes in dsq with values above so that edge and dsq share 10 genes with identical FC and FDR.
dsq[dsq$gene %in% cf$gene,]=cf
## sort the dataframes
edge.sorted=edge[with(edge,order(gene)),]
dsq.sorted=dsq[with(dsq,order(gene)),]
## plot first dataframe
plot(
x = edge.sorted$logFC,
y = edge.sorted$logFDR,
col = "darkgreen",
pch = 16,
cex=2,
xlab="Log(2) Fold Change",
ylab="FDR"
)
## add threshold lines (called after plot() rather than passed to it as an unnamed argument)
abline(v=c(-2,2),h=c(0,0.05), col="red", lty=3,lwd=3)
## plot second data frame over first plot
points(
x = dsq.sorted$logFC,
y = dsq.sorted$logFDR,
col = "green",
pch = 16,
cex=2
)
## Rows line up because both data frames are sorted by gene;
## a gene is common when both its logFC and its logFDR match
common = edge.sorted$logFC == dsq.sorted$logFC & edge.sorted$logFDR == dsq.sorted$logFDR
## Highlight points of interest
points(
x = edge.sorted$logFC[common],
y = edge.sorted$logFDR[common],
col = "red",
pch = 16,
cex=2
)
## Add labels to points of interest
text(
x = edge.sorted$logFC[common],
y = edge.sorted$logFDR[common],
labels = edge.sorted$gene[common],
cex = 2,
pos=1,
col = "red"
)
The same in ggplot2 (the merged data frame creation is recycled from @russh's code; the data frames come from the code above). Here the common genes are drawn in blue:
output:
input:
# Load libraries
library(dplyr)
library(ggplot2)
# Merge data frames by gene name
dfm = merge(edge.sorted, dsq.sorted, by = "gene")
# Create a data frame for common genes by logFC and logFDR
dfm1 = dfm[dfm$logFC.x == dfm$logFC.y & dfm$logFDR.x == dfm$logFDR.y, ]
dfm1
head(dfm)
head(dfm1)
## plot
ggplot(dfm) +
geom_point(
data = dfm,
aes(x = logFC.x, y = logFDR.x),
color = "green",
cex = 3
) +
geom_point(
data = dfm,
aes(x = logFC.y, y = logFDR.y),
color = "lightgreen",
cex = 3
) +
geom_point(
data = dfm1,
aes(x = logFC.x, y = logFDR.x),
color = "blue",
cex = 3
) +
geom_text(
data = dfm1,
aes(x = logFC.x, y = logFDR.x, label = gene),
hjust = 1,
vjust = 2
) +
theme_bw() +
xlab("Log(2) fold change") +
ylab("FDR") +
# linewidth needs ggplot2 >= 3.4.0; use size = 1 on older versions
geom_vline(
xintercept = c(-2, 2),
col = "red",
linetype = "dotted",
linewidth = 1
) +
geom_hline(
yintercept = 0.05,
col = "red",
linetype = "dotted",
linewidth = 1
)
Please provide example data.
Dear zx8754, hi, and thank you for your help.
I did not quite get the point about "example data". If you mean the structure of counts.matrix, it could be any counts, but the head of my "data" in the code above is as below:
For a given gene, how will you illustrate the connection between its result in DESeq and its (dependent) result in edgeR?
A gene A will get a logFC and a p-value twice, once from edgeR and once from DESeq, so you cannot plot both at the same time in a volcano plot.
Yes, you can; you just plot two different points for the same gene. Whether it's meaningful to do so is up for discussion.
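One way to show that connection is to draw a segment between the two points that belong to the same gene. This is only a minimal sketch on simulated values: the data frames `res_a` and `res_b` and their columns are hypothetical stand-ins for the two result tables, and the y-axis uses -log10(FDR) as in a conventional volcano plot.

```r
set.seed(1)
n <- 50
# Hypothetical results for the same genes from method A
res_a <- data.frame(gene  = paste0("gene", seq_len(n)),
                    logFC = rnorm(n, 0, 2),
                    FDR   = runif(n, 0.001, 1))
# Method B: similar but not identical estimates (simulated)
res_b <- data.frame(gene  = res_a$gene,
                    logFC = res_a$logFC + rnorm(n, 0, 0.3),
                    FDR   = pmin(1, res_a$FDR * exp(rnorm(n, 0, 0.3))))

# One row per gene, columns suffixed by method
both <- merge(res_a, res_b, by = "gene", suffixes = c(".a", ".b"))

# Axis limits cover the points of both methods
plot(both$logFC.a, -log10(both$FDR.a), pch = 16, col = "darkgreen",
     xlim = range(both$logFC.a, both$logFC.b),
     ylim = range(-log10(c(both$FDR.a, both$FDR.b))),
     xlab = "Log2 fold change", ylab = "-log10(FDR)")
points(both$logFC.b, -log10(both$FDR.b), pch = 16, col = "green")
# Join the two points that belong to the same gene
segments(both$logFC.a, -log10(both$FDR.a),
         both$logFC.b, -log10(both$FDR.b), col = "grey50")
```

With many genes the segments get crowded, so it may be worth drawing them only for a subset of interest.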
Dear russhh and Wouter, Hi. Here is the problem
I have done DEG analysis with both edgeR and DESeq2, and I checked the overlap between the two packages in both of my conditions (condition1 and condition2) using a Venn diagram. I tried to annotate the common/overlapping transcripts reported by both the exact test and the GLM method. I also checked the results using SARTools. So now I have two volcano plots, one for edgeR and one for DESeq2, with about 30% overlap of DEGs in condition1 and 50% overlap in condition2.
I was wondering if I can show all these in a single volcano plot using 3 colours.
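A rough sketch of what such a three-colour volcano could look like, assuming the two result tables have already been merged into one row per transcript. Everything here is illustrative: the table `res`, its column names, the simulated values and the thresholds are assumptions, and the coordinates are taken from one method only, with colour carrying the overlap.

```r
set.seed(2)
n <- 200
# Hypothetical merged table: one row per transcript, results from both methods
res <- data.frame(gene        = paste0("gene", seq_len(n)),
                  logFC.edger = rnorm(n, 0, 2),
                  FDR.edger   = runif(n, 0.0001, 1))
res$logFC.deseq <- res$logFC.edger + rnorm(n, 0, 0.3)
res$FDR.deseq   <- pmin(1, res$FDR.edger * exp(rnorm(n, 0, 1)))

# Significance call per method (illustrative thresholds)
sig_e <- res$FDR.edger < 0.05 & abs(res$logFC.edger) > 2
sig_d <- res$FDR.deseq < 0.05 & abs(res$logFC.deseq) > 2

# Classify each transcript by which method(s) call it significant
res$status <- "neither"
res$status[sig_e & !sig_d] <- "edgeR only"
res$status[!sig_e & sig_d] <- "DESeq2 only"
res$status[sig_e & sig_d]  <- "both"

cols <- c("neither" = "grey70", "edgeR only" = "darkgreen",
          "DESeq2 only" = "blue", "both" = "red")

# Coordinates from the edgeR columns; colour shows the overlap
plot(res$logFC.edger, -log10(res$FDR.edger), pch = 16,
     col = cols[res$status],
     xlab = "Log2 fold change (edgeR)", ylab = "-log10(FDR) (edgeR)")
abline(v = c(-2, 2), h = -log10(0.05), col = "red", lty = 3)
legend("topright", legend = names(cols), col = cols, pch = 16)
```

Because only one method's coordinates are plotted, points called by the other method alone can sit outside the drawn thresholds; the legend should make that explicit.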
Sorry, is it two different contrasts, each of which has been tested using both edgeR and DESeq2?
I have used the same data and the same expression counts produced by Trinity for condition1 and condition2 with both edgeR and DESeq2, and focused on the overlaps based on this idea: if multiple programs give you the same results, you can be confident that those results do not depend on the particular assumptions made by the programs/statistical tests.
I think the two methods are a bit too similar for that to be appropriate; I'd agree if you were comparing voom and DESeq2, or if you'd subsampled either the counts or the exons and run the two methods on the two partitions. Not sure whether the overlaid volcano would be of value in choosing between the methods; I'd rather see two MA plots side by side.
Good idea! I will check and compare MA plots, too. Aren't MA plots broadly similar to volcano plots?
Not really; it will show how your differentially expressed genes vary across different levels of average expression for the two methods.
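A minimal sketch of two MA plots side by side in base graphics. The data frames `edger_res` and `deseq_res` are simulated stand-ins; their column names follow the usual edgeR (`logCPM`, `logFC`, `FDR`) and DESeq2 (`baseMean`, `log2FoldChange`, `padj`) result columns, but the values and the 0.05 cut-off are made up for illustration.

```r
set.seed(3)
n <- 300
# Simulated stand-ins for the two result tables
edger_res <- data.frame(logCPM = runif(n, 0, 12),
                        logFC  = rnorm(n),
                        FDR    = runif(n))
deseq_res <- data.frame(baseMean       = 2^runif(n, 0, 12),
                        log2FoldChange = rnorm(n),
                        padj           = runif(n))

op <- par(mfrow = c(1, 2))  # two panels side by side
# Left panel: fold change against average expression for edgeR
plot(edger_res$logCPM, edger_res$logFC, pch = 16, cex = 0.6,
     col = ifelse(edger_res$FDR < 0.05, "red", "grey60"),
     xlab = "Average logCPM", ylab = "Log2 fold change", main = "edgeR")
abline(h = 0, lty = 3)
# Right panel: the same for DESeq2, with baseMean on a log2 scale
plot(log2(deseq_res$baseMean), deseq_res$log2FoldChange, pch = 16, cex = 0.6,
     col = ifelse(deseq_res$padj < 0.05, "red", "grey60"),
     xlab = "log2(baseMean)", ylab = "Log2 fold change", main = "DESeq2")
abline(h = 0, lty = 3)
par(op)  # restore the previous layout
```

Note that the x-axes of the two panels are on different scales (logCPM versus log2 of the mean normalised count), so the panels are best compared by shape, not by exact position.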
Okay you can, but no, I don't consider it meaningful :)
When plotting, you need to think what the message is that you want to give. In this case, that's quite unclear.
If you are asking me, with this approach I can first simplify the visualisation of the results (turning two separate volcano plots and a Venn diagram into a single volcano plot), and also make it clear to readers that the core DEGs chosen with the most conservative approach are significantly differentially expressed according to both statistical packages.