Removing Outliers from volcano plot
1
1
Entering edit mode
10.2 years ago
tiago211287 ★ 1.5k

I used this R script of rnaseqdata to generate a Volcano plot:

but there are 3 outliers point deforming the graphic shape. How can I remove it ?

resultado_table=rbind(data.frame(results(dds, contrast=c("condition", "Controle" , "Infectado"))))
resultado_table$Variable[1:39179] <- "Controle_vs_Colombiana"

##Highlight genes that have an absolute fold change > 2 and a p-value < Bonferroni cut-off
resultado_table$threshold = as.factor(abs(resultado_table$log2FoldChange) > 1 & resultado_table$padj < 0.05)
write.table(resultado_table, file="./GENES_DE")

volcano_plot <- ggplot(data=resultado_table, aes(x=log2FoldChange, y=-log10(padj) , colour=threshold)) +
  geom_point(alpha=0.4, size=1.75) +
  xlab("\n log2 fold change") +
  ylab("-log10 p-value adjusted \n") +
  theme_bw() +
  theme(axis.title.y = element_text(face="bold", size=16),
        axis.title.x = element_text(face="bold", size=16),
        axis.text = element_text(size=12),
        legend.title =element_blank() ,
        legend.text = element_text(size = 12)) +
  facet_wrap(~ Variable, ncol=3) +
  theme(strip.text = element_text(size=12, face="bold"))
volcano_plot
volcanoplot outliers • 8.0k views
ADD COMMENT
0
Entering edit mode

You can just use the xlim() and ylim() options to just set the bounds and not have to literally remove the datapoints.

ADD REPLY
0
Entering edit mode

Thank you Devon, as always you helping me. But where exactly I use this to actually work?

ADD REPLY
0
Entering edit mode

No problem. I should note there's a third option that's used by DESeq2 and a few other programs. In short, you modify the values that you're plotting such that anything outside of the bounds you want is now exactly on the border. You then denote that by changing the symbol used for these points. An example of that with an MA plot is below, where triangles denote values outside of the bounds.

I should note that this image is originally from Deseq's plotMA color-coding at random?

ADD REPLY
0
Entering edit mode

I'm actually trying to create a volcano plot in ggplot2 that INCLUDES outliers and displays them as triangles at the axis border. How can I code this? Thanks!

ADD REPLY
1
Entering edit mode

Assuming you have everything in a dataframe and the values are the "values" column:

df$shape <- ifelse(abs(df$values)>5, "triangle", "circle")
df$values[df$values>5] <- 5
df$values[df$values< -5] <- -5

You then use shape=shape in the aesthetic.

ADD REPLY
0
Entering edit mode

Thanks, it works!

What if I wanted to display the outliers like this for both the x and y axes?

ADD REPLY
1
Entering edit mode

It's the same general idea. You take whatever holds your X axis values and

df$shape[df$something>50] <- "triangle"
df$something[df$something>50] <- 50

I picked 50 as an arbitrary value.

ADD REPLY
0
Entering edit mode

Hmm, when I apply the code to both axes, all of my points just appear as a vertical line at x=1.5

Here's my code:

#y axis

df$shape <- ifelse(abs(df$pval) < 0.1, "triangle", "circle")
df$pval[df$pval<0.1] <- 0.1

#x axis

df$shape <- ifelse(abs(df$logFC) > 2.8, "triangle","circle")
df$logFC[df$logFC > 2.8] <- 1.5
df$logFC[df$logFC < -2.8] <- -1.5

g <- ggplot(df, aes(x=logFC, y=-log10(pval), shape=shape)) +
  geom_point(alpha=0.5, size=4) +
  theme(legend.position = "none") +
  xlim(c(-1.5, 1.5)) + ylim(c(0, 1)) +
  xlab("log2 fold change") + ylab("-log10 p-value")
ADD REPLY
1
Entering edit mode

A couple things:

  1. Don't set shape multiple times like that. You need to set just a subset of shape the second time: df$shape[abs(df$logFC)>1.5] <- "triangle"
  2. You're doing the trimming of logFC a bit weird. I assume you intend to do something like df$logFC[df$logFC>1.5] <- 1.5 and df$logFC[df$logFC < -1.5] <- -1.5.
  3. You don't need the xlim() part.

If for some reason you still get a line at x=1.5, then just have a look at df$logFC after each of the two trimming steps. It should be apparent then when things are going wrong and why.

ADD REPLY
2
Entering edit mode

Got it. The line of points was due to me screwing up the cutoff for replacing values and then forgetting to reset them to original values before trying the code again.

I end up having to set the xlim and ylim to the values that give the best zoom of the the majority of data points. I had 10 of so points that were very distant from the rest hence me finding this thread! Thanks for the help!

Here's my code and the volcano plot it produced for anyone's reference:

df$shape <- ifelse(df$pval < 0.03, "triangle", "circle")
df$pval[df$pval<0.03] <- 0.03

df$shape[(abs(df$logFC) > 1.5)] <- "triangle"
df$logFC[df$logFC >  1.5] <- 1.5
df$logFC[df$logFC < -1.5] <- -1.5

ggplot(data=df, aes(x=logFC, y=-log10(pval), shape=shape)) +
  geom_point(alpha=0.5, size=4) +
  theme(legend.position = "none") +
  xlim(c(-1.5, 1.5)) + ylim(c(0, -log10(0.03))) +
  xlab("log2 fold change") + ylab("-log10 p-value")

< image not found >

ADD REPLY
4
Entering edit mode
10.2 years ago

If you really want to remove the outliers you could remove the values for the x and/or y axis the data points outside a given quantile range. E.g. to remove values less than 0.01 and greater than 0.99 quantile:

qqx<- quantile(resultado_table$log2FoldChange, p= c(0.01, 0.99))
qqy<- quantile(resultado_table$padj, p= c(0.01, 0.99))

resultado_table2<- resultado_table[
    (resultado_table$log2FoldChange > qqx[1] & resultado_table$log2FoldChange < qqx[2]) &
    (resultado_table$padj > qqy[1] & resultado_table$padj < qqy[2]), ]

Now use resultado_table2 for plotting.

ADD COMMENT
0
Entering edit mode

Just 3 months later I figure out why I could not make this work. You solution works like a charm. Thank you. I was using write csv and cutting the values on excel then bringing back to R. But I came again on the post to read and figure out the problem.

ADD REPLY

Login before adding your answer.

Traffic: 2457 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6