Question

Overriding a color palette in R to emphasize specific data points

0

8 months ago

jen ▴ 10

Hi everyone, I am making some plots to visualize my sequencing data and I am struggling to change the colors of some of my points in ggplot2. I want to use the viridis palette but plot ONLY the points with percid = 0 (shown as 0.000 in the data frame) in grey, i.e. override the viridis scheme only where percid = 0. Any point with percid != 0 should keep its regular viridis color. A point from ANY order can have percid = 0, so none of the legend entries should turn grey and no "Zero Percent Identity" category should be created. The only thing that should change is that points with percid = 0 (of any order or kmer_cov) are drawn in grey. I am plotting kmer_cov on the x axis and percid on the y axis, and this is my code:

ggplot(contiginfo, aes(x = kmer_cov, y = percid, size = querylength, color = factor(order))) +
  geom_point(aes(color = ifelse(percid == 0, "Zero Percent Identity", order))) +
  scale_x_log10() +  # Use a logarithmic scale for the x-axis
  scale_size_continuous(range = c(1, 10)) +  # Adjust the size range as needed
  scale_color_manual(values = c("grey", viridis::viridis_pal()(length(unique(contiginfo$order))))) +  # Use grey for zero percent identity, Viridis for others
  labs(x = "kmer_cov", y = "percid", size = "Query Length", color = "Order") +
  theme_minimal() +
  theme(axis.title.x = element_text(size = 12),  # Change size of x-axis label
        axis.title.y = element_text(size = 12)) +
  guides(color = guide_legend(override.aes = list(size = 4)))

Now this is what I get as output: [image: R plot]

Any advice is appreciated, thanks!

R ggplot2 • 1.2k views

updated 8 months ago by swbarnes2 14k • written 8 months ago by jen ▴ 10

2

8 months ago

Trivas ★ 1.8k

A possible workaround could be to do two separate geom_point calls; one with your full dataset, and then the second (order is important here!) with a subsetted dataset that only contains data corresponding to percid = 0. You would manually set color = "grey" for the second geom.
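
A minimal sketch of that two-layer idea, assuming a toy data frame in place of the real contiginfo (the column names follow the question, the values are made up). The grey layer is added last so it is drawn on top, and because "grey" is set as a constant outside aes(), it creates no legend entry:

```r
library(ggplot2)

# Toy stand-in for contiginfo; values are invented for illustration only
contiginfo <- data.frame(
  kmer_cov    = c(5, 20, 150, 800, 40, 300),
  percid      = c(0, 87.5, 99.1, 0, 92.3, 0),
  querylength = c(500, 1200, 3000, 800, 1500, 2500),
  order       = c("A", "A", "B", "B", "C", "C")
)

p <- ggplot(contiginfo, aes(x = kmer_cov, y = percid, size = querylength)) +
  # Layer 1: every point, colored by order with the discrete viridis scale
  geom_point(aes(color = factor(order))) +
  # Layer 2: only the percid == 0 rows, overplotted in a fixed grey
  geom_point(data = subset(contiginfo, percid == 0), color = "grey") +
  scale_x_log10() +
  scale_color_viridis_d() +
  labs(color = "Order", size = "Query Length")

p
```

The legend still lists every order in its viridis color; only the overplotted points change.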

2

8 months ago

swbarnes2 14k

Just a suggestion: this code gives you a vector of 74 colors in which adjacent colors are distinct from each other. Instead of looking like a rainbow where one color blends into the next, they will all look distinct.

library(RColorBrewer)

qual_col_pals = brewer.pal.info[brewer.pal.info$category == 'qual',]
col_vector = unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))
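
One way this vector could be plugged into a plot, sketched with a made-up toy data frame (the column names are borrowed from the question). The palette is truncated to the number of orders present, and unname() is used as a precaution so that scale_color_manual() assigns colors by position rather than by name:

```r
library(ggplot2)
library(RColorBrewer)

# Concatenate all qualitative ColorBrewer palettes into one color vector
qual_col_pals <- brewer.pal.info[brewer.pal.info$category == "qual", ]
col_vector <- unname(unlist(mapply(brewer.pal,
                                   qual_col_pals$maxcolors,
                                   rownames(qual_col_pals))))

# Toy data; values are invented for illustration only
toy <- data.frame(
  kmer_cov = c(5, 20, 150, 800),
  percid   = c(0, 87.5, 99.1, 92.3),
  order    = c("A", "B", "C", "D")
)

# Take as many colors as there are distinct orders
n_orders <- length(unique(toy$order))

p <- ggplot(toy, aes(x = kmer_cov, y = percid, color = factor(order))) +
  geom_point(size = 3) +
  scale_color_manual(values = col_vector[seq_len(n_orders)]) +
  labs(color = "Order")

p
```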

I do agree with Matthias that you can't put all this data into one plot. If someone really wants to know the value for Decapoda, they should look it up in a table. The visualization should be giving some kind of overall summary, or pointing out some trend across multiple samples that is hard to see in a table format.

score 3 · Accepted Answer · 2024-03-13

3

8 months ago

Matthias Zepper 5.0k

I would subset your data accordingly and just use two geom_point() layers:

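# Shared aesthetics go in ggplot(); each layer gets its own subset of the data
ggplot(mapping = aes(x = kmer_cov, y = percid, size = querylength, color = factor(order))) +
  geom_point(data = subset(contiginfo, percid > 0)) +  # non-zero points keep the mapped colors
  geom_point(data = subset(contiginfo, percid == 0), color = "grey")  # zero points in fixed grey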

But I would refrain from plotting the data like this altogether. The human eye is very poor at distinguishing nuanced shades of colour, and you have a lot of orders here. What would be the message of your plot? Will people want to compare the results for two different orders against each other? Or get an idea of the kmer coverage in the whole dataset or for each order? I strongly recommend putting your orders on one discrete axis (preferably y for readability) and showing the numbers as summary statistics instead, e.g. as a boxplot or density. You could plot kmer_cov, percid and query length in facets reusing the same y-axis.
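
A rough sketch of that layout, assuming simulated data in place of the real contiginfo (the column names follow the question, the values are random). The three measurements are stacked into a long table and drawn as horizontal boxplots, one facet per measurement, with order on the shared y-axis. Horizontal boxplots without coord_flip() need ggplot2 3.3.0 or later:

```r
library(ggplot2)

# Simulated stand-in for contiginfo; values are random, for illustration only
set.seed(1)
contiginfo <- data.frame(
  order       = rep(c("A", "B", "C"), each = 10),
  kmer_cov    = rlnorm(30, meanlog = 3, sdlog = 1),
  percid      = runif(30, min = 0, max = 100),
  querylength = round(runif(30, min = 300, max = 3000))
)

# Stack the three measurements into long format: one row per (contig, metric)
metrics <- c("kmer_cov", "percid", "querylength")
long <- do.call(rbind, lapply(metrics, function(m) {
  data.frame(order = contiginfo$order, metric = m, value = contiginfo[[m]])
}))

p <- ggplot(long, aes(x = value, y = order)) +
  geom_boxplot() +
  # One panel per metric; x scales differ, the order axis is shared
  facet_wrap(~ metric, scales = "free_x") +
  labs(x = NULL, y = "Order") +
  theme_minimal()

p
```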

1

Completely agree with you. I am still figuring out how to logically organize my data and decide what I am trying to communicate. I am going to try grouping the colors by superkingdom. By the way, this is a virome sequencing project, which makes sense given the number of sequences that were classified as viral. Thanks for the help and suggestions!

Progress update: [image]

8 months ago by jen ▴ 10

0

Better. But at least visually, there doesn't seem to be a correlation between kmer_cov and percid, so is there really a reason to plot the two dimensions against each other? Is your Query Length a normalization factor? In that case, you could directly plot the normalized values...

PS: Shape could be a useful aesthetic to visually discriminate Viruses, Eukaryota and Bacteria in a dot plot.
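
A small sketch of the shape idea, assuming a hypothetical superkingdom column (the name and all values are made up here; the real data may label this differently):

```r
library(ggplot2)

# Toy data; "superkingdom" is an assumed column name, values are invented
toy <- data.frame(
  kmer_cov     = c(5, 20, 150, 800, 40, 300),
  percid       = c(12.0, 87.5, 99.1, 45.2, 92.3, 70.8),
  superkingdom = c("Viruses", "Viruses", "Eukaryota",
                   "Eukaryota", "Bacteria", "Bacteria")
)

p <- ggplot(toy, aes(x = kmer_cov, y = percid, shape = superkingdom)) +
  geom_point(size = 3) +
  scale_x_log10() +
  labs(shape = "Superkingdom")

p
```

Shape only stays readable for a handful of categories, which is why it suits three superkingdoms better than dozens of orders.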

8 months ago by Matthias Zepper 5.0k