Question

Colouring variables in R

0

2.1 years ago

rene.j.erhardt ▴ 30

I am trying to plot a correlation between two variables with the datapoints in different colours and I am stuck here:

ggscatter(df, x = "Lachnospiraceae", y = "Akkermansiaceae", 
          add = "reg.line", conf.int = TRUE, 
          add.params = list(color = "brown"), 
          cor.coef = TRUE, cor.method = "spearman",
          cor.coef.coord = c(1900,2200),   
          cor.coef.size = 4, 
          xlab = "Lachnospiraceae", ylab = "Akkermansiaceae")

I don't have a grouping variable, just 2 columns of counts of bacterial families present in every person. The above code gives me all datapoints in black. Anyone with a good idea?

R • 1.9k views

ADD COMMENT • link updated 2.1 years ago by seidel 11k • written 2.1 years ago by rene.j.erhardt ▴ 30

1

I don't necessarily understand everything in your R code, but I'm pretty sure that what you want can't be done. It may sometimes seem that plotting functions can perform magic, but they usually need to know which class each data point belongs to before they can colour it. I think you need to add a third column where each data point gets a category corresponding to the colors you wish to use, and then color by that column.

ADD REPLY • link 2.1 years ago by Mensur Dlakic ★ 29k

1

What are the criteria for colouring the datapoints? Once you have them, you can use the color and palette arguments of the ggscatter function.

ADD REPLY • link 2.1 years ago by iraun 6.2k

0

Apparently, it's more difficult than I thought. My simple idea was that I have 2 columns of data, and all I wanted was one column in one colour and the other in a different colour, instead of everything in the same colour. When I added this to the second line:

col = "blue"

I got all points in blue instead of black, which is why I was hoping to write a line of code that would give me two colours. A third column with 'group' wouldn't help because both families are present at the same time.

ADD REPLY • link 2.1 years ago by rene.j.erhardt ▴ 30

1

But think this through: you have an x,y plot, which means that each point on your plot consists of two values: one from Lachnospiraceae, the other from Akkermansiaceae. For example, a given point might be (x, y) = (83, 102). Which number should be used to determine the color? What does the color of a single point indicate? What question are you trying to answer with x, y? What additional question are you trying to answer with a color? Are you looking for a different kind of plot? Maybe two boxplots, one colored for L, the other for A (which would address the question: does the distribution of L differ from that of A?).
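A minimal sketch of the two-boxplot idea, in case it helps. It assumes the counts sit in a data frame df with the two family columns. The toy values, the column names count and family, and the two colours are placeholders chosen for illustration, not anything from the real data.

```r
library(ggpubr)

# Toy data standing in for the real counts (hypothetical values)
set.seed(1)
df <- data.frame(Lachnospiraceae = runif(60, 1, 1900),
                 Akkermansiaceae = runif(60, 1, 2200))

# Reshape to long form: one row per (person, family) measurement.
# stack() returns two columns: 'values' (the counts) and 'ind' (the source column).
long <- stack(df[c("Lachnospiraceae", "Akkermansiaceae")])
names(long) <- c("count", "family")

# One box per family, each in its own colour
p <- ggboxplot(long, x = "family", y = "count",
               color = "family", palette = c("steelblue", "brown"),
               xlab = "Family", ylab = "Count")
print(p)
```

In long form the family name becomes a grouping variable, which is what gives each family its own colour. In the scatter plot every point carries both families at once, so no such variable exists there.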

ADD REPLY • link 2.1 years ago by seidel 11k

score 1 · Answer 1 · 2023-03-18

I don't have a grouping variable, just 2 columns of counts of bacterial families present in every person.

If you have only two columns of data, there is essentially nothing to color by in a scatter plot. To confirm this, it would be helpful to show a little of your data. You might try to write out the question that would be answered by using color.

On the other hand, if you had some third type of data you want to represent with color, you could hand that column name to the color argument, as iraun points out:

library(ggplot2)
library(ggpubr)

# create some toy data
df <- data.frame(Lachnospiraceae=runif(60,1,1900), Akkermansiaceae=runif(60,1,2200))
# add something to color by
df$class <- sample(c(letters[1:3]), nrow(df), replace=TRUE)

ggscatter(df, x = "Lachnospiraceae", y = "Akkermansiaceae",
          color="class",
          add = "reg.line", conf.int = TRUE, 
          add.params = list(color = "brown"), 
          cor.coef = TRUE, cor.method = "spearman",
          cor.coef.coord = c(1900,2200),   
          cor.coef.size = 4, 
          xlab = "Lachnospiraceae", ylab = "Akkermansiaceae")

If you have only two columns of data, but want to highlight some characteristic of the data itself (e.g. value ranges), you would still have to add a third column holding that characteristic, for example:

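# create some toy data classified by value range
# sorting both vectors pairs small values with small and large with large,
# so the toy data are strongly positively correlated
x <- sort(runif(60, 1, 1900))
y <- sort(runif(60, 1, 2200))

df <- data.frame(Lachnospiraceae = x, Akkermansiaceae = y)
# add something to color by:
# "a" for x < 600, "b" for 600 <= x < 1200, "c" otherwise
df$class <- "c"
df$class[x < 1200] <- "b"
df$class[x < 600] <- "a"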
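One possible way to finish this example, offered as a sketch rather than the code behind the original figure: build the class column with cut() and pass it to the color argument, as in the earlier call. The break points 600 and 1200 come from the code above. The palette colours and the seed are arbitrary choices, and cor.coef.coord is left out so the correlation label takes its default position.

```r
library(ggpubr)

# Toy data as above: sorting both vectors makes them strongly correlated
set.seed(1)
df <- data.frame(Lachnospiraceae = sort(runif(60, 1, 1900)),
                 Akkermansiaceae = sort(runif(60, 1, 2200)))

# Bin x into ranges: x < 600 -> "a", 600 <= x < 1200 -> "b", x >= 1200 -> "c"
# right = FALSE makes each interval closed on the left, open on the right
df$class <- cut(df$Lachnospiraceae,
                breaks = c(-Inf, 600, 1200, Inf),
                labels = c("a", "b", "c"),
                right = FALSE)

# Colour the points by the derived class column
p <- ggscatter(df, x = "Lachnospiraceae", y = "Akkermansiaceae",
               color = "class",
               palette = c("steelblue", "orange", "darkgreen"),
               add = "reg.line", conf.int = TRUE,
               add.params = list(color = "brown"),
               cor.coef = TRUE, cor.method = "spearman",
               xlab = "Lachnospiraceae", ylab = "Akkermansiaceae")
print(p)
```

Here the colour only restates the position along the x axis, so it adds no new information to the plot. It is useful mainly for drawing the eye to particular value ranges.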