I have generated some gene signatures of cell states from a single-cell experiment (signatures A-H).
I want to classify cells into one of the states A-H by selecting the appropriate highest score from the gene signatures. Each cell should have 1 state called, or if there is no clear assignment then the call should be NA.
However, the signature scores of different states have different distributions so I don't think it would be appropriate to just choose the max score for each cell.
I have made the scores available in long-format:
test.df <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vS5jLScYx4AaCiqRZwDqnqf41ozSzvLMLOWUU5VLT2FJ7XOBWbjJe_NLMOkK7-ndZ7m1LNFcD8ARB5L/pub?output=csv")
library(ggplot2)
library(pals)
kelly.cols <- kelly(22)[-c(1:2)]
ggplot(test.df,
       mapping = aes(x = value, fill = variable)) +
  geom_histogram(binwidth = 0.01) +
  theme_bw() +
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.x = element_line(linetype = "dotted"),
        panel.grid.major.x = element_line(linetype = "dotted")) +
  scale_fill_manual(values = kelly.cols[1:8])
I would like some advice on how to call the state based on the signature scores provided.
You can obtain the wide-format version of the data using:
test.mat <- matrix(test.df$value, nrow = nrow(test.df)/8, ncol = 8, byrow = FALSE,
                   dimnames = list(1:(nrow(test.df)/8),
                                   LETTERS[1:8]))
Apologies, I thought that was clear from "I want to select 1 value per cell from the gene signatures to represent the cell state". I have amended the post.
I'm trying to classify cells into states. A cell can be in only one state, or have an NA.
This can be problematic if your scores are correlated. What are the Spearman correlations between A-H?
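For what it's worth, a minimal sketch of how those correlations could be checked. It uses a simulated matrix (sim.mat, a made-up stand-in) so that it runs on its own; on the real data you would pass test.mat instead.

```r
# Simulated stand-in for test.mat (cells in rows, signatures A-H in columns).
# This is made-up data purely so the example runs; swap in test.mat.
set.seed(1)
sim.mat <- matrix(rnorm(200 * 8), nrow = 200, ncol = 8,
                  dimnames = list(NULL, LETTERS[1:8]))

# Pairwise Spearman correlations between the eight signature scores.
sig.cor <- cor(sim.mat, method = "spearman")
round(sig.cor, 2)
```

Large off-diagonal values would suggest that two signatures track each other, in which case the gap between best and next-best score will often be small for those states.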
There are a few ways to go about doing this. One of the simplest would be to set two thresholds: the minimum score (or rank) to make any call at all (a cell must have >this for at least one score), and a minimum gap between best and next-best to classify.
An alternative way would be to assign the top (say) 1% of cells from each class the corresponding label, use a multi-class classifier to apply labels and probabilities, and then set a probability threshold.
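A rough sketch of the two-threshold idea, in case it helps. The function name call_state and the default cut-offs (min.score = 0, min.gap = 0.05) are illustrative assumptions, not recommended values; it expects a cells-by-signatures matrix such as test.mat.

```r
# Hypothetical helper: call one state per cell, or NA when the call is unclear.
# score.mat: numeric matrix, cells in rows, named signature columns.
# min.score: the best score must exceed this for any call to be made.
# min.gap:   the best score must beat the runner-up by at least this much.
call_state <- function(score.mat, min.score = 0, min.gap = 0.05) {
  best.idx <- max.col(score.mat, ties.method = "first")
  best     <- score.mat[cbind(seq_len(nrow(score.mat)), best.idx)]
  # Second-highest score in each row.
  second   <- apply(score.mat, 1, function(x) sort(x, decreasing = TRUE)[2])
  calls    <- colnames(score.mat)[best.idx]
  calls[best <= min.score | (best - second) < min.gap] <- NA_character_
  calls
}

# Toy input: three cells, three signatures.
toy <- matrix(c( 0.30,  0.10, -0.05,   # clear A
                 0.20,  0.18,  0.01,   # gap between A and B too small
                -0.10, -0.02, -0.30),  # nothing above min.score
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("A", "B", "C")))
call_state(toy)
# "A" NA NA
```

Because the signatures have different distributions, the same idea could be applied to per-signature ranks or quantiles instead of raw scores; that is a design choice rather than something this sketch does for you.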
Thanks for that. What I've already done seems to fit your initial suggestion:
I calculated the gap score
And set a minimum threshold of 0.05
And then any calls that have a score < 0 have been set to NA also.
Do you have any code you could provide for the classifier approach you mentioned? I don't have any experience with making classifiers.
You'd basically set up a dataframe that had columns
score_A score_B ... score_H classification
and populate it with all of your data (the class for the top 1% or whatever, and NA elsewhere), then fit a classifier on the labelled rows and predict on every row to get the predictions & probabilities. See https://www.rdocumentation.org/packages/e1071/versions/1.7-14/topics/svm for details - obviously you can use any multi-class classifier, but svm is as fine a place as any to start.
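A minimal sketch of what that could look like with e1071::svm, under stated assumptions. The data are simulated with only three signatures, the seed labels use the top 5% rather than 1% (only because the toy data set is small), and the 0.8 probability cut-off is arbitrary. On real data the columns would be score_A to score_H.

```r
library(e1071)

set.seed(42)
# Simulated stand-in: 300 cells, 3 signature scores (made-up data).
n <- 300
true.state <- sample(c("A", "B", "C"), n, replace = TRUE)
scores <- data.frame(
  score_A = rnorm(n, mean = ifelse(true.state == "A", 1, 0), sd = 0.3),
  score_B = rnorm(n, mean = ifelse(true.state == "B", 1, 0), sd = 0.3),
  score_C = rnorm(n, mean = ifelse(true.state == "C", 1, 0), sd = 0.3)
)

# Seed labels: top 5% of cells per signature get that label, NA elsewhere.
# A cell in the top set of two signatures would take the later label here;
# on real data such cells should be handled explicitly (e.g. left as NA).
scores$classification <- NA_character_
for (s in c("A", "B", "C")) {
  col <- paste0("score_", s)
  top <- scores[[col]] >= quantile(scores[[col]], 0.95)
  scores$classification[top] <- s
}
scores$classification <- factor(scores$classification)

# Train on the labelled cells only.
train <- scores[!is.na(scores$classification), ]
fit <- svm(classification ~ ., data = train, probability = TRUE)

# Predict class probabilities for every cell from the score columns.
pred <- predict(fit, newdata = scores[, 1:3], probability = TRUE)
prob <- attr(pred, "probabilities")

# Keep the most probable class only if it clears the (arbitrary) threshold.
best.prob  <- apply(prob, 1, max)
final.call <- colnames(prob)[max.col(prob, ties.method = "first")]
final.call[best.prob < 0.8] <- NA
table(final.call, useNA = "ifany")
```

The probabilities come from a model trained on a small, extreme subset of cells, so it is worth checking the final calls against the raw score distributions before trusting the threshold.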
Cheers, will give it a go.