Adding p value to Vlnplot in Seurat
1
3
Entering edit mode
4.2 years ago
ww22runner ▴ 60

Hello everyone,

Could anyone help me understand if there is any way I can add p values to violin plots produced by Seurat? Currently, I am trying to plot several genes across all clusters, broken up by responders vs non-responders and this is my code:

vp_case1 <- function (gene_signature, file_name){
  plot_case1 <- function(signature){
    VlnPlot(pbmc, features = signature,
            pt.size = 0.1, 
            group.by = "Response"
            ) + stat_compare_means(comparisons = pbmc$Response, label = "p.signif")
  }

  map(gene_signature, plot_case1) %>% cowplot::plot_grid(plotlist = .)
  file_name <- paste0(file_name, "_r.png")
  ggsave(file_name, width = 14, height = 8)
}

 vp_case1(gene_sig, "gene_sig")

where gene_sig is a vector of a few genes of interest.

However, I get the following error:

Warning messages:
1: Computation failed in `stat_signif()`:
not enough 'y' observations 
2: Computation failed in `stat_signif()`:
not enough 'y' observations 
.
.
.

How could I do this better?

Thank you!

seurat • 27k views
ADD COMMENT
0
Entering edit mode

Have you checked if VlnPlot and stat_compare_means can work together like that?

ADD REPLY
0
Entering edit mode

Hi igor,

Please correct me if I am wrong but it says VlnPlot returns "a patchworked ggplot object if combine = TRUE; otherwise, a list of ggplot objects" while stat_compare_mean can be used to "add mean comparison p-values to a ggplot" so I thought it should work.

ADD REPLY
12
Entering edit mode
4.2 years ago

Hi,

The problem that you got is related with the fact that you need to provide a list of comparisons to be made to the function stat_compare_means(), and you are providing a character/factor variable. So, if you adjust your code to:

vp_case1 <- function(gene_signature, file_name, test_sign, y_max){
plot_case1 <- function(signature){
    VlnPlot(pbmc, features = signature,
            pt.size = 0.1, 
            group.by = "Response", 
            y.max = y_max, # add the y-axis maximum value - otherwise p-value hidden
    ) + stat_compare_means(comparisons = test_sign, label = "p.signif")
  }
  map(gene_signature, plot_case1) %>% cowplot::plot_grid(plotlist = .)
  file_name <- paste0(file_name, "_r.png")
  ggsave(file_name, width = 14, height = 8)
}

In this adjusted code I added 2 extra parameters:

  • test_sign: that is a list of a vector of characters. So, in your case you have a Response variable that should assume 2 or more levels. Let's say that your Response variable can be CTRL or TRT (control or treatment). Thus test_sign = list(c("CTRL", "TRT")).

  • y_max: by default the function VlnPLot will adjust the y-axis to the maximum expression value per comparison made. What this means is that if you're comparing the gene A between CTRL and TRT and the max. value is 7, the maximum y-axis scale will be 7. When you apply the stat_compare_means() function, the p-value will be added above, but since the maximum y-axis value is 7, you'll not see it. Therefore, you can adjust the y_max to one more unit, let's say 8, and therefore you'll be available to see the p-value. Though as you can think this is not the best option since often you'll have distinct y maximum values, and therefore setting the same y_max will not be the best option. In that case what you can do is just a loop instead a map() giving a y_max a numeric vector. Let me know if you have problems regarding this issue.

Here is an example:

## Follow the Seurat tutorial until UMAP at: https://satijalab.org/seurat/v3.2/pbmc3k_tutorial.html

## create fake 'Reponse' variable
pbmc@meta.data$Response <- rep(c("CTRL", "TRT"), times = c(1500, 2638-1500))

vp_case1 <- function(gene_signature, file_name, test_sign, y_max){
  plot_case1 <- function(signature){
    VlnPlot(pbmc, features = signature,
            pt.size = 0.1, 
           group.by = "Response", 
            y.max = y_max, # add the y-axis maximum value - otherwise p-value hidden
    ) + stat_compare_means(comparisons = test_sign, label = "p.signif")
  }
  purrr::map(gene_signature, plot_case1) %>% cowplot::plot_grid(plotlist = .)
  file_name <- paste0(file_name, "_r.png")
  ggsave(file_name, width = 14, height = 8)
}

gene_sig <- c("NKG7", "PF4")
comparisons <- list(c("CTRL", "TRT"))
vp_case1(gene_signature = gene_sig, file_name = "Desktop/gene_sig", test_sign = comparisons, y_max = 7)

Result:

enter image description here

I hope this answers your question,

António

ADD COMMENT
1
Entering edit mode

Code updated below to give you an adjusted y-axis scale with p-value:

## Follow the Seurat tutorial until UMAP at: https://satijalab.org/seurat/v3.2/pbmc3k_tutorial.html

## create fake 'Reponse' variable
pbmc@meta.data$Response <- rep(c("CTRL", "TRT"), times = c(1500, 2638-1500))

vp_case1 <- function(gene_signature, file_name, test_sign){
  plot_case1 <- function(signature, y_max = NULL){
    VlnPlot(pbmc, features = signature,
            pt.size = 0.1, 
            group.by = "Response", 
            y.max = y_max # add the y-axis maximum value - otherwise p-value hidden
    ) + stat_compare_means(comparisons = test_sign, label = "p.signif")
  }
  plot_list <- list()
  y_max_list <- list()
  for (gene in gene_signature) {
    plot_list[[gene]] <- plot_case1(gene)
    y_max_list[[gene]] <- max(plot_list[[gene]]$data[[gene]]) # get the max no. for each gene
    plot_list[[gene]] <- plot_case1(gene, y_max = (y_max_list[[gene]] + 1) )
  }
  cowplot::plot_grid(plotlist = plot_list)
  file_name <- paste0(file_name, "_r.png")
  ggsave(file_name, width = 14, height = 8)
}

gene_sig <- c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")
comparisons <- list(c("CTRL", "TRT"))
vp_case1(gene_signature = gene_sig, file_name = "Desktop/gene_sig_2", test_sign = comparisons)

Result:

enter image description here

Therefore, you don't need to worry with the y_axis scale.

António

ADD REPLY
0
Entering edit mode

Hi Antonio,

Thanks so much, these graphs are exactly what I needed! May I ask if we might be able to show the same information on a split violin plot for each gene for responder vs non-responders for a particular cluster?

Thank you!

ADD REPLY
0
Entering edit mode

Hi ww22runner,

Not sure. Did you try?

Let me know if it worked.

António

ADD REPLY
0
Entering edit mode

Hi antonioggsousa I tried your code and may I ask about the basis of significance? Is the significance: (a) influenced by the number of cells in each group (CTRL vs TRT) or (b) the significance is based on the differential "averaged expression" of the gene between them?

Because in my own data, it seems like the significance is about the number of cells. In my generated plot, I noticed that some have very high significance (4 asterisk stars) even though the 2 groups have a visually/probably equal or not too different expression level, for example the "CD8A" in my plot. However, when you look at CD14, THERE'S ONLY SINGLE * (1 asterisk star).

enter image description here

ADD REPLY
0
Entering edit mode

Hi @jvbeltran0122,

As far as I remember by default the function stat_compare_means() (link to the page documentation) will use the Mann–Whitney U test or also called Wilcoxon rank-sum test (read more about it in the wiki). Therefore, the significance depends on the formula of this test.

I think if you want to test differential gene expression, you should apply one of the available methods in Seurat in order to get that. I don't think that is very meaningful comparing the expression level of one gene across samples with the Wilcoxon rank-sum test, because you have a lot of cells and, even if the difference is small, it'll have an high probability of being statistically significant (whatever that means).

ADD REPLY
1
Entering edit mode

Hi there, was wondering the exact same thing...

If you do DE analysis between the groups using FindAllMarkers(), you will get both a p-value and a Bonferroni adjusted p-value where you take into account the number of tests. That's also using for instance the default Wilcoxon rank-sum test. I think the p-value on these plots correspond to the unadjusted ones. I also have several genes that appear differently distributed on the plots but they are not statistically significant if you look at p adjusted in the DE analysis.

I believe it would most likely be wrong to report the unadjusted one, although Bonferroni may be more stringent than necessary- it is calculated as the unadjusted p-value multiplied by the number of features, for me it's 17729 (which is the number of tests) although a large number of them are irrelevant... False discovery rate could be better but I am not sure how to calculate it from seurat objects yet.

Please correct me if I'm wrong, this is probably a discussion you can't have too many times :)

ADD REPLY
0
Entering edit mode

Hi, Antonio!

I'm new at Single Cell and I'm trying to perform an similar analysis with public data on GEO (GSE93374). In the command pbmc@meta.data$Response <- rep(c("CTRL", "TRT"), times = c(1500, 2638-1500)) What this values in times = c() means?

ADD REPLY
1
Entering edit mode

Hi Fernanda,

Basically that command is creating a new metadata variable in pbmc@meta.data called Response that holds one of two categories CTRL or TRT. Instead of creating a vector in R with these categories, such as c("CTRL", "CTRL", "CTRL", "CTRL", "CTRL", ...,"TRT", "TRT", "TRT", "TRT", "TRT", ...), where the 1st element in this vector would correspond to the category for the 1st cell, the second for the second cell and so on, I used the rep() function in R (you can read more about it here) . What this function does is to replicate a set of values provided in the first part of the function, in this case c("CTRL", "TRT"). times is the argument in this function that defines how many times do we want to replicate or repeat the elements provided in the first part of the function. If you use times argument you need to specify how many times do you want each element given before to be replicated/repeated. In this case I wanted to replicate 1500 times CTRL and 2638-1500 times TRT (in the latter I'm giving 2638-1500, because I'm subtracting the total no. of cells 2638 to 1500 which corresponds to the no. of cells that belong to CTRL - this gives us the no. of cells in TRT - I was lazy to calculate this myself and write down just one value ;) ).

If you want you can try and see what this does in you R console by typing the following: rep(c("CTRL", "TRT"), times = c(10, 10))

This will create a character vector in R with 20 elements, where the first 10 are CTRL and the second 10 are TRT.

I hope this helps,

António

ADD REPLY
0
Entering edit mode

Hello António!

I've try to run those codes above like this:

seurat@meta.data$Response <- rep(c("Fasted", "HFD"), times = c(10543, 10543))

vp_case1 <- function(gene_signature, file_name, test_sign){
  plot_case1 <- function(signature, y_max = NULL){
    VlnPlot(seurat, features = signature,
            pt.size = 0.1, 
            group.by = "Response", 
            y.max = y_max
    ) + stat_compare_means(comparisons = test_sign, label = "p.signif")
  }
  plot_list <- list()
  y_max_list <- list()
  for (gene in gene_signature) {
    plot_list[[gene]] <- plot_case1(gene)
    y_max_list[[gene]] <- max(plot_list[[gene]]$data[[gene]])
    plot_list[[gene]] <- plot_case1(gene, y_max = (y_max_list[[gene]] + 1) )
  }
  cowplot::plot_grid(plotlist = plot_list)
  file_name <- paste0(file_name, "rplot.png")
  ggsave(file_name, width = 14, height = 8)
}

gene_sig <- c("Hspd1", "Lonp1", "Yme1l1", "Clpp")
comparisons <- list(c("Fasted", "HFD"))
vp_case1(gene_signature = gene_sig, file_name = "seurat", test_sign = comparisons)

I used 10543 values in 'times', since my data has 21086 rows. But something isn't right. When I plot Vlnplot from my object I saw a different expression compared when I run this code, who appears like this. the image

And not like this, as I wanted (in this case, the expression of Lonp1 between all the samples).Lonp1

Do you know what am I doing wrong?

Since I'm comparing only two samples (fasted and HFD) from my data, do I have to put in 'times' when I create the 'Response' metadata the right number of rows in each samples? There's any code to find this numbers?

Thank you for all your help anyway!

ADD REPLY
0
Entering edit mode

Hi Fernanda,

I think there are some misunderstandings from the code above.

First the example that I gave above you have a very simple scenario where the Seurat object called pbmc comprises only cells from two samples or two conditions, if you prefer, which are _CTRL_ or _TRT_. Based on your last plot, this does not seem to be your case. If I understood correctly you have a variable in the meta.data slot from your Seurat object, named seurat, which comprises 11 samples or conditions, among them are Fasted and HFD, which I think are the only two conditions/groups that you're interested in compare. If so (this is an assumption from my side, since I don't know your data), you already have a variable or column name in your Seurat meta.data (which you can consult by doing head(seurat@meta.data) to print the first few lines of this table) that you are using to make the last plot above that holds the two conditions/groups that you want to compare, which are Fasted vs HFD.

If all the assumptions that I did above are correct, then what you're doing wrong is that you're using all your cells to plot the difference between Fasted vs HFD, even cells that do not belong to these two conditions/groups/samples, that's why your violin plots between conditions look virtually the same, because you are picking the first half of all your cells 10543 and saying that they are Fasted and the other half are HFD (very unlikely you would have the same exact no. of cells for two conditions).

If my assumptions are correct, what you want to do is to subset your Seurat object to contain only cells that belong to the conditions/groups/samples Fasted and HFD, and only then you can run the code above. I don't know the name of the column that you used to plot the plot above and which holds the variables Fasted and HFD. Let's say that it is called group. You can subset your Seurat object for the cells belonging to the conditions Fasted or HFD by doing (you should substitute group by the real name): subset_seurat <- subset(x = seurat, subset = group %in% c("Fasted", "HFD")) I'm not 100% sure if the code above works. If it doesn't please add a reply. You can check if it worked by plotting the same plot as it appears above (the last one) but instead of given the object seurat you give subset_seurat. Then you would expect the exact same plot but only the conditions Fasted and HFD represented.

Then, in the first line of code, you substitute rep(c("Fasted", "HFD"), times = c(10543, 10543)) by as.character(subset_seurat@meta.data[,"group"]). Again, you need to substitute group by the column name that you have, which I don't know which is. Here you are just picking up the character vector that holds Fasted or HFD in the right order from the meta.data slot from your subset_seurat Seurat object.

I hope this helps.

António

ADD REPLY
0
Entering edit mode

Hi António,

I've been using this function for my datasets. Do you have any idea how you could plot the logFC data for each VlnPlot on the graph? With some of the smaller differences, they are significant but it's not clear whether it was increased or decreased.

Thanks again for this cool function!

Patrick

ADD REPLY

Login before adding your answer.

Traffic: 1713 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6