I am trying to analyze single-cell RNAseq (more specifically spatial transcriptomics) data.
I would like to calculate an index for an experimental condition. For example, I want to know whether a particular gene is spatially enriched in a certain region. The region is manually defined, so the setup is analogous to ordinary scRNA-seq in which specific cells are sorted before sequencing.
Since I cannot find an appropriate pipeline, I defined an index manually for my analysis. However, I would like to know whether this index is statistically sound. My intuition is that I can permute (shuffle) the samples and count how many permuted samples have an index more extreme than the actual one.
I have multiple samples per group, and each has a different number of cells/regions defined as ROI. I am currently permuting the cells within the samples.
I have the following questions:
- Although I can calculate a "p-value" analog by permutation for each sample, I am not sure how to combine these "p-values" across all the samples in the same experimental group, so that I can judge whether the index is statistically significant for the group as a whole. (I believe this is a one-sample permutation test question, except that it involves multiple samples.)
- When comparing multiple groups of samples, would it be statistically sound to compare the index between groups if one or more groups have an index that is not "more extreme than" the permuted samples?
Although it is not the focus of the current question, it would be great if you could point me to any established pipeline for such a test. I found that many protocols mention autocorrelation, but I am afraid this is not what I want. Thanks!
======update========
I have also rephrased the same question on Cross Validated.
Here is a toy dataset:
# toy dataset
set.seed(100)
group1_data1 <- matrix(data = rnorm(10000), nrow = 100, ncol = 100)
set.seed(200)
group1_data2 <- matrix(data = rnorm(10000), nrow = 100, ncol = 100)
set.seed(300)
group2_data1 <- matrix(data = rnorm(10000), nrow = 100, ncol = 100)
set.seed(400)
group2_data2 <- matrix(data = rnorm(10000), nrow = 100, ncol = 100)
rownames(group1_data1) <- paste0("gene_", 1:100) # can be another type of data, naming is not important here
colnames(group1_data1) <- c(paste0("region_A_", 1:33), paste0("region_B_", 1:33), paste0("region_C_", 1:34)) # regions are manually defined, and region_A is the ROI in this example
# Just an example of the calculation of the index, but this is not the main focus of the current question
calculate_indexA <- function(input) {
  # Mean over all genes for the first 33 columns (the region_A columns)
  mean(input[, 1:33])
}
region_A_index <- calculate_indexA(group1_data1)
# Permutation test: shuffle the column order so that a random set of cells
# ends up in the first 33 (region_A) positions, then recompute the index
set.seed(1)
n_perm <- 5 # kept tiny for the toy example; a real test needs far more (e.g. 1000)
region_A_index_shuffle <- replicate(n_perm, calculate_indexA(group1_data1[, sample(ncol(group1_data1))]))
# Calculate the "p-value": fraction of shuffles whose index exceeds the observed one
mean(region_A_index_shuffle > region_A_index)
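The shuffling above could be wrapped in a reusable function, so that each sample can be tested with many permutations. This is only a minimal sketch: `perm_pvalue` and `roi_mean` are hypothetical names of my own (not from any package), the test is one-sided, and the `+ 1` in the numerator and denominator is a common correction that keeps the estimated p-value from being exactly zero.

```r
# Hypothetical helper: permutation p-value of an index for a single sample.
# `index_fun` takes a matrix and returns one number, like calculate_indexA.
perm_pvalue <- function(mat, index_fun, n_perm = 1000) {
  observed <- index_fun(mat)
  # Shuffle the column order so a random set of cells lands in the ROI columns
  permuted <- replicate(
    n_perm,
    index_fun(mat[, sample(ncol(mat)), drop = FALSE])
  )
  # One-sided p-value with a +1 correction (never exactly zero)
  p <- (1 + sum(permuted >= observed)) / (1 + n_perm)
  list(observed = observed, permuted = permuted, p_value = p)
}

# Self-contained toy input mirroring group1_data1
set.seed(100)
toy <- matrix(rnorm(10000), nrow = 100, ncol = 100)
roi_mean <- function(m) mean(m[, 1:33]) # same definition as calculate_indexA

set.seed(1)
res <- perm_pvalue(toy, roi_mean, n_perm = 1000)
res$p_value
```

Since the toy data are pure noise, the resulting p-value is not expected to be small; the point is only the mechanics.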
My questions are:
- How can I get a combined "p-value" for group 1 as a whole (comprising data1 and data2)?
- When comparing multiple groups of samples, would it be statistically sound to compare the index between groups if one or more groups have an index that is not "more extreme than" the permuted samples?
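To make the first question concrete, here is a sketch of one commonly used way to combine per-sample p-values, Fisher's method. I am not sure it is appropriate here: it assumes the samples are independent, and permutation p-values are discrete, so the chi-squared reference distribution is only approximate. `perm_p` and `roi_mean` are hypothetical helper names.

```r
# Index as in calculate_indexA: mean over the first 33 (region_A) columns
roi_mean <- function(m) mean(m[, 1:33])

# Hypothetical helper: one-sided permutation p-value for one sample
perm_p <- function(mat, n_perm = 999) {
  observed <- roi_mean(mat)
  permuted <- replicate(n_perm, roi_mean(mat[, sample(ncol(mat))]))
  (1 + sum(permuted >= observed)) / (1 + n_perm)
}

# The two toy samples of group 1
set.seed(100)
group1_data1 <- matrix(rnorm(10000), nrow = 100, ncol = 100)
set.seed(200)
group1_data2 <- matrix(rnorm(10000), nrow = 100, ncol = 100)

# One permutation p-value per sample
set.seed(1)
p_group1 <- sapply(list(group1_data1, group1_data2), perm_p)

# Fisher's method: for k independent p-values, -2 * sum(log(p)) follows a
# chi-squared distribution with 2k degrees of freedom under the null
fisher_stat <- -2 * sum(log(p_group1))
p_combined <- pchisq(fisher_stat, df = 2 * length(p_group1), lower.tail = FALSE)
p_combined
```

An alternative that stays within the permutation framework might be to use the mean index across the samples of a group as the test statistic and shuffle within every sample on each iteration, but I do not know which of the two is preferable.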
Sorry for the late reply. I have updated my question with a toy dataset.
I am interested in the answers to (2) and (3). However, I think (2) is not exactly what I want to solve, as I do not necessarily want to compare two regions. Rather, I want to know whether the current definition of the region yields an index that is "significant" (rarer or more extreme than one obtained from a random definition of regions).