Question

Summarizing permutation test/bootstrap results across multiple samples for single-cell RNA-seq data

20 months ago

cwwong13 ▴ 40

I am trying to analyze single-cell RNAseq (more specifically spatial transcriptomics) data.

I would like to calculate an index for an experimental condition. For example, I want to know whether a particular gene is spatially enriched in a certain region. The region is manually defined, so this is analogous to ordinary scRNA-seq in which specific cells are sorted before sequencing.

Since I cannot find an appropriate pipeline, I manually defined an index for my analysis. However, I would like to know whether this index is statistically sound. My intuition is that I can permute (shuffle) the samples and count how many permuted samples have an index more extreme than the actual one.

I have multiple samples per group, and each has a different number of cells/regions defined as ROI. I am currently permuting the cells within the samples.

I have the following questions:

Although I can calculate a "p-value" analog by permutation for each sample, I am not sure how to combine these "p-values" across all samples in the same experimental group, so that I can tell whether the index is statistically significant for the group as a whole. (I believe this is a one-sample permutation test question, but it involves multiple samples.)
When comparing multiple groups of samples, would it be statistically sound to compare the index between groups if one or more groups have an index that is not "more extreme than" the permuted samples?
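
As a point of reference for the first question, one commonly used way to combine per-sample p-values is Fisher's method. The sketch below assumes the samples are independent and that each per-sample p-value is valid and strictly greater than 0 (which a permutation p-value is only if it is computed with a +1 correction). The values in `p_values` are made up for illustration, and `fisher_combine` is a helper defined here, not a function from any package.

```r
# Hypothetical per-sample permutation p-values for one experimental group
p_values <- c(sample1 = 0.04, sample2 = 0.10, sample3 = 0.03)

# Fisher's method: under the null, -2 * sum(log(p)) follows a chi-squared
# distribution with 2k degrees of freedom (k = number of independent samples)
fisher_combine <- function(p) {
  stopifnot(all(p > 0 & p <= 1))
  statistic <- -2 * sum(log(p))
  df <- 2 * length(p)
  p_combined <- pchisq(statistic, df = df, lower.tail = FALSE)
  list(statistic = statistic, df = df, p_value = p_combined)
}

combined <- fisher_combine(p_values)
print(combined)
```

Because permutation p-values are discrete, the chi-squared reference distribution is only approximate here, so the combined value is best read as a rough summary.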

Although it is not the focus of the current question, it would be great if you could point me to any established pipeline for such a test. Many protocols I found mention autocorrelation, but I am afraid that is not what I want. Thanks!

======update======== I have also posted a rephrased version of this question on Cross Validated.

Here is a toy dataset:

# toy dataset
set.seed(100)
group1_data1 <- matrix(data = rnorm(10000), nrow = 100, ncol = 100)
set.seed(200)
group1_data2 <- matrix(data = rnorm(10000), nrow = 100, ncol = 100)
set.seed(300)
group2_data1 <- matrix(data = rnorm(10000), nrow = 100, ncol = 100)
set.seed(400)
group2_data2 <- matrix(data = rnorm(10000), nrow = 100, ncol = 100)


rownames(group1_data1) <- paste0("gene_", 1:100) # can be another type of data, naming is not important here
colnames(group1_data1) <- c(paste0("region_A_", 1:33), paste0("region_B_", 1:33), paste0("region_C_", 1:34)) # regions are manually defined, and region_A is the ROI in this example

# Just an example of the calculation of the index, but this is not the main focus of the current question
calculate_indexA <- function(input) { 
  mean(input[, 1:33])
}
region_A_index <- calculate_indexA(group1_data1)

# Permutation test: shuffle the columns (region labels) using seeds 1 to 5
group1_data1_shuffles <- lapply(1:5, function(seed) {
  set.seed(seed)
  group1_data1[, sample(ncol(group1_data1))]
})

region_A_index_shuffle <- sapply(group1_data1_shuffles, calculate_indexA)

# Calculate the p-value (one-sided: fraction of shuffles with a larger index)
mean(region_A_index_shuffle > region_A_index)
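
The toy permutation above uses only five shuffles, so the smallest non-zero value it can report is 0.2. A minimal sketch with more permutations and the usual +1 correction could look like the following. It assumes the same kind of toy matrix, with the first 33 columns standing in for the ROI; `calculate_index`, `roi_cols`, and `n_perm` are names introduced here for illustration.

```r
# Toy data mirroring the example above: 100 genes x 100 regions,
# with the first 33 columns standing in for the ROI (region A)
set.seed(100)
toy <- matrix(rnorm(10000), nrow = 100, ncol = 100)
roi_cols <- 1:33

calculate_index <- function(mat, cols) mean(mat[, cols])
observed <- calculate_index(toy, roi_cols)

# Null distribution: randomly re-draw which columns count as the ROI.
# This is equivalent to shuffling the columns and taking the first 33.
n_perm <- 999
set.seed(1)
null_index <- replicate(n_perm, {
  calculate_index(toy, sample(ncol(toy), length(roi_cols)))
})

# One-sided p-value with the +1 correction, so it can never be exactly 0
p_upper <- (1 + sum(null_index >= observed)) / (1 + n_perm)

# Two-sided version: extremeness measured relative to the null mean
null_center <- mean(null_index)
p_two_sided <- (1 + sum(abs(null_index - null_center) >=
                          abs(observed - null_center))) / (1 + n_perm)

print(c(one_sided = p_upper, two_sided = p_two_sided))
```

Whether a one-sided or two-sided p-value is appropriate depends on whether only enrichment, or both enrichment and depletion, counts as "extreme" for the index.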

My questions are:

How can I get a combined "p-value" for group 1 as a whole (which comprises data1 and data2)?
When comparing multiple groups of samples, would it be statistically sound to compare the index between groups if one or more groups have an index that is not "more extreme than" the permuted samples?
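
One possible way to approach the first question, sketched here under assumptions rather than as an established pipeline, is a stratified permutation test: define a single group-level statistic (here the mean of the per-sample indices), permute the region labels within each sample separately, and recompute the group statistic each time. This yields one p-value for the group and naturally handles samples with different numbers of regions. The sample sizes, ROI columns, and the helper names `roi_index` and `group_stat` are hypothetical.

```r
# Two hypothetical samples from one group; they may differ in region count
set.seed(100)
sample1 <- matrix(rnorm(100 * 100), nrow = 100)
set.seed(200)
sample2 <- matrix(rnorm(100 * 80), nrow = 100)
group1 <- list(
  list(data = sample1, roi = 1:33),
  list(data = sample2, roi = 1:25)
)

# Per-sample index: mean over the columns treated as the ROI
roi_index <- function(s, cols = s$roi) mean(s$data[, cols])

# Group-level statistic: average of the per-sample indices.
# With permute = TRUE, the ROI columns are re-drawn within each sample.
group_stat <- function(samples, permute = FALSE) {
  mean(sapply(samples, function(s) {
    cols <- if (permute) sample(ncol(s$data), length(s$roi)) else s$roi
    roi_index(s, cols)
  }))
}

observed <- group_stat(group1)

n_perm <- 999
set.seed(1)
null_stat <- replicate(n_perm, group_stat(group1, permute = TRUE))

# One-sided group-level p-value with the +1 correction
p_group <- (1 + sum(null_stat >= observed)) / (1 + n_perm)
print(p_group)
```

Averaging the per-sample indices weights every sample equally; weighting by the number of regions would be another defensible choice and should be decided before looking at the results.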

single-cell RNA-seq • 952 views

score 0 · Answer 1 · 2023-03-07

20 months ago

LChart 4.5k

So you have, for a specific gene:

Cell nested in Sample nested in Group, and a set of (spatial) Regions (which I assume are the same for all cells and samples?). Now what is the question:

(1) Within Region R, is the gene expressed more in Group A than in Group B? (differential by biological group)

(2) Within Group A, is the gene expressed more in Region R than in Region S? (differential by spatial region)

(3) Is the difference between Region R and Region S more pronounced in Group A than in Group B? (interaction)

Also, can you explain a little more what you mean by "index"? What is it measuring, and how did you define it?
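
To make questions (2) and (3) concrete, here is a rough sketch that treats samples, not cells, as the independent units. It assumes each sample has already been reduced to one number for the gene: mean expression in Region R minus mean expression in Region S. The values in `contrast` are invented for illustration, and four samples per group is an assumption made so that the exact permutation has enough relabelings to be informative.

```r
# Hypothetical per-sample summaries for one gene: mean expression in
# Region R minus mean expression in Region S (one value per sample)
contrast <- c(A1 = 1.2, A2 = 0.9, A3 = 1.4, A4 = 1.1,
              B1 = 0.3, B2 = 0.6, B3 = 0.2, B4 = 0.5)
group <- rep(c("A", "B"), each = 4)

# Question (2): is the R - S contrast non-zero within Group A?
within_A <- t.test(contrast[group == "A"], mu = 0)

# Question (3): does the contrast differ between groups?
# Exact permutation test: enumerate every way to label 4 of 8 samples "A"
observed <- mean(contrast[group == "A"]) - mean(contrast[group == "B"])
relabel <- combn(length(contrast), 4)
null_diff <- apply(relabel, 2, function(idx) {
  mean(contrast[idx]) - mean(contrast[-idx])
})

# Two-sided p-value; the small tolerance guards against rounding error
p_interaction <- mean(abs(null_diff) >= abs(observed) - 1e-12)

print(within_A$p.value)
print(p_interaction)
```

With only two samples per group, as in the toy dataset, there are just six possible relabelings, so a label permutation of this kind cannot give a p-value below 1/3 (two-sided); a group-level comparison would then have little power whatever the per-sample results look like.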

Sorry for the late reply, I have updated my question with a toy dataset.

ADD REPLY • link 20 months ago by cwwong13 ▴ 40

I am interested in the answers to (2) and (3). However, I think (2) is not exactly what I want to solve, as I do not necessarily want to compare two regions. Rather, I would like to know whether the current definition of the region produces an index that is "significant" (rarer or more extreme than one obtained from a random definition of regions).

ADD REPLY • link 20 months ago by cwwong13 ▴ 40