Correct way of analyzing cell proportions in single-cell data

Hello

In Seurat there is a function to get the proportions of each cell identity, so you can easily plot them with ggplot2 or something similar. However, most scRNA-seq datasets I have seen (I mostly reanalyze data) have different sample sizes for each condition, so I'm not sure that simply taking the raw proportions of cells is adequate; I believe you would need to normalize this somehow. The first thing that comes to mind is dividing the counts of each cell identity by the number of samples in each condition, but that still doesn't make much sense to me, since samples within the same condition can also vary a lot in their cell identities. Here the authors plot the log2 of relative proportions, which I believe is a kind of Z-score, but that still seems a bit odd to me, as they have different numbers of samples in each status.

I couldn't find any Seurat vignette addressing this. Any solutions? Does my concern make sense?

scRNA-seq single-cell
ADD COMMENT

Hi, this is a very important and helpful question. However, I am a little unsure why we can't just perform a standard Fisher's exact test or chi-square test here. Let's say I have 5 clusters in condition A and 5 clusters in condition B. Can't I just compare the proportion of cells in each cluster across conditions (even if the sample sizes are different) and ask whether the difference in proportions I am observing is significant or not? Sorry if this question is too dumb. I would really appreciate any insights on this.
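
For concreteness, here is roughly what I mean in R, with made-up cell counts (my worry is whether it's valid to treat each cell as an independent observation, since cells from the same biological sample arguably are not):

    ## hypothetical cluster-by-condition table of cell counts
    counts <- matrix(
      c(120,  80,    # cluster1: cells in condition A, condition B
        300, 150,    # cluster2
         60,  90,    # cluster3
        200, 210,    # cluster4
         50,  40),   # cluster5
      ncol = 2, byrow = TRUE,
      dimnames = list(paste0("cluster", 1:5), c("A", "B"))
    )

    ## overall test: is the cluster composition the same in A and B?
    chisq.test(counts)

    ## per-cluster 2x2 test: cells in one cluster vs. all other cells, by condition
    k <- "cluster3"
    tab2x2 <- rbind(in_cluster = counts[k, ],
                    other      = colSums(counts) - counts[k, ])
    fisher.test(tab2x2)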

ADD REPLY

To compare cell proportions between conditions, I've found a Monte Carlo/permutation test to be the most sensible and robust approach. The null hypothesis you want to test against is that the difference in cell proportions for each cluster between conditions is just a consequence of randomly sampling some number of cells for sequencing in each condition. To generate this null distribution, you "pool" the cells from both conditions together, and then randomly assign them back to the two conditions, maintaining the original sample sizes. You then recalculate the proportional difference between the two conditions for each cluster, and compare that to the observed proportional difference for each cluster. I tend to take the log2 difference in proportions since it's a more sensible scale. Repeat this process about 10,000 times, and the p-value is the number of simulations where the simulated proportional difference was as or more extreme than observed (plus one) over the total number of simulations (plus one).
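
To make that concrete, here is a minimal base-R sketch of the procedure. This is not the scProportionTest code itself, and it assumes a per-cell metadata data.frame with placeholder columns "cluster" and "condition" (exactly two conditions):

    set.seed(1)

    permutation_prop_test <- function(meta, n_perm = 10000) {
      conditions <- unique(meta$condition)
      stopifnot(length(conditions) == 2)
      clusters <- sort(unique(meta$cluster))

      ## log2 ratio of per-condition cluster proportions (condition 2 vs condition 1);
      ## zero counts would give +/-Inf, so a pseudo-count may be needed in practice
      log2_prop_diff <- function(cond_labels) {
        tab   <- table(factor(meta$cluster, levels = clusters), cond_labels)
        props <- prop.table(tab, margin = 2)
        log2(props[, conditions[2]] / props[, conditions[1]])
      }
      obs <- log2_prop_diff(meta$condition)

      ## null distribution: shuffle the condition labels over the pooled cells,
      ## keeping the original number of cells per condition
      n_extreme <- numeric(length(clusters))
      for (i in seq_len(n_perm)) {
        perm <- log2_prop_diff(sample(meta$condition))
        n_extreme <- n_extreme + (abs(perm) >= abs(obs))
      }

      data.frame(
        cluster    = clusters,
        obs_log2FD = as.numeric(obs),
        p_value    = (n_extreme + 1) / (n_perm + 1)
      )
    }

    ## toy example
    meta <- data.frame(
      cluster   = sample(paste0("C", 1:5), 3000, replace = TRUE),
      condition = rep(c("control", "treated"), c(1200, 1800))
    )
    permutation_prop_test(meta, n_perm = 1000)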

Since I found myself having to do this so many times, I made a little R library for myself that takes a Seurat object, performs a permutation test for p-values (and adjusted p-values), and generates a plot with the observed proportional difference and a bootstrapped confidence interval for each cluster.

https://github.com/rpolicastro/scProportionTest
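
The basic workflow is in the README and looks roughly like this (the cluster/sample identity columns and group labels below are placeholders for your own metadata; double-check the exact argument names against the README):

    library("Seurat")
    library("scProportionTest")

    prop_test <- sc_utils(seurat_object)        # wrap an existing Seurat object

    prop_test <- permutation_test(
      prop_test,
      cluster_identity = "seurat_clusters",     # metadata column holding the clusters
      sample_identity  = "orig.ident",          # metadata column holding the conditions
      sample_1         = "control",             # placeholder group labels
      sample_2         = "treated"
    )

    permutation_plot(prop_test)                 # obs_log2FD with a bootstrapped CI per cluster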

ADD COMMENT

Based, I wish I could like your response 10 times. If you have an article describing it, please let me know so I can cite it.

ADD REPLY

Very nice implementation!

How do you feel about the log2FD value? Is 0.58 the lowest value we could use? I know it goes back to a fold change of 1.5, but it seems to me that this value can be somewhat arbitrary. I have used your library on my data and I'm testing different obs_log2FD cutoffs.

Thanks a lot for posting it!! Cheers!

ADD REPLY

I'm glad you've found some use for it!

I personally use a Log2FC of 1, corresponding to a doubling of abundance. I know it's somewhat arbitrary, but I consider that a big enough magnitude to likely be real and interesting. If you want to use a Log2FC of 0.58, I would also visually consider the bootstrapped CI, since it represents your certainty about the FC value too.

ADD REPLY

Hi @rpolicastro, to pick up on this question I want to ask for a clarification. I did this analysis, but I'm not sure whether the plot shows the significance of differences in sample 1 compared to sample 2. In my case, the proportions of cell types in the different affected statuses are as below:

[image: cell-type proportions by affected status]

But when I did the permutation test I expected to see something not that different. The result is as below:

[image: permutation test results]

So, if the result is a comparison of sample 1 = ALS versus sample 2 = control, is it true that most of the subpopulations are overrepresented in ALS, in contrast to the first plot? For example, in the first plot the OL population is overrepresented in ALS, but in the second it's not. I'd appreciate any help.

ADD REPLY

You should open your own question for this rather than trying to piggyback on another. You're more likely to get a response that way and it helps keep the site organized.

ADD REPLY

1. Why is the log2 scale more sensible? Is it just so that the numbers on the scale look nicer, or is there a more important reason?
2. By "log2 difference in proportions" do you mean the difference between the log2 of the mean proportion in Condition1 and the log2 of the mean proportion in Condition2?
3. Can one adapt the library to work with an object not from Seurat (e.g. a matrix of proportions for Control and another matrix of proportions for Treatment)?

ADD REPLY

The scProportionTest method consistently yielded statistically significant results (p-value < 0.05), which is what people usually want; however, the observed direction of change often contradicted the one obtained from a boxplot (which takes one sample, rather than one cell, as an observation), as pointed out by @paria.

The classical wilcox.test is widely used by researchers to compare the proportion of a specific cell type between sample groups. It takes one sample (rather than one cell) as an observation, which makes it robust, though it may not yield significant results when there are few samples.

The scProportionTest method internally uses permutations to compare the risk ratios for two categorical variables, treating each cell as an observation. This is why it often yields significant results: the effective sample size (the number of cells) is very large. I don't think this is a good method compared with wilcox.test, since it will often exaggerate statistical significance. Similar results can also be achieved with chisq.test (maybe I'm wrong; I'm not a statistician).
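
As a rough sketch of that per-sample approach (the metadata columns sample/condition/cluster and the cell type "OL" below are just placeholders):

    set.seed(1)

    ## toy per-cell metadata: 6 samples, 3 per condition, 500 cells each
    meta <- data.frame(
      sample    = rep(paste0("S", 1:6), each = 500),
      condition = rep(c("ALS", "control"), each = 3 * 500),
      cluster   = sample(c("OL", "Astro", "Micro", "Neuron"), 6 * 500, replace = TRUE)
    )

    ## per-sample cell-type proportions: one observation per sample, not per cell
    tab  <- table(meta$sample, meta$cluster)
    prop <- prop.table(tab, margin = 1)
    cond <- meta$condition[match(rownames(tab), meta$sample)]

    ## compare the per-sample proportions of one cell type between conditions
    wilcox.test(prop[, "OL"] ~ cond)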

ADD REPLY
