Question

hypergeometric test for under-representation/over-representation of cells in a cluster (scRNA-seq) (in R)

3.7 years ago

giegie • 0

I have a single-cell RNA-seq dataset. I am trying to find out whether cells from time-point 1 are more or less abundant than cells from time-point 2 in cluster 1. The time-points are imbalanced, so I think the hypergeometric test would be suitable here, but I am not sure how to apply it.

Cluster 1

time-point 1 (93 cells)
time-point 2 (261 cells)

total number of cells coming from time-point 1 (597 cells)
total number of cells coming from time-point 2 (2014 cells)

Here I found an example of the application but I am not sure if I put the values for my case correctly.

Test for under-representation (depletion)

http://mengnote.blogspot.com/2012/12/calculate-correct-hypergeometric-p.html

phyper(hitInSample, hitInPop, failInPop, sampleSize, lower.tail= TRUE)

phyper(93, 597, 2014, 354, lower.tail= TRUE)

[1] 0.9548432

So that would mean that time-point 1 is not under-represented in cluster 1?

Is that correct?
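
For reference, here is a minimal sketch of both one-sided tests in R, using the counts given above. The variable names are illustrative only. The over-representation call subtracts 1 from the hit count because `phyper(..., lower.tail = FALSE)` returns P(X > q), not P(X >= q). The `fisher.test` calls are a cross-check, since the one-sided Fisher's exact test on the 2x2 table uses the same hypergeometric null.

```r
# Counts taken from the question
hit_in_sample <- 93        # time-point 1 cells in cluster 1
sample_size   <- 93 + 261  # all cells in cluster 1 (354)
hit_in_pop    <- 597       # all time-point 1 cells
fail_in_pop   <- 2014      # all time-point 2 cells

# Expected number of time-point 1 cells in cluster 1 under the null
expected <- sample_size * hit_in_pop / (hit_in_pop + fail_in_pop)

# Under-representation (depletion): P(X <= 93)
# The question reports 0.9548432 for this call.
p_under <- phyper(hit_in_sample, hit_in_pop, fail_in_pop, sample_size,
                  lower.tail = TRUE)

# Over-representation (enrichment): P(X >= 93) = P(X > 92)
p_over <- phyper(hit_in_sample - 1, hit_in_pop, fail_in_pop, sample_size,
                 lower.tail = FALSE)

# Cross-check with one-sided Fisher's exact tests on the 2x2 table
tab <- matrix(c(93, 261, 597 - 93, 2014 - 261), nrow = 2,
              dimnames = list(timepoint = c("tp1", "tp2"),
                              cluster   = c("cluster1", "other")))
p_under_fisher <- fisher.test(tab, alternative = "less")$p.value
p_over_fisher  <- fisher.test(tab, alternative = "greater")$p.value

print(c(expected = expected, p_under = p_under, p_over = p_over))
```

The observed count (93) is above the expected count of roughly 81, so the depletion test is the uninformative tail here; the enrichment p-value is the one to look at.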

rna test • 2.3k views

updated 3.7 years ago by rpolicastro 13k • written 3.7 years ago by giegie • 0

score 1 · Answer 1 · 2021-03-22

I'm not sure the hypergeometric test is the best approach here. The proportion of cells in one cluster is dependent on the proportion of cells in all other clusters, so you could for example see depletion of all other cell types if there is a proportionally large increase in just one cell type. This is often called compositional bias. Furthermore, differences in sequencing depth could also lead to differences in cell proportions, since smaller populations would experience higher variance at varying sequencing depths. You really want to model or approximately model this as a multinomial distribution with additional statistical considerations.

Various methods of note have been developed to deal with these problems. In Bioconductor OSCA they treat differential abundance similarly to differential gene expression. They model each population using a negative binomial distribution, then correct for library size differences, stabilize variance, and later consider compositional bias in a post-hoc manner. An alternative approach, scCODA, models the question more directly using an actual multinomial distribution. They further account for uncertainty in clustering by taking a Bayesian approach, and also account for overdispersion using a Dirichlet-multinomial. There are also methods such as DA-seq that don't use precomputed clusters. I haven't worked with these much since I usually want to calculate differential composition/abundance relative to some "ground truth" clusters. Alternatively, you could also model the entire question using a Monte Carlo simulation and approximate the results you would get using the more advanced methods.
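
One possible reading of the Monte Carlo suggestion is a simple label-permutation test, sketched below with the counts from the question. This is an assumption about what such a simulation could look like, not a reimplementation of OSCA, scCODA, or DA-seq. Cluster membership is held fixed and the time-point labels are shuffled across cells, so the null is the same as the hypergeometric one. It therefore does not address compositional bias or sample-to-sample variability, which would need per-sample counts for every cluster. The seed, the number of permutations, and all variable names are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Rebuild cell-level labels from the counts in the question
timepoint  <- rep(c("tp1", "tp2"), times = c(597, 2014))
in_cluster <- c(rep(c(TRUE, FALSE), times = c(93, 597 - 93)),    # tp1 cells
                rep(c(TRUE, FALSE), times = c(261, 2014 - 261))) # tp2 cells

# Observed number of time-point 1 cells in cluster 1
observed <- sum(in_cluster & timepoint == "tp1")

# Null distribution: shuffle time-point labels, keep cluster membership fixed
n_perm <- 10000  # arbitrary number of permutations
null_counts <- replicate(n_perm, {
  shuffled <- sample(timepoint)
  sum(in_cluster & shuffled == "tp1")
})

# Empirical one-sided p-values; the +1 keeps them from being exactly 0
p_over_mc  <- (sum(null_counts >= observed) + 1) / (n_perm + 1)
p_under_mc <- (sum(null_counts <= observed) + 1) / (n_perm + 1)

print(c(p_over_mc = p_over_mc, p_under_mc = p_under_mc))
```

With enough permutations these empirical p-values should approach the corresponding `phyper` results, which makes the sketch mainly useful as a starting point for richer simulations (for example, resampling whole samples instead of individual cells).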