Question

Correlating genes across multiple scRNA-seq datasets

4.0 years ago

Wede ▴ 20

Hi there!

I'm analyzing multiple scRNA-seq datasets from the same tissue (different individuals). I ran the standard Seurat analysis pipeline on each dataset and obtained different clusters and their associated genes. What I am trying to do is compare clusters between datasets to identify co-occurring genes.

I did some simple things, such as counting co-occurrences, and found some genes that co-occur in almost every dataset. However, I'm not really satisfied with this descriptive analysis and want to do some statistical analysis to see whether there is a correlation. I'm a bit lost as to how to proceed and which test I'm supposed to use. Any tips to share?

Best, Wede

scRNA-seq R • 1.1k views

score 2 · Answer 1 · 2021-05-20

I'm not sure there's a tool for exactly what you're looking for, but there are R tools that probably come close. One I've seen recently is WebGestalt, which now has an R package (WebGestaltR) if you prefer R over web interfaces.

From a theoretical standpoint, what you're looking to do is identify optimal subsets of genes that describe a larger set or set of sets. In computer science, this task is solved by weighted set cover. Tools like WebGestalt that implement (sub)set comparisons include a distance measure over enriched sets (Jaccard distance is a common one; WebGestalt uses the Sumer R package for this). There are probably a few ways to attach a p-value to such a distance measure, some of which involve permutation or bootstrap methods. But since these tools are algorithm-based, they often don't try to force the results into a framework of probability/significance and instead just report relevant weights for each gene set.
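To make the ideas in the previous paragraph a bit more concrete, here is a minimal sketch of a Jaccard distance between two gene sets, a permutation p-value for that distance, and a greedy approximation to weighted set cover. It is written in plain Python only to keep it dependency-free; it is not how WebGestalt or Sumer implement any of this. All function names, gene names, set names and weights are made up for illustration, and the null model (random gene sets of the same sizes drawn uniformly from a background list) is an assumption that ignores dependence between genes.

```python
import random


def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B| (0 = identical sets, 1 = disjoint sets)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty sets count as identical
    return 1.0 - len(a & b) / len(union)


def permutation_p_value(set_a, set_b, background, n_perm=2000, seed=0):
    """Estimate how often random sets of the same sizes are at least as similar."""
    rng = random.Random(seed)
    pool = sorted(background)  # fixed order so the seed gives reproducible draws
    observed = jaccard_distance(set_a, set_b)
    hits = 0
    for _ in range(n_perm):
        rand_a = rng.sample(pool, len(set_a))
        rand_b = rng.sample(pool, len(set_b))
        if jaccard_distance(rand_a, rand_b) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


def greedy_weighted_set_cover(universe, candidate_sets, weights):
    """Repeatedly pick the set with the lowest weight per newly covered gene."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best_name, best_ratio = None, None
        for name in sorted(candidate_sets):  # sorted for deterministic ties
            gain = len(uncovered & candidate_sets[name])
            if gain == 0:
                continue
            ratio = weights[name] / gain
            if best_ratio is None or ratio < best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            break  # the remaining genes are not in any candidate set
        chosen.append(best_name)
        uncovered -= candidate_sets[best_name]
    return chosen


# Hypothetical marker genes for one cluster in each of two datasets.
background = {f"GENE{i}" for i in range(200)}
cluster_a = {f"GENE{i}" for i in range(0, 20)}
cluster_b = {f"GENE{i}" for i in range(10, 30)}
observed, p_value = permutation_p_value(cluster_a, cluster_b, background)
print(f"Jaccard distance = {observed:.3f}, permutation p = {p_value:.4f}")

# Hypothetical gene sets with made-up weights (lower weight = preferred).
marker_sets = {
    "set_1": {"GENE_A", "GENE_B", "GENE_C"},
    "set_2": {"GENE_C", "GENE_D"},
    "set_3": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
}
set_weights = {"set_1": 1.0, "set_2": 1.0, "set_3": 3.0}
all_genes = set().union(*marker_sets.values())
cover = greedy_weighted_set_cover(all_genes, marker_sets, set_weights)
print("Selected sets:", cover)
```

In practice the background would be the genes actually detected in your datasets, and a uniform draw is a weak null for scRNA-seq data because expression level and gene-gene dependence both affect which genes end up as markers. Treat the resulting p-value as a rough screen and not as a substitute for a model-based analysis.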

Note: the following assumes your data has some kind of experimental design around your individuals (treatment, control). Without some kind of grouping of samples with 3 or more individuals per group, you won't be able to use purely statistical methods.

If you're looking for a more statistically oriented approach that assumes probability models from the start, I would search for Bayesian methods that deal specifically with single-cell RNA-seq data. A Google search turns up a few (BASiCS, iDEA), but the keywords to look for would be single-cell RNA-seq + Bayesian + enrichment. These methods will give you posterior probabilities (hopefully over enriched sets), which carry more information than just a weight from an algorithm.

These integrative Bayesian models are going to be better than just counting individual occurrences of genes and trying to do univariate statistics over those. The fact that your genes cluster suggests that they are somehow related to or dependent on one another. You don't want to throw away that information by jumping to simple methods; try one of these Bayesian approaches that will model it instead. You'll need to do some searching to find the right one for your experiment, though.

Those papers will at least give you a starting point. I wouldn't say there is one agreed-upon answer for how to do GSEA-based post-hoc analyses, though.