Question

Differential gene score in single cell experiment

0

Entering edit mode

13 months ago

Bastien Hervé 5.9k

Hello Biostars' community, long time not posting here, hope you are all doing well :)

I have a single cell multiome (RNA+ATAC in the same nuclei) experiment, made of 20 samples across 4 time points (tpCtrl,tpA,tpB,tpC) of a disease.

At a more advanced stage of my analysis, I am not interested anymore in neither gene expression or gene accessibility but rather in a "gene regulatory chromatin" score defined as the accessibility of peaks (putative enhancers) associated to a gene.

I have a sparse matrix of around 2500 genes x 60000 cells with floating values depicting a regulatory chromatin environment (details below).

My goal is to find out if some genes have significant differential regulatory chromatin environment between my time points, for example tpA vs. tpCtrl. In addition I would also need to add a qualitative covariate to the design.

My scores distribution is more or less similar to a single cell RNA expriment but my sparse matrix is not composed of integer raw counts but floating numbers.

I think it would still make sense to use a negative binomial distribution which would also capture the over dispersion of my data. Unfortunatly, pseudocounts methods like EdgeR or DESeq2, do not allow modified or normalized counts as starting point of the differential analysis.

In a simple manner, I looked into the FindMarkers from Seurat to calculate foldChange by a rowMeans function, however I am not able to include any covariate in this method.

Would ANCOVA gene by gene, followed by a Bonferroni correction, be a potential way to go ?

Scenic+ might be able to answer this question, but I have heard it is rather slow and eating a lot of RAM.

The "gene regulatory chromatin" score calculation :

Following this paper :

Using both RNA and ATAC, find out peaks correlated with the expression of nearby genes which outputs a list of links between peaks and gene with a z-score and p-value.

First, for each cell, the number of reads falling in each peak present in the list of links previously described (pvalue <0.05), will be divided by the number of reads falling into all the peaks in that given cell.

Which create a normalized peak matrix. (peaks x cells with normalized accessibility)

Second, for each cell and each gene present in the list of links, all peaks values associated to that gene are summed (from the normalized peak matrix).

For the sake of readability all values are multiply by a scaling factor of 10000.

Which create a normalized gene score matrix. (genes x cells with normalized gene score)

Scores are ranging from 0 to 130

single-cell • 1.2k views

ADD COMMENT • link updated 13 months ago by shunyip ▴ 250 • written 13 months ago by Bastien Hervé 5.9k

score 1 · Answer 1 · 2023-10-14

1

Entering edit mode

13 months ago

shunyip ▴ 250

From your question, I believe you are trying to find a single cell gene/peak read count differential analysis tool, which is technically not in the literature.

The Linnorm.limma function in the Linnorm library is what you're looking for. It performs normalization and log plus one transformation on the data then feeds it to limma for differential analysis.

Disclaimer: I am the author of Linnorm, an R library on Bioconductor.

ADD COMMENT • link 13 months ago by shunyip ▴ 250

0

Entering edit mode

Follow-up question: is there a reason you prefer log-pseudocount to a variance-stabilizing transformation like the DESeq2 VST or limma::voom ? I tend to treat the peak counts akin to gene counts and take the pseudobulk (one for each sample/condition) within each contrast and throw these into edgeR/limma. In your experience, is this approach inappropriate for single-cell ATAC?

ADD REPLY • link 13 months ago by LChart 4.5k

1

Entering edit mode

During Linnorm's development, I experimented with many ideas and methods that are similar to edgeR or DESeq, which didn't convert the data to log-pseudocount. However, I couldn't find a distribution (like Poisson, Negative Binomial, etc.) that fit the data well, especially with scRNA-seq. Later on, I found that transforming the data into Gaussian distribution (log-pseudocount) was an easier, computationally faster task and it also produced better results. Therefore, Linnorm went with this methodology.

edgeR/limma didn't consider some issues encountered by single cell read count data (like RNA-seq or ATAC-seq). However, they should work fine. Nevertheless, many review articles found that using single cell specific tool would result in better accuracy.

ADD REPLY • link 13 months ago by shunyip ▴ 250

0

Entering edit mode

Thanks for the kind reply, I am actually performing a transformation on the gene/peak counts before any differential analysis tool. I don't have any counts to play with anymore, for each gene in each cell I have a score and I want to compare this gene score between different time points.

ADD REPLY • link 13 months ago by Bastien Hervé 5.9k

1

Entering edit mode

Let me directly comment on the methodology to make it clearer.

Using both RNA and ATAC, find out peaks correlated with the expression of nearby genes which outputs a list of links between peaks and gene with a z-score and p-value.

Performing Linnorm transformation to both data prior to calculating correlation would reduce noise and improve your results.

First, for each cell, the number of reads falling in each peak present in the list of links previously described (pvalue <0.05), will be divided by the number of reads falling into all the peaks in that given cell.

This is similar to a CPM normalization, but without the million part and total read count of the sample is replaced by total read count of the peaks in the sample. It doesn't significantly change the underlying distribution of the data. If you performed Linnorm transformation earlier, you could simply do a expm1 (an R function) on the data; then it basically replaced this step.

Which create a normalized peak matrix. (peaks x cells with normalized accessibility)

Second, for each cell and each gene present in the list of links, all peaks values associated to that gene are summed (from the normalized peak matrix).

Summing doesn't change the distribution of the data. Feel free to do that.

For the sake of readability all values are multiply by a scaling factor of 10000.

Linnorm has already applied the most appropriate scaling factor for you. (It is called lambda in the methodology) You may skip this step.

Which create a normalized gene score matrix. (genes x cells with normalized gene score)

Scores are ranging from 0 to 130

The ranges have changed but, in either case, this is still a normalized read count matrix. You can perform a log1p transformation then apply limma here to get a pvalue for each peak.

ADD REPLY • link 13 months ago by shunyip ▴ 250