Hello Biostars' community, long time not posting here, hope you are all doing well :)
I have a single cell multiome (RNA+ATAC in the same nuclei) experiment, made of 20 samples across 4 time points (tpCtrl,tpA,tpB,tpC) of a disease.
At a more advanced stage of my analysis, I am not interested anymore in neither gene expression or gene accessibility but rather in a "gene regulatory chromatin" score defined as the accessibility of peaks (putative enhancers) associated to a gene.
I have a sparse matrix of around 2500 genes x 60000 cells with floating values depicting a regulatory chromatin environment (details below).
My goal is to find out if some genes have significant differential regulatory chromatin environment between my time points, for example tpA vs. tpCtrl. In addition I would also need to add a qualitative covariate to the design.
My scores distribution is more or less similar to a single cell RNA expriment but my sparse matrix is not composed of integer raw counts but floating numbers.
I think it would still make sense to use a negative binomial distribution which would also capture the over dispersion of my data.
Unfortunatly, pseudocounts methods like EdgeR
or DESeq2
, do not allow modified or normalized counts as starting point of the differential analysis.
In a simple manner, I looked into the FindMarkers
from Seurat to calculate foldChange by a rowMeans function, however I am not able to include any covariate in this method.
Would ANCOVA gene by gene, followed by a Bonferroni correction, be a potential way to go ?
Scenic+ might be able to answer this question, but I have heard it is rather slow and eating a lot of RAM.
The "gene regulatory chromatin" score calculation :
Following this paper :
Using both RNA and ATAC, find out peaks correlated with the expression of nearby genes which outputs a list of links between peaks and gene with a z-score and p-value.
First, for each cell, the number of reads falling in each peak present in the list of links previously described (pvalue <0.05), will be divided by the number of reads falling into all the peaks in that given cell.
Which create a normalized peak matrix. (peaks x cells with normalized accessibility)
Second, for each cell and each gene present in the list of links, all peaks values associated to that gene are summed (from the normalized peak matrix).
For the sake of readability all values are multiply by a scaling factor of 10000.
Which create a normalized gene score matrix. (genes x cells with normalized gene score)
Scores are ranging from 0 to 130
Follow-up question: is there a reason you prefer log-pseudocount to a variance-stabilizing transformation like the DESeq2 VST or limma::voom ? I tend to treat the peak counts akin to gene counts and take the pseudobulk (one for each sample/condition) within each contrast and throw these into edgeR/limma. In your experience, is this approach inappropriate for single-cell ATAC?
During Linnorm's development, I experimented with many ideas and methods that are similar to edgeR or DESeq, which didn't convert the data to log-pseudocount. However, I couldn't find a distribution (like Poisson, Negative Binomial, etc.) that fit the data well, especially with scRNA-seq. Later on, I found that transforming the data into Gaussian distribution (log-pseudocount) was an easier, computationally faster task and it also produced better results. Therefore, Linnorm went with this methodology.
edgeR/limma didn't consider some issues encountered by single cell read count data (like RNA-seq or ATAC-seq). However, they should work fine. Nevertheless, many review articles found that using single cell specific tool would result in better accuracy.
Thanks for the kind reply, I am actually performing a transformation on the gene/peak counts before any differential analysis tool. I don't have any counts to play with anymore, for each gene in each cell I have a score and I want to compare this gene score between different time points.
Let me directly comment on the methodology to make it clearer.
Using both RNA and ATAC, find out peaks correlated with the expression of nearby genes which outputs a list of links between peaks and gene with a z-score and p-value.
First, for each cell, the number of reads falling in each peak present in the list of links previously described (pvalue <0.05), will be divided by the number of reads falling into all the peaks in that given cell.
Which create a normalized peak matrix. (peaks x cells with normalized accessibility)
Second, for each cell and each gene present in the list of links, all peaks values associated to that gene are summed (from the normalized peak matrix).
For the sake of readability all values are multiply by a scaling factor of 10000.
Which create a normalized gene score matrix. (genes x cells with normalized gene score)
Scores are ranging from 0 to 130