I'm working with some single cell RNA-seq data and have been primarily using Seurat and its various wrappers for other packages to perform my analysis. For differential expression, I have been using the Seurat function FindMarkers()
with the method set to "MAST", which is another R package used for differential expression that Seurat provides a built in option for.
Having done a simple DE analysis with discreet variables (condition1 vs condition2)
, I now want to perform the analysis where the regressor is a continuous variable. The issue is that the FindMarkers()
function only allows for discreet comparisons (ident.1 vs ident.2)
. Therefore by tracking down the code Seurat uses for FindMarkers()
with option "MAST", I've recreated the Seurat method of "MAST" implementation and adjusted it to be compatible with continuous variables.
Seurat (v4.3) FindMarkers()
calculates the log2 fold-change by taking the exponent -1 of the data slot of the Seurat Object to get the library normalized counts, calculates the gene means (rowMeans)
, adds a pseudocount of 1 to each mean and then log2 transforms the result separately for the cells specific to your two conditions, then subtracts the log2(mean) for condition 2 from the log2(mean) for condition 1:
log1pdata.mean.fxn <- function(x) {
return(log(x = rowMeans(x = expm1(x = x)) + pseudocount.use, base = 2))
}
data.1 <- log1pdata.mean.fxn(IdentObject@assays$RNA@data[features, cells.1, drop = FALSE])
data.2 <- log1pdata.mean.fxn(IdentObject@assays$RNA@data[features, cells.2, drop = FALSE])
fc <- (data.1 - data.2)
(I know Seurat changes the way logFCs were calculated in v.5, where the pseudocount is now added to the row sums prior to dividing by the number of cells to get the mean, so the pseudocount is now a very small fraction and so logFCs are less deflated, but I am sticking with v4.3 for the sake of consistency.)
MAST also outputs a beta coefficient, that (as I understand it) you can take the exponent of and then log2 transform to get the fold change from the regression rather than this manual calculation.
Q1. Are these two methods of calculating the log2FC ( method 1: manual calculation of log2(mean(condition1)/mean(condition2))
or log2(mean(condition1) - log2(mean(condition2))
and method 2: the log2FC from the beta coefficient ) equally valid? In my case, even for the the analysis with the discreet variables, the log2FCs calculated by these two methods are different, which I assume is fine and normal but I would like some clarification.
Q2. Now, for the case where the regressor is the continuous variable, how do I calculate the log2FC manually (method 1)? I want to use this method because it is what Seurat uses to generate the log2FCs that the FindMarkers()
function returns, and again I want to remain consistent. I understand that in this case the fold change is the stepwise change for each unit of the continuous variable. If log2(mean(condition1)/mean(condition2))
generates the log2 fold change for the discreet variable, is taking the mean of the fold changes for each unit change and log2 transforming the result - log2(mean((mean(unit1)/mean(unit2)), (mean(unit2)/mean(unit3)),(mean(unit3)/mean(unit4)),......))
- the equivalent of this for the continuous variable? Or is there another way to correctly do this calculation?
Q3. I noticed that my manual method (log2(mean((mean(unit1)/mean(unit2)), (mean(unit2)/mean(unit3)),(mean(unit3)/mean(unit4)),......))
) for calculating the log2FC for the continuous variable can differ significantly from the log2FCs generated from the beta coefficients. Is this expected? I'm assuming this is possible if the expression data for a gene has a high variance and the average change in expression per unit of the continuous variable is not unidirectional i.e consistently increasing/decreasing, but again some clarification would be nice.
I would not bother and export the logcounts, and test with limma which supports any design of arbitrary complexity. It will give (moderated) logFCs.
If I was starting the project again perhaps I would but I've already put a fair bit of work into the analysis and visualization of the discrete variable DE results and so I at least want to try and work out if I can do something similar with the continuous variable.
Maybe it's because I hit that wall too often, but these days I prefer robust results obtained by expert-written software over the mental reward of obtaining something that "I figured out myself". After all, you need results. Nobody will reward you for the effort itself (unfortunately).