I'm still a novice R user, but with the use of Seurat vignettes, Harvard Chan Bioinformatics tutorials, and OSCA, I've been able to learn quite a bit about single-cell analysis and can reasonably work my way from raw fastq files to cluster annotation. However, I'm struggling with going from a merged Seurat object (containing replicates from two conditions) to generating a DESeq2 object to perform DEG analysis across conditions (between specific clusters). I've been going through the following tutorial:
It's been excellent, but I don't think my understanding of base R is good enough to understand every step. I'm teaching myself more at this time, but is there a simpler way to get from a Seurat object to DESeq2 analysis? Or any example scripts anyone can provide? If so, I'd greatly appreciate it!
Part of the reason I want to do it this way rather than the FindMarkers() function is for processing time:
I agree that in my experience and from what I read from others pseudobulk appears to be more robust as summing cells eliminates sparseness of data and therefore (I think) makes fold change estimations more robust/meaningful. I use scuttle for the aggregation. Here is an example to get familiar which starts from a SingleCellExperiment. There is a conversion vignette to convert Seurat to SingleCellExperiment which is probably the easiest here.
library(scuttle)
#/ example data and some mock clusters:
set.seed(123)
sce <- mockSCE()
sce$Cluster <- sample(LETTERS[1:5], ncol(sce), replace=TRUE)
#/ we have several clusters and two treatments
colData(sce)
DataFrame with 200 rows and 4 columns
Mutation_Status Cell_Cycle Treatment Cluster
<character> <character> <character> <character>
Cell_001 negative S treat1 A
Cell_002 positive S treat1 D
Cell_003 negative G0 treat2 D
Cell_004 negative G1 treat2 A
Cell_005 positive G2M treat1 D
... ... ... ... ...
Cell_196 positive G0 treat2 D
Cell_197 negative G2M treat1 E
Cell_198 negative S treat1 C
Cell_199 negative G2M treat1 A
Cell_200 negative G1 treat1 B
#/ aggregate by cluster and treatment:
sum_by <-
c("Cluster", "Treatment")
summed <-
aggregateAcrossCells(sce, id=colData(sce)[,sum_by])
#/ add rownames using the information from the colData:
colnames(summed) <-
apply(colData(summed)[,sum_by], 1, function(x) paste(x, collapse="_"))
head(assay(summed, "counts"))
A_treat1 A_treat2 B_treat1 B_treat2 C_treat1 C_treat2 D_treat1 D_treat2
Gene_0001 218 695 667 384 383 512 809 307
Gene_0002 7029 7228 7088 6638 4355 7605 8013 5978
Gene_0003 251 354 254 68 209 612 389 106
Gene_0004 7586 5130 4950 3904 2640 4528 11968 10401
Gene_0005 19564 13940 10152 17015 10612 15363 18375 19609
Gene_0006 6024 3261 6874 4347 1411 3848 4875 2932
E_treat1 E_treat2
Gene_0001 319 803
Gene_0002 7151 7435
Gene_0003 168 301
Gene_0004 7263 7271
Gene_0005 14327 10150
Gene_0006 6739 4977
This is now a SummarizedExperiment with the raw pseudobulk counts that can go into DESeq2.
For a matrix of raw counts rather than the SummarizedExperiments use:
Do you want to run a pseudobulk analysis or treat each single cell as a replicate?
Pseudobulk analysis. For example, I have 10 samples - 5 control and 5 disease, with approx. 5000 to 8000 cells per sample.
Everything I've learned/read about single-cell analysis so far says that we should not be treating each cell as a replicate.