I am trying to perform a more faster way of differential analysis between KO and WT single cell data. I have intergated the data using Seurat
pipeline and the differential expression using DESeq2
Seurat
implementation. As I have about 4k cells in each clusters, it is taking forever. Meanwhile I found the edgeR
and make it working based on consolidating many biostar discussions and the edgeR
manual as below
## subset Seurat object for the cluster
seurat_object_cellfor_edgeR <- subset(seurat_object, subset = seurat_cluster == 0)
# count matrix
counts <- as.matrix(seurat_object_cellfor_edgeR@assays$RNA@counts)
counts <- counts[Matrix::rowSums(counts >= 1) >= 1, ]
# subset the meta data fro filtered gene/cells
metadata <- seurat_object_cellfor_edgeR@meta.data
metadata <- metadata[,c("nCount_RNA", "cell_info")]
metadata <- metadata[colnames(counts),]
### Make single cell experiment
sce <- SingleCellExperiment(assays = counts,
colData = metadata)
## convert to edgR object
dge <- convertTo(sce, type="edgeR", assay.type = 1)
meta_dge <- dge$samples
meta_dge <- meta_dge[,c("lib.size","norm.factors")]
meta_dge <- cbind(meta_dge, metadata)
meta_dge$group <- factor(meta_dge$cell_info)
meta_dge$group <- relevel(meta_dge$group, "KO")
dge$samples <- meta_dge
model fit
dge <- calcNormFactors(dge)
design <- model.matrix(~0+group, data=dge$samples)
dge <- estimateDisp(dge, design = design)
fit <- glmQLFit(dge, design = design)
Differential expression test
qlf_as.is <- glmQLFTest(fit)
# top genes
headqlf_as.is$table)
logFC logCPM F PValue
AL627309.1 -15.82462 8.083615 1355354.8 0
AL669831.5 -15.42607 8.111567 444435.6 0
LINC00115 -15.76849 8.087425 576319.8 0
FAM41C -15.66446 8.094247 501766.5 0
AL645608.3 -15.76752 8.087299 591389.5 0
AL645608.1 -15.83204 8.083359 1661128.7 0
and found to be all the genes are differentially regulated, no FDR and all p value is zero. So I explored the contrast for the testing and that gave me much more reasonable and expected results (but the FC is too low!)
my.contrasts <- makeContrasts(KO_vs_WT = groupKO-groupWT, levels=design)
qlf.contrast <- glmQLFTest(fit, contrast=my.contrasts[,"KO_vs_WT"])
head(qlf.contrast$table)
logFC logCPM F PValue
AL627309.1 -0.02402289 8.083615 36.0058677 0.3631278
AL669831.5 -0.16083392 8.111567 6.8890345 0.1892858
LINC00115 0.02784262 8.087425 2.5886375 0.6827443
FAM41C -0.02251330 8.094247 0.4922016 0.8021113
AL645608.3 -0.02544709 8.087299 2.4978086 0.6936892
AL645608.1 0.03523809 8.083359 48.3970011 0.3426162
With this back ground my questions are
- Which one of this method is correct? If none of them, how to perform the differential testing?.
- Why the FDR is not present in any one of the model. Am I missing something?
- Can I include mitochondria read percentage as a covariant and is it require to include UMI count for the model?
Any help is highly appreciated!.
Can you post a snippet of what your
design
looks like? and alsomy.contrasts
. I think without knowing what your design/test is, it's hard to get an understanding.For starts, the
glmQLFTest
requires you to specify either acoef
or acontrast
.eg:
glmQLFTest(fit, contrast=my.contrasts)
(there's no need to subset themy.contrasts
object)I think this is the reason why no fdr/pvalue is given. edgeR is unsure what to test.
Final point, can you try
topTags
function to view the results instead of head? Not sure if it will make a difference but this is the method I always use (and is noted in the manual)Thanks. Yes, the
topTags
gives the FDR, but I am not sure why I am getting too low fold changeThank you. Your design and contrasts are fine and you should perform the glmQLFTest as described above. Regarding your question about FC, not 100% sure sorry it's outside of my expertise. You should read these to get a better understanding: https://support.bioconductor.org/p/108762/ and https://www.rdocumentation.org/packages/edgeR/versions/3.14.0/topics/predFC
I take the logFC values produced by edgeR at face value, that is if you average the counts and take the difference between groups (and factor in lib differences). Of course as noted in by Prof Smyth, it's much more complicated than simply adding and dividing. In other words, you have small logFC values because the expression differences between those genes is small. Don't get fixated on the logFC values of genes. You should consider the overall "picture" of the expression profile between your groups. A small fold change can still have a remarkable effect on the biological system you're studying.