Question

EdgeR for single cell differential expression analysis

0

Entering edit mode

5.7 years ago

kvshamsudheen ▴ 120

I am trying to perform a more faster way of differential analysis between KO and WT single cell data. I have intergated the data using Seurat pipeline and the differential expression using DESeq2 Seurat implementation. As I have about 4k cells in each clusters, it is taking forever. Meanwhile I found the edgeR and make it working based on consolidating many biostar discussions and the edgeR manual as below

## subset Seurat object for the cluster     
seurat_object_cellfor_edgeR <- subset(seurat_object, subset = seurat_cluster == 0)

# count matrix 
counts <- as.matrix(seurat_object_cellfor_edgeR@assays$RNA@counts)
counts <- counts[Matrix::rowSums(counts >= 1) >= 1, ]
# subset the meta data fro filtered gene/cells
metadata <- seurat_object_cellfor_edgeR@meta.data
metadata <- metadata[,c("nCount_RNA", "cell_info")]
metadata <- metadata[colnames(counts),]

### Make single cell experiment
sce <- SingleCellExperiment(assays = counts, 
                                    colData = metadata)

## convert to edgR object
dge <- convertTo(sce, type="edgeR", assay.type = 1)
meta_dge <- dge$samples
meta_dge <- meta_dge[,c("lib.size","norm.factors")]
meta_dge <- cbind(meta_dge, metadata)
meta_dge$group <- factor(meta_dge$cell_info)
meta_dge$group <- relevel(meta_dge$group, "KO")
dge$samples <- meta_dge

model fit

dge <- calcNormFactors(dge)
design <- model.matrix(~0+group, data=dge$samples)
dge <- estimateDisp(dge, design = design)    
fit <- glmQLFit(dge, design = design)

Differential expression test

qlf_as.is <- glmQLFTest(fit)
# top genes
 headqlf_as.is$table)
               logFC   logCPM         F PValue
AL627309.1 -15.82462 8.083615 1355354.8      0
AL669831.5 -15.42607 8.111567  444435.6      0
LINC00115  -15.76849 8.087425  576319.8      0
FAM41C     -15.66446 8.094247  501766.5      0
AL645608.3 -15.76752 8.087299  591389.5      0
AL645608.1 -15.83204 8.083359 1661128.7      0

and found to be all the genes are differentially regulated, no FDR and all p value is zero. So I explored the contrast for the testing and that gave me much more reasonable and expected results (but the FC is too low!)

my.contrasts <- makeContrasts(KO_vs_WT = groupKO-groupWT, levels=design)
qlf.contrast <- glmQLFTest(fit, contrast=my.contrasts[,"KO_vs_WT"])
head(qlf.contrast$table)
                 logFC   logCPM          F    PValue
AL627309.1 -0.02402289 8.083615 36.0058677 0.3631278
AL669831.5 -0.16083392 8.111567  6.8890345 0.1892858
LINC00115   0.02784262 8.087425  2.5886375 0.6827443
FAM41C     -0.02251330 8.094247  0.4922016 0.8021113
AL645608.3 -0.02544709 8.087299  2.4978086 0.6936892
AL645608.1  0.03523809 8.083359 48.3970011 0.3426162

With this back ground my questions are

Which one of this method is correct? If none of them, how to perform the differential testing?.
Why the FDR is not present in any one of the model. Am I missing something?
Can I include mitochondria read percentage as a covariant and is it require to include UMI count for the model?

Any help is highly appreciated!.

edgeR Differential Expression Single-Cell • 11k views

ADD COMMENT • link updated 3.6 years ago by bioinformatics ▴ 60 • written 5.7 years ago by kvshamsudheen ▴ 120

0

Entering edit mode

Can you post a snippet of what your design looks like? and also my.contrasts. I think without knowing what your design/test is, it's hard to get an understanding.

For starts, the glmQLFTest requires you to specify either a coef or a contrast.

eg: glmQLFTest(fit, contrast=my.contrasts) (there's no need to subset the my.contrasts object)

I think this is the reason why no fdr/pvalue is given. edgeR is unsure what to test.

Final point, can you try topTags function to view the results instead of head? Not sure if it will make a difference but this is the method I always use (and is noted in the manual)

ADD REPLY • link 5.7 years ago by Mark ★ 1.7k

0

Entering edit mode

Thanks. Yes, the topTags gives the FDR, but I am not sure why I am getting too low fold change

head(design)
                                          groupKO             groupWT
    AAACCCAAGCGTCTCG                      1                   0
    AAACCCACAAGTTCCA                      1                   0
    AAACCCACAAGTTCGT                      1                   0
    AAACCCACAGCACGAA                      1                   0
    AAACGAAAGACTGTTC                      1                   0
    AAACGAAAGGTCACTT                      1                   0
my.contrasts
                  Contrasts
Levels            KO_vs_WT
  groupWT          -1
  groupKO           1

ADD REPLY • link 5.7 years ago by kvshamsudheen ▴ 120

1

Entering edit mode

Thank you. Your design and contrasts are fine and you should perform the glmQLFTest as described above. Regarding your question about FC, not 100% sure sorry it's outside of my expertise. You should read these to get a better understanding: https://support.bioconductor.org/p/108762/ and https://www.rdocumentation.org/packages/edgeR/versions/3.14.0/topics/predFC

I take the logFC values produced by edgeR at face value, that is if you average the counts and take the difference between groups (and factor in lib differences). Of course as noted in by Prof Smyth, it's much more complicated than simply adding and dividing. In other words, you have small logFC values because the expression differences between those genes is small. Don't get fixated on the logFC values of genes. You should consider the overall "picture" of the expression profile between your groups. A small fold change can still have a remarkable effect on the biological system you're studying.

ADD REPLY • link 5.7 years ago by Mark ★ 1.7k

score 5 · Answer 1 · 2019-11-15

5

Entering edit mode

5.7 years ago

kvshamsudheen ▴ 120

For those who are still looking for the right answer, can have a look at these scripts used for the Soneson and Robinson 2018, Nat. Meth: PMID: 29481549 paper for comparing different DE methods for single cell RNA data set. Both of them are a great start for the same. Cheers!.

ADD COMMENT • link 5.7 years ago by kvshamsudheen ▴ 120

0

Entering edit mode

Hello,

Helpful post. I have a question hopefully you can help. I am comparing differentially expressed genes (DEGs) between WT and KO. When I want to compare DEGs between conditions specific to particular cluster I use Findcluster function of Seurat tool.

Now I wanted to compare overall DEGs between WT and KO and NOT specific to any cluster. Could you let me know how can I accomplish in Seurat or using edgeR. Thank you,

ADD REPLY • link 5.1 years ago by singcell ▴ 10

0

Entering edit mode

Hi, Did you happen to have a workflow for single cell RNA seq differential expression analysis using EdgeR. Thankyou

ADD REPLY • link 3.6 years ago by bioinformatics ▴ 60

score 4 · Answer 2 · 2020-09-19

There was a recent preprint introducing a new tool https://github.com/const-ae/glmGamPoi making the dispersion estimation and model fitting to single-cell data notable faster compared to the current implementations of edgeR / DESeq2. It also implements a quasi-likelihood ratio test very similar to what edgeR does. Might be worth considering as a replacement for edgeR/DESeq2 when working on single-cell since these tools are quite slow due to the large number of cells that have to be considered.