Question

Difference in number of DEGs from Deseq2 and limma-voom

0

Entering edit mode

2.4 years ago

dzisis1986 ▴ 70

Hello,

I have RNA-seq data from two different treatment groups (F and NF ) at 2 different time points (T1 and T2). The mapping was done with STAR aligner and the quantification was done with FeatureCounts. I run differential expression with Deseq2 and limma-voom and the number of DEGs is very different between the two methods. In the comparison between F : T2 vs T1 i have 40 DEGs with Deseq2 whereas I get 3302 DEGs with limma by using the same covariates.

The same issue is encountered also in comparisson NF: T2 vs T1 where Deseq2 gives 30 DEGs whereas limma 3844. From a biological prespective we anticipate more DEGs in the comparison of F : T2 vs T1. So Deseq2 seems to follow this trend but with very few DEGs whereas limma returns much more hits in NF: T2 vs T1 comparison.

Do you think that this huge difference in the number of the DEGs from each tool is explained (different apllied algorithms/methods) or something may be wrong with the input data?

Thank you

RNAseq Deseq2 limma • 1.6k views

ADD COMMENT • link 2.4 years ago by dzisis1986 ▴ 70

0

Entering edit mode

Without code it is hard to comment. Two methods will never agree 100%, but this difference suggests that analysis code in at least one or both is flawed or at least not done in a way that a like-for-like comparison can be done. Please add relevant code.

ADD REPLY • link 2.4 years ago by ATpoint 85k

0

Entering edit mode

I am using as input counts and the same covariates. I will try to make the code better and paste it. thank you

ADD REPLY • link 2.4 years ago by dzisis1986 ▴ 70

0

Entering edit mode

I am using a count matrix made after mapping with STAR and quantification with featureCounts. I also filtered the lowly expressed genes based on a threshold of FPKM (<1 in the 20% of the sample) that's why the count matrix has this name.
My count matrix looks like this example:

                001_T1 001_T2 002_T1 002_T2 003_T1 003_T2 004_T1 004_T2 005_T1 005_T2 006_T1 006_T2 009_T1 009_T2 010_T1 010_T2 011_T1
ENSG00000186827     59     68    214    248    175    250    222    269    119    143    236    268    125    144    129    117    159
ENSG00000186891     50     61    136    130    140    185    121    111    128    164    182    120    127    111    183    156    220

My covariates table looks like this example :

 sampleName Individ age sex fasting timePoint medianIsize    GC   BMI BMI_category extraction_batch_ID extraction_date plate RIN
001_T1   ID001  54   F       F        T1       427.0 57.74 34.63            4                   1        19/02/19     1 9.2
001_T2   ID001  54   F       F        T2       442.0 59.10 32.67            4                 101        31/07/19     9 8.4
002_T1   ID002  65   F       F        T1       411.0 57.08 31.13            4                   1        19/02/19     1 6.2
002_T2   ID002  65   F       F        T2       434.0 57.20 31.80            4                 101        31/07/19     9 8.9
003_T1   ID003  64   F       F        T1       430.0 58.14 31.24            4                   2        19/02/19     1   7
003_T2   ID003  64   F       F        T2       423.0 58.17 29.47            3                 100        31/07/19     8 7.5
007_T1   ID007  34   M      NF        T1       434.0 56.94 27.64            3                   4        20/02/19     1 8.0
007_T2   ID007  34   M      NF        T2       415.0 56.80 28.21            3                  60        18/04/19     6 8.1

The code for each method using the same count matrix and the same covariates:

Limma:

library(edgeR)
library(limma)
library(tibble)
library(dplyr)
library("BiocParallel")
register(MulticoreParam(12))
#Load Counts
counts_fastbio <- read.table(file = "counts_fastbio_Filt_fpkm_1_20_matrix_all.txt", sep = '\t', header = TRUE, row.names="", check.names = FALSE)
#Load info
info_fastbio <- read.csv("metadata_all_extras.txt", header=TRUE, sep=",",comment.char="", stringsAsFactors=FALSE, check.names = FALSE)
paste("My final table has",nrow(counts_fastbio),"genes and", ncol(info_fastbio), "covariates")
#Have Sanger sampleName as row names for the limma package after
info_fastbio <- info_fastbio %>% column_to_rownames("sampleName")
#subgroup to F NF
info_fastbio <- info_fastbio[which(info_fastbio$fasting == "F"),]
counts_fastbio <- counts_fastbio[,which(colnames(counts_fastbio)%in% row.names(info_fastbio))]

#Check if all samples are present in the info
length(which(rownames(info_fastbio) %in% colnames(counts_fastbio)))

#Make sure the variables of interest are factors
info_fastbio$Individ <- as.factor(info_fastbio$Individ)
info_fastbio$timePoint <- as.factor(info_fastbio$timePoint)
info_fastbio$fasting <- as.factor(info_fastbio$fasting)
info_fastbio$sex <- as.factor(info_fastbio$sex)

#Create a design matrix including the surrongate variables
design <- model.matrix(~info_fastbio$Individ + info_fastbio$timePoint + info_fastbio$medianIsize + info_fastbio$GC)

#Save in a DGEList and calculate normalisation factors for library size
counts_fastbio <- DGEList(counts_fastbio)
counts_fastbio  <- calcNormFactors(counts_fastbio)
#Apply voom transformation
counts_voom <- voom(counts_fastbio, design = design, plot = TRUE)
#Fit limma linear model
counts_voom_lmfit <- lmFit(counts_voom, design, parallel=T)
# compute moderated t-statistics, moderated F-statistic, and log-odds of differential expression by empirical Bayes moderation of the standard errors towards a common value.
counts_voom_efit <- eBayes(counts_voom_lmfit)
#Extract a table of the top-ranked genes from a linear model fit.
design_table <- topTable(counts_voom_efit, coef="timePointT2",adjust.method="fdr", confint=TRUE, n=Inf)
design_table <- cbind(rownames(design_table), design_table)
colnames(design_table)[1] <- "GENEID"
plotSA(counts_voom_efit)
summary(decideTests(counts_voom_efit))

Deseq2:

library("DESeq2")
library(ggplot2)
library("BiocParallel")
register(MulticoreParam(12))
cts_dataq <- read.table(file = "counts_fastbio_Filt_fpkm_1_20_matrix_all.txt", sep = '\t', header = TRUE, row.names="", check.names = FALSE)
head(cts_dataq)
dim(cts_dataq)
coldata <- read.table('metadata_all_extras.txt', header = TRUE, sep=",")
dim(coldata)
rownames(coldata)=coldata$sampleName
#Select F or NF for metadata and counts 
coldata <- coldata[which(coldata$fasting == "F"),]
cts_dataq <- cts_dataq[,which(colnames(cts_dataq)%in% row.names(coldata))]
dim(cts_dataq)

#take integers from the data matrix
cts <- trunc(cts_dataq, 2)

#coldata <- DataFrame(coldata)
coldata$sampleName <- factor(coldata$sampleName)
coldata$sex <- factor(coldata$sex)
coldata$fasting <- factor(coldata$fasting)
coldata$timePoint <- factor(coldata$timePoint)
coldata$Individ <- factor(coldata$Individ)
coldata$BMI_category <- factor(coldata$BMI_category)
coldata$age <- factor(coldata$age)
coldata
head(cts,2)

# make the columns of the count matrix and the rows of the column data (information about samples) in the same order.
rownames(coldata) <- sub("fb", "", rownames(coldata))
all(rownames(coldata) %in% colnames(cts))
all(rownames(coldata) == colnames(cts))
cts <- cts[, rownames(coldata)]
all(rownames(coldata) == colnames(cts))
df_na <- cts[complete.cases(cts), ]
coldata[complete.cases(coldata), ]
#Create the dds object for differential expression 
dds2 <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ Individ + medianIsize + GC + timePoint)
dds2 <- DESeq(dds2, parallel=T)

The results for the paired analysis F : T2 Vs T1 for Deseq2 with padjustes < 0.05 are: 26 DEGs and for limma with padjustes < 0.05 : 3041 (1722 UP and 1319 DOWN).

How do you explain this huge difference? I tried also to use contrast or different design order and each time I have different results in limma.

ADD REPLY • link 2.4 years ago by dzisis1986 ▴ 70