Question

Choosing the Best Method for DEG Analysis: limma vs. DESeq2 for RSEM Normalized RNA-Seq Data

0

Entering edit mode

7 months ago

Joe Kherery ▴ 140

Hello everyone,

I'm working with mRNA expression data from cBioPortal: "RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)," and I am trying to identify differentially expressed genes (DEGs). I would like to know whether it is better to use limma or DESeq2 for this type of data.

Here is the code I've been using with limma:

# Load packages
library(ggplot2)
library(tidyverse)
library(limma)

# Load data

skcm_tcga <- read.delim("data_mrna_seq_v2_rsem.txt")

# skcm_tcga_pan_can_atlas_2018 ####
# cancer_study_identifier: skcm_tcga_pan_can_atlas_2018
#stable_id: rna_seq_v2_mrna
#genetic_alteration_type: MRNA_EXPRESSION
#datatype: CONTINUOUS
#show_profile_in_analysis_tab: FALSE
#profile_name: mRNA Expression, RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)
#profile_description:mRNA Expression, RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)
# data_filename: data_mrna_seq_v2_rsem.txt2)



# Filter rows where Hugo_Symbol is not empty and remove duplicates
skcm_tcga <- skcm_tcga %>%
  filter(Hugo_Symbol != "") %>%
  select(-Entrez_Gene_Id) %>%
  distinct(Hugo_Symbol, .keep_all = TRUE) %>%
  column_to_rownames(var = "Hugo_Symbol")

# Apply log2 transformation
skcm_tcga_log2 <- skcm_tcga %>%
  mutate(across(everything(), ~ log2(. + 1)))

# Load clinical data
sample_annotation <- read.delim("sample_annotation.txt", comment.char="#")

# Create a factor for the groups
group_factor <- factor(sample_annotation$Age_Group)

# Create the design matrix
design_matrix <- model.matrix(~0 + group_factor)
colnames(design_matrix) <- levels(group_factor)

# Fit the linear model using lmFit
fit <- lmFit(skcm_tcga_log2, design_matrix)

# Create a contrast matrix
contrast_matrix <- makeContrasts(
  elderly_vs_adult = elderly - adult,
  levels = design_matrix
)

# Apply contrasts to the model
fit2 <- contrasts.fit(fit, contrast_matrix)

# Compute statistics using eBayes
fit2 <- eBayes(fit2)

# Get and save results
res <- topTable(fit2, coef="elderly_vs_adult", number=Inf, adjust.method="BH")
res2 <- rownames_to_column(res, var = "gene_symbol")
write.table(res2, file="results_idoso_vs_adulto.txt", sep="\t", quote=FALSE)

My questions are:

For the "mRNA Expression, RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)" data, is limma the best approach to identify DEGs, or would DESeq2 be more suitable?
Should I consider adding any filters or other implementations to my code to improve the accuracy of the DEG detection? For example, should I apply an expression filter to remove low-expressed genes before running the analysis, or are there any other recommended preprocessing steps?

Any advice would be greatly appreciated. Thanks in advance!

dese2 cbioportal RNA-seq limma • 975 views

ADD COMMENT • link updated 7 months ago by Istvan Albert 102k • written 7 months ago by Joe Kherery ▴ 140

score 2 · Answer 1 · 2024-09-28

2

Entering edit mode

7 months ago

Istvan Albert 102k

DeSeq2 was designed for raw counts data, limma was designed for normalized data, that being said I think both would work, but maybe the latter is more appropriate.

People often add many other steps to filter their data, and usually, they do so when the first attempt does not quite produce what they expected to see. It is a delicate balance between exploratory analysis and p-hacking.

ADD COMMENT • link 7 months ago by Istvan Albert 102k

0

Entering edit mode

Dear @Istvan Albert

Based on this code I found, I need to do a part of normalization and also a part of filtering to use this data in limma, right?

y <- normalizeQuantiles(exprs(TCGA.RNASeqV2_eset))

keep <- rowSums(y > log2(11)) >= 14

REF.: https://support.bioconductor.org/p/91054/

> library(curatedCRCData)
> data(TCGA.RNASeqV2_eset)
> targets <- pData(TCGA.RNASeqV2_eset)
> y <- normalizeQuantiles(exprs(TCGA.RNASeqV2_eset))
> type <- factor(targets$sample_type)
> table(type)
type
adjacentnormal          tumor 
            14            181 
> keep <- rowSums(y > log2(11)) >= 14
> table(keep)
keep
FALSE  TRUE 
 4209 16293 
> y2 <- y[keep,]
> design <- model.matrix(~type)
> fit <- lmFit(y2,design)
> fit <- eBayes(fit,robust=TRUE,trend=TRUE)
> topTable(fit,coef=2)

ADD REPLY • link 7 months ago by Joe Kherery ▴ 140

0

Entering edit mode

Fundamentally these types of filterings come down to the effect they have on the data.

Does your filtering change the data, does it change it radically? Is it sensitive to wether you are doing 10 or 14 samples.

Ideally the choice of parameters is not that substantial - you would get similar results no matter what type of filtering you choose as long as you remove the majority of empty rows. In which case you might just use what others recommend (just make sure the 14 actually makes sense in your case).

ADD REPLY • link 7 months ago by Istvan Albert 102k