Hello everyone,
I'm working with mRNA expression data from cBioPortal: "RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)," and I am trying to identify differentially expressed genes (DEGs). I would like to know whether it is better to use limma or DESeq2 for this type of data.
Here is the code I've been using with limma:
# Load packages
library(ggplot2)
library(tidyverse)
library(limma)
# Load data
skcm_tcga <- read.delim("data_mrna_seq_v2_rsem.txt")
# skcm_tcga_pan_can_atlas_2018 ####
# cancer_study_identifier: skcm_tcga_pan_can_atlas_2018
#stable_id: rna_seq_v2_mrna
#genetic_alteration_type: MRNA_EXPRESSION
#datatype: CONTINUOUS
#show_profile_in_analysis_tab: FALSE
#profile_name: mRNA Expression, RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)
#profile_description:mRNA Expression, RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)
# data_filename: data_mrna_seq_v2_rsem.txt2)
# Filter rows where Hugo_Symbol is not empty and remove duplicates
skcm_tcga <- skcm_tcga %>%
filter(Hugo_Symbol != "") %>%
select(-Entrez_Gene_Id) %>%
distinct(Hugo_Symbol, .keep_all = TRUE) %>%
column_to_rownames(var = "Hugo_Symbol")
# Apply log2 transformation
skcm_tcga_log2 <- skcm_tcga %>%
mutate(across(everything(), ~ log2(. + 1)))
# Load clinical data
sample_annotation <- read.delim("sample_annotation.txt", comment.char="#")
# Create a factor for the groups
group_factor <- factor(sample_annotation$Age_Group)
# Create the design matrix
design_matrix <- model.matrix(~0 + group_factor)
colnames(design_matrix) <- levels(group_factor)
# Fit the linear model using lmFit
fit <- lmFit(skcm_tcga_log2, design_matrix)
# Create a contrast matrix
contrast_matrix <- makeContrasts(
elderly_vs_adult = elderly - adult,
levels = design_matrix
)
# Apply contrasts to the model
fit2 <- contrasts.fit(fit, contrast_matrix)
# Compute statistics using eBayes
fit2 <- eBayes(fit2)
# Get and save results
res <- topTable(fit2, coef="elderly_vs_adult", number=Inf, adjust.method="BH")
res2 <- rownames_to_column(res, var = "gene_symbol")
write.table(res2, file="results_idoso_vs_adulto.txt", sep="\t", quote=FALSE)
My questions are:
For the "mRNA Expression, RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)" data, is limma the best approach to identify DEGs, or would DESeq2 be more suitable?
Should I consider adding any filters or other implementations to my code to improve the accuracy of the DEG detection? For example, should I apply an expression filter to remove low-expressed genes before running the analysis, or are there any other recommended preprocessing steps?
Any advice would be greatly appreciated. Thanks in advance!
Dear @Istvan Albert
Based on this code I found, I need to do a part of normalization and also a part of filtering to use this data in limma, right?
REF.: https://support.bioconductor.org/p/91054/
Fundamentally these types of filterings come down to the effect they have on the data.
Does your filtering change the data, does it change it radically? Is it sensitive to wether you are doing 10 or 14 samples.
Ideally the choice of parameters is not that substantial - you would get similar results no matter what type of filtering you choose as long as you remove the majority of empty rows. In which case you might just use what others recommend (just make sure the 14 actually makes sense in your case).