Hello everyone,
I am trying to do an enrichment analysis of Arabidopsis data, however I am still wondering how to build it or what to use as a background (universe), could you guide me? I am working with this example.
diff_genes <- read_delim(file = "differential_genes.tsv", delim = "\t")
# library("biomartr") (if not loaded already)
biomartr::organismBM(organism = "Arabidopsis thaliana")
arabido_attributes =
biomartr::organismAttributes("Arabidopsis thaliana") %>%
filter(dataset == "athaliana_eg_gene")
arabido_attributes
attributes_to_retrieve = c("tair_symbol", "entrezgene_id")
result_BM <- biomartr::biomart( genes = diff_genes$genes, # genes were retrieved using biomartr::getGenome()
mart = "plants_mart", # marts were selected with biomartr::getMarts()
dataset = "athaliana_eg_gene", # datasets were selected with biomartr::getDatasets()
attributes = attributes_to_retrieve, # attributes were selected with biomartr::getAttributes()
filters = "ensembl_gene_id" )# query key
head(result_BM)
# building the universe!
all_arabidopsis_genes <- read.delim("counts.txt", header = T, stringsAsFactors = F)[,1] # directly selects the gene column
# we want the correspondence of TAIR/Ensembl symbols with NCBI Entrez gene ids
attributes_to_retrieve = c("tair_symbol", "uniprotswissprot","entrezgene_id")
# Query the Ensembl API
all_arabidopsis_genes_annotated <- biomartr::biomart(genes = all_arabidopsis_genes,
mart = "plants_mart",
dataset = "athaliana_eg_gene",
attributes = attributes_to_retrieve,
filters = "ensembl_gene_id" )
# for compatibility with enrichGO universe
# genes in the universe need to be characters and not integers (Entrez gene id)
all_arabidopsis_genes_annotated$entrezgene_id = as.character(
all_arabidopsis_genes_annotated$entrezgene_id)
Agreed! And OP should review the following as well: Urgent need for consistent standards in functional enrichment analysis
Thank you very much for your valuable answer, so could I use the genes from my counts matrix after filtering?
Yes exactly
Your universe would be: ALL_GENE_IDS <- row.names(res)
Hi both @ultraanfibio and @sansan96
It's been 1 year for this post, but just to confirm, at the end, did you choose the background as the genes_list
1- before this filtering (dds <- dds[rowSums(counts(dds)) > 10, ])
2- Or those ones passed the fiteration?
cheers,
Hello, in my case I used my count matrix (raw).
Looks like:
Thank you for your reply!
The main goal of my question is to clarify a common point of discussion: some suggest using only the list of genes that pass the filtering criteria—such as genes with CPM > 1. This approach is often justified by the argument that "only genes expressed under both conditions" should be considered relevant for the background.
In contrast, using all genes in the count matrix as the background (i.e., without applying any filtering), as you mentioned, includes genes regardless of their expression levels.
Additionally, I have another question I’d appreciate your insights on: Did you use all DEGs in your ClusterProfiler ORA in one shot, or did you input the upregulated and downregulated DEGs separately?
Thanks again!
I agree with your comments. For my ClusterProfiler analysis I used my up and down list independently after my analysis with DESeq2.
Great, thanks :)
Hi san96,
Sorry for requesting another question, but what is the criteria that should be followed in case the number of DEGs is very low : upregulated are only 40 :
# Perform KEGG Overrepresentation Analysis (ORA) using clusterProfiler enrichment_kegg <- enrichKEGG( interesting_set_kegg, organism = "ko", # KEGG organisms use "ko" for KEGG Orthology keyType = "kegg", # KEGG orthologs are referenced by "kegg" pvalueCutoff = 0.05, pAdjustMethod = "BH", # Use Benjamini-Hochberg correction universe = background_kegg, # Background genes for the ORA minGSSize = 10, # Minimum gene set size for analysis maxGSSize = 500, # Maximum gene set size for analysis qvalueCutoff = 0.05, use_internal_data = FALSE