Question

ORA with clusterProfiler

0

Entering edit mode

14 months ago

san96 ▴ 170

Hello everyone,

I am trying to do an enrichment analysis of Arabidopsis data, however I am still wondering how to build it or what to use as a background (universe), could you guide me? I am working with this example.

diff_genes <- read_delim(file = "differential_genes.tsv", delim = "\t")

# library("biomartr") (if not loaded already)
biomartr::organismBM(organism = "Arabidopsis thaliana")

arabido_attributes = 
  biomartr::organismAttributes("Arabidopsis thaliana") %>% 
  filter(dataset == "athaliana_eg_gene")
arabido_attributes

attributes_to_retrieve = c("tair_symbol", "entrezgene_id")

result_BM <- biomartr::biomart( genes      = diff_genes$genes,                  # genes were retrieved using biomartr::getGenome()
                                mart       = "plants_mart",                     # marts were selected with biomartr::getMarts()
                                dataset    = "athaliana_eg_gene",               # datasets were selected with biomartr::getDatasets()
                                attributes = attributes_to_retrieve,            # attributes were selected with biomartr::getAttributes()
                                filters =   "ensembl_gene_id" )# query key
head(result_BM)  




# building the universe!
all_arabidopsis_genes <- read.delim("counts.txt", header = T, stringsAsFactors = F)[,1] # directly selects the gene column

# we want the correspondence of TAIR/Ensembl symbols with NCBI Entrez gene ids
attributes_to_retrieve = c("tair_symbol", "uniprotswissprot","entrezgene_id")

# Query the Ensembl API
all_arabidopsis_genes_annotated <- biomartr::biomart(genes = all_arabidopsis_genes,
                                                     mart       = "plants_mart",                 
                                                     dataset    = "athaliana_eg_gene",           
                                                     attributes = attributes_to_retrieve,        
                                                     filters =  "ensembl_gene_id" )  

# for compatibility with enrichGO universe
# genes in the universe need to be characters and not integers (Entrez gene id)
all_arabidopsis_genes_annotated$entrezgene_id = as.character(
  all_arabidopsis_genes_annotated$entrezgene_id)

clusterProfiler ORA • 1.7k views

ADD COMMENT • link updated 4 weeks ago by Pegasus ▴ 130 • written 14 months ago by san96 ▴ 170

Ram · Answer 1 · 2023-12-12

2

Entering edit mode

14 months ago

ultraanfibio ▴ 40

You should use as background all GENES that were incuded in the differential expression analysis.

Go to your DGEA results, all genes that pass the predefined cutoffs are considered differentially expressed (DEGs), all genes that were compared are your universe.

Then run the overrepresentation analysis with:

results_ora <- enricher(gene = DEG_GENES, universe = ALL_GENES, TERM2GENE = GENESETS_OF_INTEREST)
results_enrichr = results_ora@result

If you want to convert between gene IDs in the results (e.g., entrez to Gene Symbol) you may want to use setReadable() from DOSE package

ADD COMMENT • link updated 14 months ago by Ram 44k • written 14 months ago by ultraanfibio ▴ 40

1

Entering edit mode

You should use as background all GENES that were incuded in the differential expression analysis.

Agreed! And OP should review the following as well: Urgent need for consistent standards in functional enrichment analysis

ADD REPLY • link 14 months ago by jv ★ 1.8k

0

Entering edit mode

Thank you very much for your valuable answer, so could I use the genes from my counts matrix after filtering?

    # Loading the Library for Differential Expression Analysis
library(DESeq2)


# Setting the Path to the FeatureCounts Count Matrix
setwd("C:/project_2023/quantification_featureCounts")

# Viewing Files in the Directory
list.files()

# Reading the Count Matrix
countData <- read.delim("./matriz_arabidopsis_2023.txt", header = TRUE, row.names = 1)
head(countData)

# Description of Samples
condition <- factor(c("Control", "Control", "Control", "Treatment", "Treatment", "Treatment"))
colData <- data.frame(row.names = colnames(countData), condition)
head(colData)

# Creating a DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ condition)
dds

# Filtering Genes with Very Low Expression (less than 10 reads)
dds <- dds[rowSums(counts(dds)) > 10, ]

dds

# Performing Standard Differential Expression Analysis
dds <- DESeq(dds)
dds

# Extracting Results
res <- results(dds)
res

# Summary of Differential Gene Expression
summary(res)

# Sorting the Summary List by Adjusted p-value
res <- res[order(res$padj), ]
head(res)z

ADD REPLY • link 6 weeks ago by san96 ▴ 170

1

Entering edit mode

Yes exactly

Your universe would be: ALL_GENE_IDS <- row.names(res)

ADD REPLY • link 14 months ago by ultraanfibio ▴ 40

0

Entering edit mode

Hi both @ultraanfibio and @sansan96

It's been 1 year for this post, but just to confirm, at the end, did you choose the background as the genes_list

1- before this filtering (dds <- dds[rowSums(counts(dds)) > 10, ])

2- Or those ones passed the fiteration?

cheers,

ADD REPLY • link 6 weeks ago by Pegasus ▴ 130

1

Entering edit mode

Hello, in my case I used my count matrix (raw).

Looks like:

> universe_arabidopsis <- read.delim("matriz_arabidopsis_2023.txt", header = TRUE, stringsAsFactors = FALSE)[,1]
#Retrieve the entrezgene_id and description for the universe
>annot_universe <- getBM(
values = universe_arabidopsis,
mart = arabidopsis_mart,
attributes = c(‘ensembl_gene_id’, ‘entrezgene_id’,
‘description’),filters = ‘ensembl_gene_id’)

ADD REPLY • link 6 weeks ago by san96 ▴ 170

0

Entering edit mode

Thank you for your reply!

The main goal of my question is to clarify a common point of discussion: some suggest using only the list of genes that pass the filtering criteria—such as genes with CPM > 1. This approach is often justified by the argument that "only genes expressed under both conditions" should be considered relevant for the background.

In contrast, using all genes in the count matrix as the background (i.e., without applying any filtering), as you mentioned, includes genes regardless of their expression levels.

Additionally, I have another question I’d appreciate your insights on: Did you use all DEGs in your ClusterProfiler ORA in one shot, or did you input the upregulated and downregulated DEGs separately?

Thanks again!

ADD REPLY • link 6 weeks ago by Pegasus ▴ 130

1

Entering edit mode

I agree with your comments. For my ClusterProfiler analysis I used my up and down list independently after my analysis with DESeq2.

ADD REPLY • link 6 weeks ago by san96 ▴ 170

0

Entering edit mode

Great, thanks :)

ADD REPLY • link 5 weeks ago by Pegasus ▴ 130

0

Entering edit mode

Hi san96,

Sorry for requesting another question, but what is the criteria that should be followed in case the number of DEGs is very low : upregulated are only 40 :

# Perform KEGG Overrepresentation Analysis (ORA) using clusterProfiler enrichment_kegg <- enrichKEGG( interesting_set_kegg, organism = "ko", # KEGG organisms use "ko" for KEGG Orthology keyType = "kegg", # KEGG orthologs are referenced by "kegg" pvalueCutoff = 0.05, pAdjustMethod = "BH", # Use Benjamini-Hochberg correction universe = background_kegg, # Background genes for the ORA minGSSize = 10, # Minimum gene set size for analysis maxGSSize = 500, # Maximum gene set size for analysis qvalueCutoff = 0.05, use_internal_data = FALSE

ADD REPLY • link 5 weeks ago by Pegasus ▴ 130

1

Entering edit mode

Hi, I'm not an expert but I'll try to help you. First, what are your criteria for filtering the DEGs? I think you can be more flexible with your filtering if you want to get more DEGs for your next analysis. Another reason for your low number of DEGs is that there might not be many differences between the contrasts you are doing, something more biological. Respect your options, I usually use the default options (pvalueCutoff = 0.05, etc). If there is anything else I can help with, I will be here.

ADD REPLY • link 5 weeks ago by san96 ▴ 170

0

Entering edit mode

Hi san96,

Thanks for your response.

Actuly my question is not about why I got low DEGs number, but instead, what shoud I choose for both minGSSizeand maxGSSize parameters in the clusterprofiler ORA kegg script, with low number of DEGs?

Keeping the default (minGSSize = 10, and maxGSSize** = 500), led to non-ogic results:

for example: in one of the enriched Ko terms in the table below, the geneRatio is very low, but the adjusted p value is significant!! In such case, should I disregard this enriched Ko pathway due to the low GeneRatio, or I should count it and consider it as significant?

Thanks again for your input!

enter image description here

ADD REPLY • link 4 weeks ago by Pegasus ▴ 130