Hello everyone,
I am trying to do an enrichment analysis of Arabidopsis data, however I am still wondering how to build it or what to use as a background (universe), could you guide me? I am working with this example.
diff_genes <- read_delim(file = "differential_genes.tsv", delim = "\t")
# library("biomartr") (if not loaded already)
biomartr::organismBM(organism = "Arabidopsis thaliana")
arabido_attributes =
biomartr::organismAttributes("Arabidopsis thaliana") %>%
filter(dataset == "athaliana_eg_gene")
arabido_attributes
attributes_to_retrieve = c("tair_symbol", "entrezgene_id")
result_BM <- biomartr::biomart( genes = diff_genes$genes, # genes were retrieved using biomartr::getGenome()
mart = "plants_mart", # marts were selected with biomartr::getMarts()
dataset = "athaliana_eg_gene", # datasets were selected with biomartr::getDatasets()
attributes = attributes_to_retrieve, # attributes were selected with biomartr::getAttributes()
filters = "ensembl_gene_id" )# query key
head(result_BM)
# building the universe!
all_arabidopsis_genes <- read.delim("counts.txt", header = T, stringsAsFactors = F)[,1] # directly selects the gene column
# we want the correspondence of TAIR/Ensembl symbols with NCBI Entrez gene ids
attributes_to_retrieve = c("tair_symbol", "uniprotswissprot","entrezgene_id")
# Query the Ensembl API
all_arabidopsis_genes_annotated <- biomartr::biomart(genes = all_arabidopsis_genes,
mart = "plants_mart",
dataset = "athaliana_eg_gene",
attributes = attributes_to_retrieve,
filters = "ensembl_gene_id" )
# for compatibility with enrichGO universe
# genes in the universe need to be characters and not integers (Entrez gene id)
all_arabidopsis_genes_annotated$entrezgene_id = as.character(
all_arabidopsis_genes_annotated$entrezgene_id)
Agreed! And OP should review the following as well: Urgent need for consistent standards in functional enrichment analysis
Thank you very much for your valuable answer, so could I use the genes from my counts matrix after filtering?
library(DESeq2)
Setting the Path to the FeatureCounts Count Matrix
setwd("C:/project_2023/quantification_featureCounts")
Viewing Files in the Directory
list.files()
Reading the Count Matrix
countData <- read.delim("./matriz_arabidopsis_2023.txt", header = TRUE, row.names = 1) head(countData)
Description of Samples
condition <- factor(c("Control", "Control", "Control", "Treatment", "Treatment", "Treatment")) colData <- data.frame(row.names = colnames(countData), condition) head(colData)
Creating a DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ condition) dds
Filtering Genes with Very Low Expression (less than 10 reads)
dds <- dds[rowSums(counts(dds)) > 10, ]
dds
Performing Standard Differential Expression Analysis
dds <- DESeq(dds) dds
Extracting Results
res <- results(dds) res
Summary of Differential Gene Expression
summary(res)
Sorting the Summary List by Adjusted p-value
res <- res[order(res$padj), ] head(res)
Yes exactly
Your universe would be: ALL_GENE_IDS <- row.names(res)