Hello everyone,
I have been tasked to carry out a GSEA analysis of a predefined set of genes against a background of the output of differential gene expression generated by Seurat FindMarkers DESeq2
.
I have tried exploring clusterProfiler
to achieve this but I am struggling to find an appropriate workflow. I am also getting some errors and after days of trying to debug, I am at the end of my rope. Could someone please help me with this. I have attached my code below and could really use someone's help.
# 1. Select only UP genes and do DEG -----
ex_up <- ex %>%
dplyr::select(Hu.ESG, Direction, ESG_ATAC_UP) %>%
filter(Direction == "UP" & ESG_ATAC_UP == 1 & Hu.ESG != "NA") # Remove NAs
## Handle duplicates -----
which(duplicated(ex_up$Hu.ESG))
sum(duplicated(ex_up$Hu.ESG))
ex_up <- ex_up[!duplicated(ex_up$Hu.ESG), ]
# 2. Make pseudobulk and do DEG -----
sobj_pseudo <- AggregateExpression(sobj, assays = "RNA", return.seurat = T,
group.by = c("Tissue", "sampleType"))
sobj_pseudo <- NormalizeData(sobj_pseudo)
sobj_pseudo <- ScaleData(sobj_pseudo, rownames(sobj_pseudo))
sobj_pseudo <- SetIdent(sobj_pseudo, value = "orig.ident")
# Round to integer values
sobj_pseudo@assays$RNA@counts <- round(sobj_pseudo@assays$RNA@counts)
# Compute DEG between healthy and cancer
deg <- FindMarkers(sobj_pseudo,
ident.1 = "tumor",
ident.2 = "adjacent normal lung tissue",
test.use = "DESeq2") %>%
tibble::rownames_to_column(var = "gene")
deg_up <- deg %>%
arrange(desc(avg_log2FC)) %>%
filter(avg_log2FC > 0 & p_val < 0.05)
# 3. Perform GO analysis for the ex gene sets -----
# Vector of ex UP genes
ex_up_gene <- ex_up$Hu.ESG
# Convert UP ex genes to Entrez IDs
ex_up_gene <- bitr(ex_up_gene, fromType = "SYMBOL",
toType = "ENTREZID", OrgDb = org.Hs.eg.db)
## GO analysis -----
ex_up_GO <- enrichGO(gene = ex_up_gene$ENTREZID,
OrgDb = org.Hs.eg.db,
keyType = "ENTREZID",
ont = "BP", # Biological Process
pvalueCutoff = 0.05,
readable = T)
## Reduce redundancy of enriched GO terms -----
ex_up_GO = simplify(ex_up_GO)
nrow(ex_up_GO)
# Filter output of GO analysis -----
ex_up_GO <- gofilter(ex_up_GO, level = 4)
# 4. GSEA between DEG output and ex gene set -----
## Create TERM2GENE data frame (ex geneset) -----
ex_up_go <- ex_up_GO@result %>%
as_tibble()
ex_term2gene <- ex_up_go %>%
dplyr::select(ID, geneID) %>%
tidyr::separate_rows(geneID, sep = "/") %>%
dplyr::rename(Term = ID, Gene = geneID)
# Use bitr to convert gene symbols to ENTREZ IDs
entrez_up <- bitr(ex_term2gene$Gene,
fromType = "SYMBOL",
toType = "ENTREZID",
OrgDb = org.Hs.eg.db)
# Join the original data with the conversion table to replace gene symbols with ENTREZ IDs
ex_term2gene <- ex_term2gene %>%
left_join(entrez_up, by = c("Gene" = "SYMBOL")) %>%
dplyr::select(Term, ENTREZID)
colnames(ex_term2gene)[2] <- "Gene"
## Prepare output of DEG -----
# Create a named vector of gene log2FC values for GSEA
deg_up_df<- deg_up %>%
arrange(desc(avg_log2FC)) %>%
pull(avg_log2FC)
names(deg_up_df) <- deg_up$gene
# Extract the gene symbols from the named vector
ex_symbols_up <- names(deg_up_df)
# Use bitr to convert gene symbols to ENTREZ IDs
deg_entrez_up <- bitr(ex_symbols_up,
fromType = "SYMBOL",
toType = "ENTREZID",
OrgDb = org.Hs.eg.db)
# Convert deg_up_vector to a data frame
deg_up_df <- data.frame(SYMBOL = names(deg_up_df), VALUE = deg_up_df)
# Merge the conversion result with the original values using left_join
deg_entrez_up <- left_join(deg_entrez_up, deg_up_df, by = "SYMBOL")
# Create a named vector with ENTREZ IDs
gene_list_up_entrez <- setNames(deg_entrez_up$VALUE, deg_entrez_up$ENTREZID)
## Perform GSEA -----
gsea_up <- GSEA(
geneList = gene_list_up_entrez, # background: named vector, output of deg
TERM2GENE = ex_term2gene, # genelist to test
minGSSize = 300,
maxGSSize = 1000,
pvalueCutoff = 1,
verbose = T,
scoreType = "pos"
)
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
preparing geneSet collections...
GSEA analysis...
no term enriched under specific pvalueCutoff...
Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (2.18% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
Even after setting pvalueCutoff
to 1 there are no enriched genesets. I urgently need help with this. If required, I can also share the term2gene and output of DEG here. Thank you!
Hi, this absolutely works! Thank you so much for taking the time to help me with this problem, you really saved the day. I've been scratching my head for the past few weeks about how I can fix this. I will be more careful about setting a correct value for minGSSize in future!