Errors with GSEA using clusterProfiler
2
0
Entering edit mode
5 months ago
bio_info ▴ 20

Hello everyone,

I have been tasked to carry out a GSEA analysis of a predefined set of genes against a background of the output of differential gene expression generated by Seurat FindMarkers DESeq2.

I have tried exploring clusterProfiler to achieve this but I am struggling to find an appropriate workflow. I am also getting some errors and after days of trying to debug, I am at the end of my rope. Could someone please help me with this. I have attached my code below and could really use someone's help.

# 1. Select only UP genes and do DEG -----
ex_up <- ex %>%
  dplyr::select(Hu.ESG, Direction, ESG_ATAC_UP) %>%
  filter(Direction == "UP" & ESG_ATAC_UP == 1 & Hu.ESG != "NA") # Remove NAs

## Handle duplicates -----
which(duplicated(ex_up$Hu.ESG))
sum(duplicated(ex_up$Hu.ESG))

ex_up <- ex_up[!duplicated(ex_up$Hu.ESG), ]

# 2. Make pseudobulk and do DEG -----
sobj_pseudo <- AggregateExpression(sobj, assays = "RNA", return.seurat = T,
                    group.by = c("Tissue", "sampleType"))

sobj_pseudo <- NormalizeData(sobj_pseudo)
sobj_pseudo <- ScaleData(sobj_pseudo, rownames(sobj_pseudo)) 

sobj_pseudo <- SetIdent(sobj_pseudo, value = "orig.ident")

# Round to integer values
sobj_pseudo@assays$RNA@counts <- round(sobj_pseudo@assays$RNA@counts)

# Compute DEG between healthy and cancer
deg <- FindMarkers(sobj_pseudo,
                   ident.1 = "tumor", 
                   ident.2 = "adjacent normal lung tissue", 
                   test.use = "DESeq2") %>% 
  tibble::rownames_to_column(var = "gene")

deg_up <- deg %>%
  arrange(desc(avg_log2FC)) %>% 
  filter(avg_log2FC > 0 & p_val < 0.05)

# 3. Perform GO analysis for the ex gene sets -----
# Vector of ex UP genes
ex_up_gene <- ex_up$Hu.ESG

# Convert UP ex genes to Entrez IDs
ex_up_gene <- bitr(ex_up_gene, fromType = "SYMBOL",
                                   toType = "ENTREZID", OrgDb = org.Hs.eg.db)

## GO analysis -----
ex_up_GO <- enrichGO(gene = ex_up_gene$ENTREZID,
                   OrgDb = org.Hs.eg.db,
                   keyType = "ENTREZID",
                   ont = "BP",  # Biological Process
                   pvalueCutoff = 0.05,
                   readable = T)

## Reduce redundancy of enriched GO terms -----
ex_up_GO = simplify(ex_up_GO)
nrow(ex_up_GO)

# Filter output of GO analysis -----
ex_up_GO <- gofilter(ex_up_GO, level = 4)


# 4. GSEA between DEG output and ex gene set -----
## Create TERM2GENE data frame (ex geneset) -----
ex_up_go <- ex_up_GO@result %>% 
  as_tibble()

ex_term2gene <- ex_up_go %>%  
  dplyr::select(ID, geneID) %>%
  tidyr::separate_rows(geneID, sep = "/") %>%
  dplyr::rename(Term = ID, Gene = geneID)


# Use bitr to convert gene symbols to ENTREZ IDs
entrez_up <- bitr(ex_term2gene$Gene, 
                        fromType = "SYMBOL", 
                        toType = "ENTREZID", 
                        OrgDb = org.Hs.eg.db)

# Join the original data with the conversion table to replace gene symbols with ENTREZ IDs
ex_term2gene <- ex_term2gene %>%
  left_join(entrez_up, by = c("Gene" = "SYMBOL")) %>%
  dplyr::select(Term, ENTREZID)

colnames(ex_term2gene)[2] <- "Gene"

## Prepare output of DEG -----
# Create a named vector of gene log2FC values for GSEA
deg_up_df<- deg_up %>%
  arrange(desc(avg_log2FC)) %>%
  pull(avg_log2FC)
names(deg_up_df) <- deg_up$gene

# Extract the gene symbols from the named vector
ex_symbols_up <- names(deg_up_df)

# Use bitr to convert gene symbols to ENTREZ IDs
deg_entrez_up <- bitr(ex_symbols_up, 
                          fromType = "SYMBOL", 
                          toType = "ENTREZID", 
                          OrgDb = org.Hs.eg.db)

# Convert deg_up_vector to a data frame
deg_up_df <- data.frame(SYMBOL = names(deg_up_df), VALUE = deg_up_df)

# Merge the conversion result with the original values using left_join
deg_entrez_up <- left_join(deg_entrez_up, deg_up_df, by = "SYMBOL")

# Create a named vector with ENTREZ IDs
gene_list_up_entrez <- setNames(deg_entrez_up$VALUE, deg_entrez_up$ENTREZID)

## Perform GSEA -----
  gsea_up <- GSEA(
    geneList = gene_list_up_entrez, # background: named vector, output of deg
    TERM2GENE = ex_term2gene, # genelist to test
    minGSSize = 300,
    maxGSSize = 1000,
    pvalueCutoff = 1,
    verbose = T,
    scoreType = "pos"
  )
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

preparing geneSet collections...
GSEA analysis...
no term enriched under specific pvalueCutoff...
Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (2.18% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

Even after setting pvalueCutoff to 1 there are no enriched genesets. I urgently need help with this. If required, I can also share the term2gene and output of DEG here. Thank you!

clusterProfiler GSEA • 1.4k views
ADD COMMENT
1
Entering edit mode
5 months ago

Based on the data you provided, your minGSSize = 300 is far too high.

library(readxl)

# read in the DE data and prep it for GSEA
deg_output <- as.data.frame(read_xlsx("~/Local_Work/temp/deg_output.xlsx"))
res_lfc <- deg_output$avg_log2FC
names(res_lfc) <- deg_output$gene
res_lfc <- res_lfc[order(res_lfc, decreasing = TRUE)]

# read the term2gene and do some summary checks
term2gene <- read_xlsx("~/Local_Work/temp/term2gene.xlsx")
sort(table(term2gene$Term), decreasing = TRUE)[1:5]
#GO:1903706 GO:0009615 GO:0032102 GO:0002697 GO:0050866 
#        22         20         20         18         18 

## your term sizes are ~20 or less, your GSEA(..., minGSSize) should reflect this

library(clusterProfiler)

gsea.res <- GSEA(geneList = res_lfc,
                 TERM2GENE = term2gene,
                 minGSSize = 0,
                 maxGSSize = 500,
                 pvalueCutoff = 1,
                 verbose = TRUE)

dim(gsea.res@result)
# [1] 27 11

length(which(gsea.res@result$p.adjust < 0.05))
[1] 9
ADD COMMENT
0
Entering edit mode

Hi, this absolutely works! Thank you so much for taking the time to help me with this problem, you really saved the day. I've been scratching my head for the past few weeks about how I can fix this. I will be more careful about setting a correct value for minGSSize in future!

ADD REPLY
0
Entering edit mode
5 months ago

Try removing scoreType = "pos" from the GSEA(...) call. I can't clock any other obvious errors in code. What's the overlap between your term2gene and your gene list, eg:

length(intersect(gene_list_up_entrez, ex_term2gene$Gene))

ADD COMMENT
0
Entering edit mode

Hi, thanks for your reply! Removing scoreType = "pos" generates another error:

using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

preparing geneSet collections...
GSEA analysis...
no term enriched under specific pvalueCutoff...
Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (2.18% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  All values in the stats vector are greater than zero and scoreType is "std", maybe you should switch to scoreType = "pos".

The overlap is actually 0...

> length(intersect(gene_list_up_entrez, ex_term2gene$Gene))
[1] 0
ADD REPLY
1
Entering edit mode

Sorry that's a typo on my end:

length(intersect(names(gene_list_up_entrez), ex_term2gene$Gene))

However, if you are running a GSEA and your ranked stat vector is all positives that might be causing issues. GSEA is designed to work with a ranked stat vector that has negative values as well as positive and doesn't typically require prefiltering.

From the GSEA User Guide:

The GSEA algorithm does not filter the expression dataset and generally does not benefit from your filtering of the expression dataset. During the analysis, genes that are poorly expressed or that have low variance across the dataset populate the middle of the ranked gene list and the use of a weighted statistic ensures that they do not contribute to a positive enrichment score. By removing such genes from your dataset, you may actually reduce the power of the statistic and processing time is rarely a factor as GSEA can easily analyze 22,000 genes with even modest processing power. However, an exception exists for RNA-seq datasets where GSEA may benefit from the removal of extremely low count genes (i.e., genes with artifactual levels of expression such that they are likely not actually expressed in any of the samples in the dataset).

Especially since you're running DE on pseudobulk data, it would make sense to include both positive and negative values.

ADD REPLY
0
Entering edit mode

Okay I see, no worries. I have rerun the command and there are only 5 intersecting genes. Is this why there are no enriched genesets?

> length(intersect(names(gene_list_up_entrez), ex_term2gene$Gene))
[1] 5
ADD REPLY
0
Entering edit mode

Oh... if this is the case you might need to double check how you're making your term2gene. Your term2gene should include most if not all the data. Also you seem to be converting both your expression vector and your term2gene from SYMBOLS to ENTREZ. Any reason you can't just keep them as symbols?

ADD REPLY
0
Entering edit mode

The error still persists even if I do not filter... I checked my term2gene and it has a lot more genes (more than 1000) compared to the output of FindMarkers. Could this be the problem? I also noticed that some genes are repeating in the term2gene as they are associated with multiple GO terms. Should I filter this using some kind of parameter to only have 1 gene per GO term?

ADD REPLY
0
Entering edit mode

Hi,

Could you upload your marker output and term2gene and share them with me?

ADD REPLY
0
Entering edit mode

Hi, Thanks a lot for your help once again. I have attached the output of FindMarkers i.e. DEG (unfiltered and unsorted) and the term2gene in the WeTransfer link below. Regarding your previous question about converting to Entrez IDs, that was a mistake on my part and I didn't need to do that.

https://we.tl/t-DcnILihnKI

ADD REPLY
0
Entering edit mode

Okay this makes a lot of sense, I was actually separating the output of DEG by avg_log2FC for UP and DOWN genes. I will not try this without filtering the output. Thanks a lot for your help, you really saved my day!

ADD REPLY

Login before adding your answer.

Traffic: 3086 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6