Hello everyone,
I have been tasked to carry out a GSEA analysis of a predefined set of genes against a background of the output of differential gene expression generated by Seurat FindMarkers DESeq2
.
I have tried exploring clusterProfiler
to achieve this but I am struggling to find an appropriate workflow. I am also getting some errors and after days of trying to debug, I am at the end of my rope. Could someone please help me with this. I have attached my code below and could really use someone's help.
# 1. Select only UP genes and do DEG -----
ex_up <- ex %>%
dplyr::select(Hu.ESG, Direction, ESG_ATAC_UP) %>%
filter(Direction == "UP" & ESG_ATAC_UP == 1 & Hu.ESG != "NA") # Remove NAs
## Handle duplicates -----
which(duplicated(ex_up$Hu.ESG))
sum(duplicated(ex_up$Hu.ESG))
ex_up <- ex_up[!duplicated(ex_up$Hu.ESG), ]
# 2. Make pseudobulk and do DEG -----
sobj_pseudo <- AggregateExpression(sobj, assays = "RNA", return.seurat = T,
group.by = c("Tissue", "sampleType"))
sobj_pseudo <- NormalizeData(sobj_pseudo)
sobj_pseudo <- ScaleData(sobj_pseudo, rownames(sobj_pseudo))
sobj_pseudo <- SetIdent(sobj_pseudo, value = "orig.ident")
# Round to integer values
sobj_pseudo@assays$RNA@counts <- round(sobj_pseudo@assays$RNA@counts)
# Compute DEG between healthy and cancer
deg <- FindMarkers(sobj_pseudo,
ident.1 = "tumor",
ident.2 = "adjacent normal lung tissue",
test.use = "DESeq2") %>%
tibble::rownames_to_column(var = "gene")
deg_up <- deg %>%
arrange(desc(avg_log2FC)) %>%
filter(avg_log2FC > 0 & p_val < 0.05)
# 3. Perform GO analysis for the ex gene sets -----
# Vector of ex UP genes
ex_up_gene <- ex_up$Hu.ESG
# Convert UP ex genes to Entrez IDs
ex_up_gene <- bitr(ex_up_gene, fromType = "SYMBOL",
toType = "ENTREZID", OrgDb = org.Hs.eg.db)
## GO analysis -----
ex_up_GO <- enrichGO(gene = ex_up_gene$ENTREZID,
OrgDb = org.Hs.eg.db,
keyType = "ENTREZID",
ont = "BP", # Biological Process
pvalueCutoff = 0.05,
readable = T)
## Reduce redundancy of enriched GO terms -----
ex_up_GO = simplify(ex_up_GO)
nrow(ex_up_GO)
# Filter output of GO analysis -----
ex_up_GO <- gofilter(ex_up_GO, level = 4)
# 4. GSEA between DEG output and ex gene set -----
## Create TERM2GENE data frame (ex geneset) -----
ex_up_go <- ex_up_GO@result %>%
as_tibble()
ex_term2gene <- ex_up_go %>%
dplyr::select(ID, geneID) %>%
tidyr::separate_rows(geneID, sep = "/") %>%
dplyr::rename(Term = ID, Gene = geneID)
# Use bitr to convert gene symbols to ENTREZ IDs
entrez_up <- bitr(ex_term2gene$Gene,
fromType = "SYMBOL",
toType = "ENTREZID",
OrgDb = org.Hs.eg.db)
# Join the original data with the conversion table to replace gene symbols with ENTREZ IDs
ex_term2gene <- ex_term2gene %>%
left_join(entrez_up, by = c("Gene" = "SYMBOL")) %>%
dplyr::select(Term, ENTREZID)
colnames(ex_term2gene)[2] <- "Gene"
## Prepare output of DEG -----
# Create a named vector of gene log2FC values for GSEA
deg_up_df<- deg_up %>%
arrange(desc(avg_log2FC)) %>%
pull(avg_log2FC)
names(deg_up_df) <- deg_up$gene
# Extract the gene symbols from the named vector
ex_symbols_up <- names(deg_up_df)
# Use bitr to convert gene symbols to ENTREZ IDs
deg_entrez_up <- bitr(ex_symbols_up,
fromType = "SYMBOL",
toType = "ENTREZID",
OrgDb = org.Hs.eg.db)
# Convert deg_up_vector to a data frame
deg_up_df <- data.frame(SYMBOL = names(deg_up_df), VALUE = deg_up_df)
# Merge the conversion result with the original values using left_join
deg_entrez_up <- left_join(deg_entrez_up, deg_up_df, by = "SYMBOL")
# Create a named vector with ENTREZ IDs
gene_list_up_entrez <- setNames(deg_entrez_up$VALUE, deg_entrez_up$ENTREZID)
## Perform GSEA -----
gsea_up <- GSEA(
geneList = gene_list_up_entrez, # background: named vector, output of deg
TERM2GENE = ex_term2gene, # genelist to test
minGSSize = 300,
maxGSSize = 1000,
pvalueCutoff = 1,
verbose = T,
scoreType = "pos"
)
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
preparing geneSet collections...
GSEA analysis...
no term enriched under specific pvalueCutoff...
Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (2.18% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
Even after setting pvalueCutoff
to 1 there are no enriched genesets. I urgently need help with this. If required, I can also share the term2gene and output of DEG here. Thank you!
Hi, thanks for your reply! Removing
scoreType = "pos"
generates another error:The overlap is actually 0...
Sorry that's a typo on my end:
However, if you are running a GSEA and your ranked stat vector is all positives that might be causing issues. GSEA is designed to work with a ranked stat vector that has negative values as well as positive and doesn't typically require prefiltering.
From the GSEA User Guide:
Especially since you're running DE on pseudobulk data, it would make sense to include both positive and negative values.
Okay I see, no worries. I have rerun the command and there are only 5 intersecting genes. Is this why there are no enriched genesets?
Oh... if this is the case you might need to double check how you're making your term2gene. Your term2gene should include most if not all the data. Also you seem to be converting both your expression vector and your term2gene from SYMBOLS to ENTREZ. Any reason you can't just keep them as symbols?
The error still persists even if I do not filter... I checked my
term2gene
and it has a lot more genes (more than 1000) compared to the output ofFindMarkers
. Could this be the problem? I also noticed that some genes are repeating in theterm2gene
as they are associated with multiple GO terms. Should I filter this using some kind of parameter to only have 1 gene per GO term?Hi,
Could you upload your marker output and term2gene and share them with me?
Hi, Thanks a lot for your help once again. I have attached the output of
FindMarkers
i.e. DEG (unfiltered and unsorted) and theterm2gene
in the WeTransfer link below. Regarding your previous question about converting to Entrez IDs, that was a mistake on my part and I didn't need to do that.https://we.tl/t-DcnILihnKI
Okay this makes a lot of sense, I was actually separating the output of DEG by
avg_log2FC
for UP and DOWN genes. I will not try this without filtering the output. Thanks a lot for your help, you really saved my day!