Entering edit mode
11 months ago
jon50250
•
0
Hello! I'm having issues generating an adequate geneList for running gseGO in clusterProfiler, using keytype = "GO"
Similar issues have been described here: No gene mapped
gseGO code is:
gse <- gseGO(geneList = gene_List,
ont = "ALL", #ont one of “BP”, “MF”, “CC” or “ALL”
OrgDb = OrgDb,
minGSSize = 3, # minimum number of genes in set (gene sets with lower than this many genes in your dataset will be ignored).
maxGSSize = 800, #maximum number of genes in set (gene sets with greater than this many genes in your dataset will be ignored).
keyType = "GO",
exponent = 1,
pvalueCutoff = 0.05,
pAdjustMethod = "BH", #one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”
verbose = TRUE,
)
and error is:
preparing geneSet collections...
--> Expected input gene ID: GO:2000274,GO:0031534,GO:0045503,GO:0008081,GO:0009415,GO:0004386
Error in check_gene_id(geneList, geneSets) :
--> No gene can be mapped....
Gene list code and output is:
Read in differential expression result table - need foldchange values here for gsea
data = read.table("res.df.txt",
header = TRUE,
row.names = 1,
#res.df.txt
#M.DEGs.res.df.txt
#F.DEGs.res.df.txt
)
data <- dplyr::select(data, c('log2FoldChange')) %>%
arrange(desc(log2FoldChange)) ## sort the list in decreasing order (required for clusterProfiler)
write.csv(data, file = "res.L2FC.csv")
#making geneList
#need to get KEGG ID and LFC in same df for geneList?
#Setting rownames of res file to first column as ID
data <- rownames_to_column(data, "ID")
#leftjoin go and res dfs by geneID
go.l2fc <- left_join(EggNOG.GO.gIDs, data)
#remove rows with NO L2FC value - the results shows all genes, some of which have GO terms
#but no significance for the deseq2 dataset <- https://stackoverflow.com/questions/4862178/remove-rows-with-all-or-some-nas-missing-values-in-data-frame
go.l2fc <- na.omit(go.l2fc)
#have multiple geneIDs/LFC now so need to figure out how to get TOP $l2fc/lfc/etc. result for each GO ID
go.l2fc <- go.l2fc[!duplicated(go.l2fc[1]),]
colnames(go.l2fc)[2] <- "GOs"
#adding random rank to same l2fc values (for geneIDs) for unique K#s that come from different transcriptIDs
go.l2fc <- go.l2fc %>% mutate(rank = rank(log2FoldChange, ties.method = "random")) %>% dplyr::arrange(desc(rank))
## assume 1st column is ID
## 2nd column is FC
## feature 1: numeric vector (L2FC value)
geneList = go.l2fc[,3]
## feature 2: named vector (GO terms or geneIDs??)
names(geneList) = as.character(go.l2fc[,2])
#feature 2.5: Adding geneNames to middle column?
## feature 3: decreasing order
geneList = sort(geneList, decreasing = TRUE)
and the geneList/go.l2fc files look like this (see screenshots)
Thank you for all the help!
It's very unclear why you are using GO terms as gene ids. This should not be the case at all.
The error says the expected input geneID is GO terms though?
I remade the geneList with my geneIDs and reran. Still got the same error. From the manual, I NEED to have entrezIDs for my GO terms then?
Maybe I need to run enricher fucntion then? from this post: Clusterprofiler error: No gene can be mapped
As far as I can tell the default GO term mapping in Bioconductor Org.eg.db is to Entrez ids. If you use
gseGO
you are locked into using an Org.eg.db, so yes it looks like you need Entrez ids, but note the ranked gene list should be a named numerical vector with the value used to rank the genes where names are the gene names (this is not well documented in the clusterProfiler manual).For more flexibility regarding gene ids you can use the generic
clusterProfiler::GSEA
function where you provide theTERM2GENE
andTERM2NAME
mappings for your gene sets. I can provide some example code for this but don't have time at this exact moment.Thank you! I can convert my gene Ids (g####) to entrez IDs (I think). I ran EggNOG on my genes for annotation - there may be a way to extract entrezIDs from this output. Would the output look like what I have here though, just with entrezIDs? For my geneList, I would need Names to be the entrezIDs associated with each set of GO terms and a L2FC value to rank the genes?
Will the multiple GO:terms-per-gene be suitable for gseGO too? I ran gseKEGG with a single Knumber and L2FC value (see screenshot), even though my genes were annotated with many K numbers.
I would really appreciate some code when you have the time, tysm
Please keep your follow up questions to a minimum, it is overwhelming to try to answer so many questions.
DO NOT use log2 fold change alone for the ranking, you should take statistical significance into account. I use DESeq2 for DE testing and rank genes by the
stat
metric returned from DESeq test. Other option are described at https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html?Run_GSEA_Pagemost genes will map to more than one go term, the suitability of this is debated in the field but it is not something you need to account for in your code (i.e. you shouldn't need to filter)
Here is some example code I can provide:
Awesome, thanks! I will use stat to rank my genes. I will not spam in the future, apologies!
I have the GO:terms, but I need to generate TERM2NAME (I see this is optional for GSEA()). I will checkout this thread: https://support.bioconductor.org/p/9144621/