Question

GOSemSim - difference between mgeneSim and mgoSim


21 months ago

Giovanni ▴ 10

Hi everyone, I'm trying to use GOSemSim but I'm struggling with its results. I started with the mgeneSim function and passed it a vector of 11,000+ Entrez gene IDs. It gave me a similarity matrix in which some rows and columns contain only the value 1. I think this happens because, after mapping the Entrez IDs to GO, the GO ID sets of some pairs of different genes have a GO ID in common.

To solve this problem, I tried to build the similarity matrix without mgeneSim, filling each entry with the output of mgoSim for the corresponding pair of genes. I estimated that building the matrix this way would take about 30 days, while mgeneSim needs just a couple of hours.

Given the same parameters (measure="Wang", combine="BMA"), mgeneSim and mgoSim return different results. Do you know why?

How can I get consistent results from mgeneSim? Is it possible to ignore the GO IDs that two genes share?

A little example:

mgeneSim(c("3613", "83541", "5651", "23492", "157310"), semData=hsGO, measure="Wang", combine = "BMA", verbose = TRUE)

outputs:

[image of the output, not preserved]



21 months ago

DareDevil ★ 4.4k

mgeneSim: mgeneSim calculates the pairwise semantic similarity among a list of genes based on their functional annotations. It uses the Gene Ontology (GO), which provides structured information about gene functions. Each gene is first mapped to its annotated GO terms; the similarities between those terms are computed with the chosen measure (for example the graph-based Wang method or the information-content-based Resnik method) and then combined into a single score per gene pair with the chosen combine strategy (for example BMA).

mgoSim: On the other hand, mgoSim calculates the semantic similarity between two sets of GO terms. It does no gene-to-GO mapping; it only sees the GO IDs you pass in. With measure="Wang", the similarity is derived from the structure of the GO graph: each term is represented by itself and all of its ancestors, weighted by the type of relationship (is_a, part_of) that links them, and two terms are compared through the ancestors they share. Unlike Resnik and the other information-content-based measures, the Wang measure does not use information content or the number of annotated genes.

mgeneSim calculates the similarity between genes, while mgoSim calculates the similarity between GO terms.
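One way to see how the two functions relate, and to check where a discrepancy comes from, is to compare them on a single gene pair. The sketch below assumes that geneSim returns a list holding the score together with the GO terms it actually used (elements geneSim, GO1 and GO2), and feeds exactly those terms to mgoSim. As far as I know, geneSim and mgeneSim also take a drop argument whose default removes IEA annotations, and they only use terms from the ontology stored in semData; if your own Entrez-to-GO mapping kept IEA terms or mixed ontologies, mgoSim would be working on different term sets. Treat this as a diagnostic sketch, not a definitive explanation.

```r
library(GOSemSim)
library(org.Hs.eg.db)

# Wang does not need information content, so computeIC = FALSE is enough.
hsGO <- godata("org.Hs.eg.db", ont = "BP", computeIC = FALSE)

# Gene-level similarity for one pair of Entrez IDs from the question.
pair <- geneSim("3613", "83541", semData = hsGO,
                measure = "Wang", combine = "BMA")

if (is.list(pair)) {
  # Term-level similarity on exactly the GO term sets geneSim used.
  term_level <- mgoSim(pair$GO1, pair$GO2, semData = hsGO,
                       measure = "Wang", combine = "BMA")
  print(c(gene_level = pair$geneSim, term_level = term_level))
} else {
  # Assumption: a non-list result means no usable annotation was found.
  message("No usable GO annotation for at least one of the two genes.")
}
```

If the two numbers agree here but not in your own loop, the difference most likely lies in the GO ID sets you passed to mgoSim.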


21 months ago

Giovanni ▴ 10

So I think mgeneSim is what I need for my analysis.

How can I handle the rows and columns full of 1s? I think they appear because mgeneSim maps the Entrez IDs to GO, and the GO ID sets of some pairs of different genes have a GO ID in common. Do you have a better explanation, or am I right?
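The shared-GO-ID hypothesis can be checked on paper. The sketch below assumes that BMA (best-match average) is the average of all row maxima and column maxima of the term-by-term similarity matrix, which is how the GOSemSim documentation describes it. The matrix values and GO labels are made up for illustration; the first row and first column stand for a GO term that both genes share, so that entry is 1.

```r
# Hypothetical term-level similarity matrix: rows are GO terms of gene A,
# columns are GO terms of gene B. All values are invented for illustration.
term_sim <- matrix(
  c(1.0, 0.2, 0.1,
    0.3, 0.4, 0.2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("GO:a1", "GO:a2"), c("GO:b1", "GO:b2", "GO:b3"))
)

# Best-match average: average the best match of every row and every column.
bma <- function(m) {
  (sum(apply(m, 1, max)) + sum(apply(m, 2, max))) / (nrow(m) + ncol(m))
}

# "max" strategy for comparison: the single best term pair decides the score.
max_combine <- function(m) max(m)

print(bma(term_sim))          # (1.0 + 0.4 + 1.0 + 0.4 + 0.2) / 5 = 0.6
print(max_combine(term_sim))  # 1
```

Under this definition a single shared term gives a score of 1 with combine="max", but not with combine="BMA"; with BMA a pair only reaches 1 when every term of each gene has a perfect match in the other gene, which in practice means the two annotation sets are the same.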

Thank you so much


score 4 · Accepted Answer · 2023-07-13

For Entrez IDs, suppose your gene set is in "gene_list.txt":

Gene ID
A2M   2
TNF   7124
....    .....

Then run the following code:

library(GOSemSim)
library(org.Hs.eg.db)
library(data.table)

# Build the GO annotation data; the Wang measure does not need information content
hsGO2 <- godata('org.Hs.eg.db', ont="BP", computeIC=FALSE) # ont can also be "MF" or "CC"

# Read the gene ID file (columns: Gene, ID)
data <- read.table(file = "gene_list.txt", header = TRUE)

# Store it as a data.table
data <- as.data.table(data)

# Run GOSemSim; Entrez IDs are passed as character strings
result <- mgeneSim(as.character(data$ID), semData=hsGO2, measure="Wang", combine="BMA", verbose=FALSE)

# Convert the result matrix to a long-format data frame
res <- as.data.frame(as.table(result))

# Write the result to a file
write.table(res, "result_bp_entrez.txt", quote = FALSE, sep="\t")

For gene symbols, use the same "gene_list.txt" and the data object read above:

# Build the GO annotation data keyed by gene symbol
hsGO2 <- godata('org.Hs.eg.db', keytype = "SYMBOL", ont="BP", computeIC=FALSE)

# Run GOSemSim on the symbol column
result <- mgeneSim(as.character(data$Gene), semData=hsGO2, measure="Wang", combine="BMA", verbose=FALSE)

# Convert the result matrix to a long-format data frame
res <- as.data.frame(as.table(result))

# Write the result to a file
write.table(res, "result_bp_genes.txt", quote = FALSE, sep="\t")
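The as.data.frame(as.table(result)) step above lists every cell of the matrix, so each pair appears twice and every gene is also paired with itself. If you only want one row per distinct pair, a base R sketch like the following may help. The toy matrix stands in for the mgeneSim result; its gene IDs and values are made up, and the column names gene1, gene2 and similarity are my own choice.

```r
# Toy symmetric similarity matrix standing in for the mgeneSim() result.
ids <- c("2", "7124", "3613")
result <- matrix(c(1.0, 0.5, 0.2,
                   0.5, 1.0, 0.7,
                   0.2, 0.7, 1.0),
                 nrow = 3, dimnames = list(ids, ids))

# Blank out the diagonal and the lower triangle so each pair appears once.
upper <- result
upper[lower.tri(upper, diag = TRUE)] <- NA

# Long format: one row per gene pair, without the blanked-out cells.
pairs <- as.data.frame(as.table(upper), stringsAsFactors = FALSE)
names(pairs) <- c("gene1", "gene2", "similarity")
pairs <- pairs[!is.na(pairs$similarity), ]
print(pairs)
```

For three genes this leaves three rows instead of nine, which also makes the output file much smaller for a list of 11,000+ genes.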