There are a lot of R-based pathway analysis tools. There are also supporting data packages for the actual pathways from GO, KEGG, or Reactome. However, support for Molecular Signatures Database (MSigDB) from the GSEA within the R ecosystem is fairly limited. You have to import GMT files, re-structure the resulting objects, and potentially convert genes from human to other species. All of these are relatively trivial, but it adds up. As Hadley Wickham said: "you should consider writing a function whenever you've copied and pasted a block of code more than twice". Functions are easy to share, but datasets are trickier. So I made an R package that includes both: msigdbr (on CRAN and GitHub).
With msigdbr, you can retrieve MSigDB gene sets:
- in an R-friendly format (a "tidy" data frame with one gene per row that work well with the tidyverse packages)
- as both gene symbols and Entrez Gene IDs
- for multiple frequently studied organisms (not everyone works with exclusively human data and it's easy run into problems retrieving gene orthologs)
- that can be used and shared in a single script (without requiring additional files or an active internet connection)
There is a vignette available with more info and usage examples.
There are a few other similar existing solutions, but I couldn't find any that addressed all of my pain points. I also just wanted to make an R package and this seemed like a good idea that was simple enough to start with. This probably doesn't need to be explicitly stated, but any feedback is welcome, which is why it's good to post here.
Hey Igor great package!
I was hoping to use it for some fly analysis but had two quick questions about this if you have time; in the vignette you have:
Could you maybe clarify a little more how you are making these connections in your package?
I see at the bottom:
So that suggests to me that you are taking the human genes and finding their most likely orthologs (in each given species) based on the overlap between multiple databases. What happens in cases where the closest homolog isn't that "great" of a match (even with overlaps across different DBs)? Is there any specific scoring threshold (AA similarity etc...) or is it just based on database annotations?
Just as an example looking at the major ping-pong piRNA Drosophila protein genes we have: AGO3 and aub
The homologs for humans from the package are:
AGO3: PIWIL1 and PIWIL2
aub: PIWIL1
The current literature seems to suggest that AGO3 is slightly more similar PIWIL2. Obviously I wouldn't expect someone to go in and manually check each entry - furthermore for computational reproducibility I think it makes sense to keep the standard association methods universal. So please don't take this as criticism - just wondering how this would affect analysis.
Also how has this package been working for you? Do researchers/collaborators seem to agree with the results in general?
EDIT: Also quick suggestion. It might be nice to include the human-readable "brief description" for each of the annotation sets. Maybe an extra $description column in the m_df object.
Example:
CSR_EARLY_UP.V1_UP == Genes up-regulated in early serum response of CRL 2091 cells (foreskin fibroblasts).
It was pretty easy for me to do myself by downloading the card from msigdb but just thought it might be good to already include in a "all-in-one" package such as this one - especially if less "computationally experienced" people want to use it.
Thanks for trying the package!
I only keep the homologs that are listed in multiple databases and keep the one that is confirmed by most (some genes can return dozens of homologs across different resources). I don't expect every gene to be correct, but using gene sets should hopefully balance that out.
I work mostly with mammalian genomes, so the homologs are usually very obvious and the pathways should be similar. I can't really comment on more distant species. For some context, some of the MSigDB gene sets are based on zebra fish data, so cross-species comparisons are acceptable to some degree.
Hi Igor and All, thank you for developing this tool! I am able to access 2 MSigDB gene sets with the following:
This gives me C2 and C3 MSigDBs in one object. I'm interested in using these two gene set collections for GSVA analysis (Hanzelmann et al 2013 and Hanzelmann et al 2019). This package takes as input either
GeneSetCollection
fromGSEAbase
or a list object "with names identifying gene sets and each entry of the list containing the gene identifiers of the genes forming the corresponding set." My question is if there's a way to generate alist()
object fromm_df.c2.c3
here such that it's a list of gene sets with EntrezIDs in each gene set? Or if it's possible to go directly fromm_df.c2.c3
to aGeneSetCollection
object?Thank you in advance for your time!
I prefer to use a standard list as opposed to a more specialized object
GeneSetCollection
. You can create a list with:split(x = m_df.c2.c3$entrez_gene, f = m_df.c2.c3$gs_name)
if you want to use Entrez IDs. You can also use gene symbols. I clarify this because after seeing hundreds of expression datasets, I have never seen any with Entrez IDs. They only seem to come up in the context of pathways analysis, making many tutorials unnecessarily complicated, forcing users to convert to and/or from Entrez IDs.If you are wondering, there is actually a example in the vignette (for running fgsea): https://cran.r-project.org/web/packages/msigdbr/vignettes/msigdbr-intro.html
This worked, thank you! I'm using Entrez IDs for pathway analysis as you alluded to. My initial expression matrix is a
cells
xfeatures
matrix withfeatures
as gene symbols. However, thegsva
package recommends Entrez IDs, so I converted the gene symbols to Entrez IDs to keep it all consistent. I'm going to take the list of gene sets from here to generate acells
xgene sets
matrix for further down stream analysis (clustering, etc...). Thanks again for the speedy reply!Happy to hear it worked!
Does
gsva
recommend Entrez IDs or do they just use them in their examples (possibly for historical reasons)?Every time you convert your genes from one format to another, you lose some information. If you already have gene symbols, you may as well keep everything as gene symbols if you can.
Thanks for pointing that out. I actually did lose some genes when I went from
symbols
toEntrez ID
. I just went back to read the paper. Initially, I thought thatgsva
only uses Entrez ID and leaves it up to the user to make the conversion, unless geneset is provided as aGeneSetCollection
, in which case it does the transformation internally. But in re-reading (quote from Hanzelmann et al 2019):Now that I found
msigdbr
, I will go back to leave myExpressionSet
ingene symbol
format and use mouse gene symbols frommsigdbr
to see ifgsva()
will handle it. Thanks again!Just to follow up: I was able to keep all of my genes without having to convert from gene symbol to Entrez ID. Kept everything in gene symbol format and
gsva()
ran without issues. Thanks again!