Question

PFAM, KEGG enrichment for a non-model organism

0

7.8 years ago

mjp ▴ 30

Hi there,

There are a couple of posts circulating, but I couldn't find a definitive answer for a non-model organism scenario.

How would one go about finding out which terms, be they PFAM, KEGG or others, are enriched in a group of genes of interest, given a gene universe as the background against which enrichment is calculated?

I am familiar with the topGO approach, which accepts the genes of interest in a simple tab-delimited format of IDs of some kind (these can be made-up names), and the universe as the same IDs, each with its GO IDs listed on the same row, separated by commas.

universe:

gene1 GO:0003677, GO:0004803, GO:0006313 ...

gene2 GO:0000160, GO:0003677, GO:0000160 ...

...

genes of interest:

gene1

gene2

I've found myself wondering whether there is a package that can take any kind of term (PFAM, KEGG, GO, XX) and test whether it is significantly enriched in a subset of IDs of interest relative to a broader set. Annotation could happen at a later stage.

Any assistance, suggestions, pointers would be appreciated.

enrichment domain non model organism • 5.2k views

ADD COMMENT • link 7.7 years ago by mjp ▴ 30

score 1 · Answer 1 · 2017-03-09

1

7.8 years ago

Lars Juhl Jensen 11k

I do not know of a tool that does precisely what you describe, i.e. one that lets you specify the annotations for all the genes and then runs the enrichment analysis on that.

However, if what you want is just to look for enriched KEGG maps and protein domains, you could use the enrichment functionality in STRING. Just go to the website, select "Multiple proteins", paste in the names of your genes of interest, select your organism, and click through till you get a network. On the network page, click the "Analysis" tab below the network to show the enrichment results. STRING does not cover every sequenced organism, but with more than 2000 genomes in the current version, it covers a lot more than just model organisms. So if your organism of interest is among them, it would seem the easy solution.

ADD COMMENT • link 7.8 years ago by Lars Juhl Jensen 11k

0

Thank you for your suggestions. STRING looks quite impressive.

I'm currently looking at some novel microbes and fungi. I do the gene calling myself, so most of the genes are not initially publicly identifiable. I could perhaps use sequences as input to STRING, but I would like to do things in a high-throughput manner, so a predefined organism set would not really suit me. I also don't see why an enrichment method should depend on any particular organism, other than to supply a predefined gene universe.

Thank you!

ADD REPLY • link 7.8 years ago by mjp ▴ 30

score 0 · Answer 2 · 2017-05-04

0

7.7 years ago

mjp ▴ 30

I have decided to use the most general approach, one that does not depend on any third-party software: Fisher's exact test. It is fairly straightforward and applicable to any sort of database.

Thanks to all who contributed.
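For anyone wanting to try the same idea, here is a minimal sketch of a per-term, one-sided Fisher's exact test, computed as the equivalent hypergeometric upper tail in plain Python. The function names and the toy gene and term IDs are made up for illustration, and this is not necessarily how the test was implemented in this thread. It assumes the universe is every gene in the annotation table, including genes with no terms.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term.

    This equals the p-value of a one-sided ("greater") Fisher's exact test
    on the corresponding 2x2 table.
    """
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def term_enrichment(annotations, genes_of_interest):
    """annotations: dict mapping gene ID -> set of term IDs (the universe).

    Returns a dict mapping term -> (k, K, p_value), where k is the number of
    selected genes carrying the term and K the number in the whole universe.
    """
    universe = set(annotations)
    # Genes of interest outside the universe are ignored (an assumption).
    selected = universe & set(genes_of_interest)
    N, n = len(universe), len(selected)

    # Invert the gene -> terms table into term -> genes.
    term_to_genes = {}
    for gene, terms in annotations.items():
        for term in terms:
            term_to_genes.setdefault(term, set()).add(gene)

    results = {}
    for term, genes in term_to_genes.items():
        K = len(genes)
        k = len(genes & selected)
        results[term] = (k, K, hypergeom_upper_tail(k, K, n, N))
    return results


# Toy example with hypothetical IDs; terms could be PFAM, KEGG, GO, etc.
toy_annotations = {
    "gene1": {"termA", "termC"},
    "gene2": {"termA"},
    "gene3": {"termB"},
    "gene4": {"termB"},
    "gene5": set(),
    "gene6": {"termB"},
}
toy_results = term_enrichment(toy_annotations, ["gene1", "gene2"])
```

Because the test only needs counts, the same code works for any annotation source, as long as every gene in the universe is listed with its terms. The resulting p-values are raw, one per term.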

ADD COMMENT • link 7.7 years ago by mjp ▴ 30

0

You're very welcome, but Fisher's exact test will only get you the first step. Don't forget correction for multiple testing.
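As an illustration of that second step, here is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. Benjamini-Hochberg is only one possible correction and is an assumption here, since the thread does not say which method to use; the function name is made up. It computes the same quantity as R's p.adjust with method = "BH".

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity and
    # capping the adjusted values at 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# One raw p-value per tested term (made-up numbers).
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

The number of tests m should be the number of terms actually tested, not only those that looked interesting beforehand.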

ADD REPLY • link 7.7 years ago by Lars Juhl Jensen 11k

0

I do adjust my p-values for multiple testing :) I use standard R packages to achieve this. Thanks!

ADD REPLY • link 7.7 years ago by mjp ▴ 30