Hello~ I'm a newbie in GO analysis. I'm trying to run a GO enrichment with Ontologizer, and I'm confused about the population set and the study set.
I'm wondering whether un-annotated gene IDs in the population set and the study set should be included in the GO enrichment.
The universe, aka the "background set" (your population set), is generally accepted to be the set of genes that were measured in your experiment and are annotated to at least one GO term. Your study set (the genes of interest) should therefore also be drawn from this population. Think about how you'd set up the contingency table: for each category you compare the count of genes in that category against the count of genes outside it, among the genes of interest and among the rest of the universe. Allowing genes that can never be in a category (because they haven't been annotated) isn't a fair comparison for determining enrichment.
As a newbie, you might start by consulting the literature. There's a lot of it, but two papers of interest would be: (1) Khatri P, Drăghici S. (2005) Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics; and (2) Rhee SY, Wood V, Dolinski K, Drăghici S. (2008) Use and misuse of the gene ontology annotations. Nat Rev Genet.
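To make the contingency-table argument concrete, here is a minimal Python sketch, not what Ontologizer does internally. The gene IDs, the GO term labels and the `go_annotations` mapping are all made-up toy data, and `hypergeom_upper_tail` is a hypothetical helper that computes a one-sided hypergeometric (Fisher-style) p-value for a single term with no multiple-testing correction.

```python
from math import comb

def hypergeom_upper_tail(pop_size, pop_in_term, study_size, study_in_term):
    """P(X >= study_in_term) when drawing study_size genes from the population."""
    upper = min(pop_in_term, study_size)
    favourable = sum(
        comb(pop_in_term, k) * comb(pop_size - pop_in_term, study_size - k)
        for k in range(study_in_term, upper + 1)
    )
    return favourable / comb(pop_size, study_size)

# Toy data (assumption): genes measured in the experiment.
measured = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]

# Toy annotation table (assumption): g7 and g8 have no GO annotation.
go_annotations = {
    "g1": {"GO:A"}, "g2": {"GO:A"}, "g3": {"GO:A"},
    "g4": {"GO:B"}, "g5": {"GO:B"}, "g6": {"GO:B"},
}

# Genes of interest as they come out of the experiment.
study_raw = {"g1", "g2", "g4", "g7"}

# Population set: measured genes with at least one GO annotation.
population = {g for g in measured if go_annotations.get(g)}

# Study set: genes of interest that are also in the population.
study = study_raw & population

term = "GO:A"
pop_in_term = sum(term in go_annotations[g] for g in population)
study_in_term = sum(term in go_annotations[g] for g in study)

p_value = hypergeom_upper_tail(
    len(population), pop_in_term, len(study), study_in_term
)
print(f"{term}: {study_in_term}/{len(study)} in study, "
      f"{pop_in_term}/{len(population)} in population, p = {p_value:.3f}")
```

The un-annotated genes g7 and g8 are dropped from both sets before any counting, so every gene that enters the table could in principle belong to a category.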
You could also do a thought experiment. Your set of interest is likely to be something manageable (e.g. a few hundred genes?), and this number is often chosen regardless of the size of the background set. Imagine you're studying a novel organism, and one day someone dumps a few thousand more genes into your "background set" because they just ran another gene prediction program. Those new genes have no annotations, so they only inflate the background; the few well-annotated genes that have been showing up in your experiment would then look far more enriched, and their significance could change quite a bit for no good reason.
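The thought experiment can be tried numerically. The sketch below uses invented counts (a 200-gene study set with 10 genes in some term, a term with 50 annotated genes, and a background that grows from 5000 to 8000 genes), and `hypergeom_upper_tail` is again a hypothetical helper for a one-sided hypergeometric p-value, not Ontologizer's implementation.

```python
from math import comb

def hypergeom_upper_tail(pop_size, pop_in_term, study_size, study_in_term):
    """P(X >= study_in_term) when drawing study_size genes from the population."""
    upper = min(pop_in_term, study_size)
    favourable = sum(
        comb(pop_in_term, k) * comb(pop_size - pop_in_term, study_size - k)
        for k in range(study_in_term, upper + 1)
    )
    return favourable / comb(pop_size, study_size)

# Invented counts (assumptions for illustration only).
study_size = 200       # genes of interest
study_in_term = 10     # genes of interest annotated to the term
pop_in_term = 50       # background genes annotated to the term
annotated_background = 5000
extra_unannotated = 3000   # newly predicted genes with no GO annotation

p_original = hypergeom_upper_tail(
    annotated_background, pop_in_term, study_size, study_in_term
)
p_inflated = hypergeom_upper_tail(
    annotated_background + extra_unannotated,
    pop_in_term, study_size, study_in_term,
)

print(f"annotated background only: p = {p_original:.3g}")
print(f"with un-annotated genes:   p = {p_inflated:.3g}")
```

Nothing about the experiment or the annotations changed between the two calls, yet the second p-value is smaller, because the extra genes lower the expected number of term members in the study set.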