I have a list (thousands) of mouse genes that I would like to test for GO term enrichment.
Would it be OK to use the whole set of mouse genes as the baseline (background) when testing for enrichment of GO terms? Or should I randomly subsample the same number of genes as in the test list?
I think the answer depends on how you came to get your list of 1000 genes. For example, in a microarray experiment you would have a chip with e.g. 30 thousand genes on it, and maybe 1000 of those genes would be called differentially expressed. Your background list should contain only those 30K genes and not, for example, 50 genes that are not on the chip, because they were never measured. For RNA-Seq it is harder because there are inherent biases in how each gene is measured (e.g. the length bias described in the GOseq paper), and some people use complicated statistical tests on top of that, which can add more bias.
If there are no significant biases in how you obtained your 1000 genes, then I think using the entire set as the background would give you the clearest signal, because it contains the most information. I could imagine findings that are significant against the whole set not being significant against random subsets, simply because smaller numbers lack the statistical power to detect them. Some of the categories have very few genes and high statistical noise.
If there are biases in how you obtained your 1000 genes then you need to account for them, which could be as simple as using an appropriate background gene list or a big huge bioinformatics headache depending on the experiment.
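To make the "appropriate background" point concrete, here is a minimal sketch of a one-sided hypergeometric enrichment test in which everything is first restricted to the genes that were actually measured. The function name and the toy gene IDs are made up for illustration; real tools apply their own annotation handling and multiple-testing correction on top of this.

```python
from math import comb

def hypergeom_enrichment_p(background, study, term_genes):
    """One-sided hypergeometric p-value P(X >= k) for a single GO term.

    background: genes that were measured (e.g. all genes on the chip)
    study:      genes of interest (e.g. differentially expressed genes)
    term_genes: genes annotated to the GO term
    Genes outside the background are dropped before counting.
    """
    background = set(background)
    study = set(study) & background      # only measured genes can be "hits"
    term = set(term_genes) & background  # ignore annotations that are off the chip
    N, K, n = len(background), len(term), len(study)
    k = len(study & term)
    # Sum the probabilities of seeing k, k+1, ... annotated genes in the study list.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)

# Toy example: 20 genes on the chip, 5 in the study list.
chip = [f"g{i}" for i in range(20)]
study = ["g0", "g1", "g2", "g3", "g4"]
term = ["g0", "g1", "g2", "g10", "not_on_chip"]  # last gene was never measured
p = hypergeom_enrichment_p(chip, study, term)
```

Passing the whole genome as `background` instead of the chip list would inflate N and change the p-value, which is the bias described above.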
My gene lists are always associated with a value: a p-value, an enrichment score, or some other factor.
So I use this value to take the top, let's say, 500, 1000, or 3000 genes, depending on how many items are in the list.
But even if you take a very large list, it shouldn't matter much, as most tools, such as DAVID, display the top enriched categories and also show how many genes make up each category in the user's list and in the database.
The best thing would be to sample two or three times and test. If you get varying results, it is better to pick a cut-off and run the GO analysis on that list.
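As a rough sketch of comparing cut-offs, the snippet below ranks genes by their associated value and counts how many genes of one category fall in the top N for several N. It assumes a smaller value means a stronger gene (as with p-values); the function names and the toy data are hypothetical.

```python
def top_n_genes(gene_scores, n):
    """Return the n genes with the smallest score (e.g. smallest p-value)."""
    ranked = sorted(gene_scores, key=gene_scores.get)
    return ranked[:n]

def term_hits_by_cutoff(gene_scores, term_genes, cutoffs):
    """Count category members among the top-N genes for each cut-off N."""
    term = set(term_genes)
    return {n: len(term.intersection(top_n_genes(gene_scores, n))) for n in cutoffs}

# Toy data: gene -> p-value, and one category with three annotated genes.
scores = {"a": 0.001, "b": 0.01, "c": 0.02, "d": 0.2, "e": 0.5, "f": 0.9}
hits = term_hits_by_cutoff(scores, {"a", "c", "f"}, [2, 4, 6])
```

If the top categories change a lot between cut-offs, that is the varying-results case mentioned above.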
Thanks Michele. The 1000-gene list this time does indeed come from a microarray chip. Just so I get it right: my baseline in this case should contain only the genes that the chip was designed to measure, right?
Yes. That's how I was taught to do it.