The most common way of performing GO enrichment (a hypergeometric test on a selected subset of genes) is straightforward enough, but I'm finding a lot of papers that propose alternative methods which take the hierarchy of GO terms or gene scores into account.
I don't quite understand the math behind each method, and obviously each paper claims that its method is better than the previous ones. I used the topGO package to test a couple, and the enrichment lists I generate show little similarity.
Could anyone provide practical guidelines on which method(s) would give the most biologically relevant results? One caveat is that I am integrating this analysis into a larger automated pipeline, so interactive web-based tools are out.
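For reference, the plain hypergeometric test that most of these tools build on is small enough to write out directly. A sketch in standard-library Python, with made-up illustration counts (not from any real gene set):

```python
from math import comb

def hypergeom_pval(k, n, K, N):
    """P(X >= k): probability of seeing at least k annotated genes when
    sampling n genes from a universe of N genes, of which K carry the GO term."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# Hypothetical example: 40 of 200 selected genes carry a term that
# annotates 400 of the 10000 genes in the universe.
p = hypergeom_pval(40, 200, 400, 10000)
print(f"{p:.3g}")
```

Because this uses exact integer arithmetic, the p-values are easy to cross-check against any tool's output for the same counts and universe.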
Mine is a very empirical & practical opinion: I quite like DAVID, as it is flexible and produces very readable output. The R package RDAVIDWebService allows querying DAVID from R and is very handy (but note the limits DAVID imposes on the number of jobs a single user can submit). DAVID has also been used very extensively, so it is well tried & tested.
In general, I find the results from GO analyses so open to interpretation that I'm less concerned with finding the very best algorithm; I prefer a practical approach. Nevertheless, I should say that for RNA-seq data you might want to control for gene length (see http://genomebiology.com/2010/11/2/R14 and the GOSeq package).
I would suggest you download a specific GO database and use a spreadsheet such as Excel to calculate your own hypergeometric p-values. Some online tools seem to use the 'total GO space' and the GO class (BP, MF, CC) irrespective of the organism, and I suspect some tools, although they separate by organism, mix all three GO classes when computing the p-value. If you control these choices yourself, you know exactly what you are looking at.
I think this thread has sufficient answers already, and in my honest opinion your advice is, well, bad. Using spreadsheets such as MS Excel for analysis is likely to run into problems sooner or later, through manual errors or undesirable conversions. Furthermore, it undermines reproducible research.
I'm not sure why you would think doing this work manually, in a failure-prone way, is better than using commonly used tools.
Well, I do it this way and find it handy. I can see the numbers myself, and I have a way to verify the p-values. Whenever I used AmiGO, it seemed to use the 'whole GO space', which was difficult to verify. I work with plant data (mainly Arabidopsis), so I download the GO database from EMBL/TAIR maybe once every three months, and I have an established process for fitting it into my spreadsheet. The calculations don't take very long, and I would be OK with waiting even if they did, because I am sure of what I am looking at. I can also apply a single-step or two-step FDR correction with full knowledge of how it affects my data. It is often difficult to verify and cross-check the output from online tools.
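For what it's worth, the single-step FDR correction mentioned above (Benjamini-Hochberg) is small enough to verify by hand in any language. A sketch in plain Python, with made-up p-values:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values: sort ascending, scale each
    p by m/rank, then enforce monotonicity from the largest p downwards."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(bh_adjust([0.01, 0.04, 0.03, 0.20]))
```

Checking a handful of values from this against a spreadsheet column (or against R's p.adjust with method="BH") is a quick way to confirm both are doing what you think.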
It is often difficult to verify and cross-check the output from online tools.
I fully agree with that, and the reason why you prefer the manual, full-control approach is justified and very clear; thanks for explaining. However, for (re)usability I would still wrap this approach in a Python or R script (also easier for sharing).
Rereading my statement above, I realize it's overly harsh, and I feel I should apologize for that; you indeed have good reasons to work the way you do.
I for one think this is an awesome idea. This way you have full control and understand exactly what is going on.
As much as we would like to claim otherwise, nobody really knows what is going on inside the deep bowels of GO enrichment tools. I have never managed to reproduce their values myself, and I don't understand what they do; I only know what they claim to do.
As bioinformaticians we need to stop assuming that EXCEL = BAD. The damage is always done by people not understanding what a command line (or any tool, for that matter) does, not by the tool itself.
For any error caused via Excel there are more insidious and stupid behaviors in R, for example. Did you know that when operating on two series of unequal length (vectors, for example), the shorter one will be SILENTLY reused in R? How many people know that? If you accidentally have two objects of unequal length and perform an element-wise addition, the shorter one starts again from its beginning once it runs out. It is hard to fathom how many errors that causes.
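R's recycling rule can be mimicked explicitly in Python with itertools.cycle, which makes the silent-reuse behaviour visible (the vectors here are arbitrary illustration values):

```python
from itertools import cycle

long_vec = [10, 20, 30, 40, 50]
short_vec = [1, 2]

# Element-wise addition the way R's recycling rule does it:
# once the shorter vector is exhausted, it restarts from its beginning.
recycled = [a + b for a, b in zip(long_vec, cycle(short_vec))]
print(recycled)  # [11, 22, 31, 42, 51]
```

In Python this reuse only happens because cycle() is requested explicitly; R applies the same rule implicitly to mismatched vectors, which is exactly the footgun described above.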
I use Excel a lot, and it definitely has a place in science. However, I would argue the room for error is smaller when executing a decent R script than when 'tampering' with spreadsheets. But spreadsheets don't kill data, people kill data ;-)
And vector recycling is usually a handy feature, although when unexpected it can definitely cause a major (silent) headache...
The key word is "decent" R script; the vast majority of R scripts are not, and that is because R itself is not designed to help you write decent scripts. Everything about it encourages quick-and-dirty, interactive-style actions.
IMHO those "handy" features cause damage that we only see reported as irreproducible research; whatever effort they save at the individual level, they cost us scientists far more.
R is an improperly designed language, but it won the data-analysis battle, and it has already grown into a system that simply cannot be replaced by a well-designed language. As our needs and complexity grow, it starts to weigh more and more heavily.
Exactly, and because tampering with data formats interactively in R is as risky as (if not riskier than) Excel manipulations, I prefer to do my work in R via 'custom' Rscript command-line tools. But unpredictable data types in R drive me crazy when writing those Rscripts...
I should spend more time with Julia, if only I had the time.
I am working with these spreadsheets (separate ones for the three GO classes, plus two for pathways) at the moment to improve the analysis pipeline, but I only use simple functions like countif, index-match, lookup, hypergeo, rank, etc. I don't need to be a core programmer for this. Once I am done I will share.
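Those spreadsheet primitives map onto a few lines of standard-library Python, should the workflow ever need scripting. A sketch with made-up annotation rows (the gene IDs and GO terms are purely illustrative):

```python
from collections import Counter

# Hypothetical gene -> GO-term annotation rows, as exported from a GO database
annotations = [
    ("AT1G01010", "GO:0006355"),
    ("AT1G01020", "GO:0016020"),
    ("AT1G01030", "GO:0006355"),
]

# COUNTIF equivalent: how many genes carry each term
term_counts = Counter(term for _, term in annotations)
print(term_counts["GO:0006355"])  # 2

# LOOKUP / INDEX-MATCH equivalent: first term annotated to a given gene
lookup = {}
for gene, term in annotations:
    lookup.setdefault(gene, term)
print(lookup["AT1G01020"])  # GO:0016020
```

The same counts then feed straight into a hypergeometric test and an FDR correction, keeping the full-control property of the spreadsheet approach while making the pipeline scriptable.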
Here is another one to make your life even harder: ermineJ
Publication: http://www.nature.com/nprot/journal/v5/n6/full/nprot.2010.78.html
I've come to really like ermineJ