I am fairly certain there is no such standard, but I'm also fairly certain some other people must have thought about this.
One advantage of a standard format is that it would simplify the running of multiple enrichment tools in parallel and comparing or combining results. This is particularly useful to us within the GO consortium, as we would like to compare analyses between newer/older versions of the ontology and annotations. A more ambitious aim is for publications that include GO enrichment results to provide these in a standard format, to simplify replicating results.
Note that it would not be necessary for all tools to be conformant in order for the standard to be successful. Converters could be provided to rewrite the ad-hoc output of heterogeneous tools to the standard form. However, it would help to have buy-in from some of the more popular tools.
I have listed some desiderata for such a standard:
- An abstract specification with different serializations for different purposes (tabular, JSON, XML, RDF)
- Extensibility
- Use of ontology terms in place of free text to describe algorithms, parameters and data processing (for example, the Ontology for Biomedical Investigations (OBI) has a rich collection of these)
Minimal information:
- Tool name + algorithm + version
- Input token list + token type (e.g. symbol)
- Background token list + token type (if provided)
- Token-gene ID mapping (plus unmatched tokens)
- Algorithm parameters (cut-offs, algorithm selected, etc)
- Ontology ID + version
- Gene association set ID / species + version
- List of results; for each result:
  - term ID
  - optional term metadata
  - list of gene IDs (+ optional gene metadata)
  - scoring metadata (p-vals, rank, etc)
Optional information:
- Unique identifier/URI for the results
- Metadata on input token set (e.g. "genes up-regulated in diabetes")
- graphical output
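To make the lists above more concrete, here is a rough sketch of how the abstract model might look in Python, with JSON and tabular serializations derived from the same object. All class names, field names and example values (`EnrichmentReport`, `TermResult`, `example-tool`, the `EX:` gene IDs, the scores) are my own placeholders, not part of any existing tool or standard, and the numbers are illustrative rather than real results.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class TermResult:
    """One enriched term (hypothetical field names)."""
    term_id: str
    gene_ids: List[str]
    scores: Dict[str, float]          # scoring metadata: p-values, rank, etc.
    term_label: Optional[str] = None  # optional term metadata


@dataclass
class EnrichmentReport:
    """Hypothetical container for the minimal and optional information."""
    # Minimal information
    tool_name: str
    tool_version: str
    algorithm: str
    parameters: Dict[str, object]
    ontology_id: str
    ontology_version: str
    association_set_id: str
    association_set_version: str
    input_tokens: List[str]
    input_token_type: str
    token_to_gene: Dict[str, str]
    unmatched_tokens: List[str] = field(default_factory=list)
    background_tokens: Optional[List[str]] = None
    background_token_type: Optional[str] = None
    results: List[TermResult] = field(default_factory=list)
    # Optional information
    result_uri: Optional[str] = None
    input_description: Optional[str] = None


def to_json(report: EnrichmentReport) -> str:
    """JSON serialization of the whole report, including nested results."""
    return json.dumps(asdict(report), indent=2, sort_keys=True)


def to_tsv(report: EnrichmentReport) -> str:
    """Tabular serialization of the results only, one row per term."""
    # One column per score name that appears in any result.
    score_keys = sorted({key for r in report.results for key in r.scores})
    rows = ["\t".join(["term_id", "term_label", "gene_ids"] + score_keys)]
    for r in report.results:
        cells = [r.term_id, r.term_label or "", ",".join(r.gene_ids)]
        cells += [str(r.scores.get(key, "")) for key in score_keys]
        rows.append("\t".join(cells))
    return "\n".join(rows)


# Illustrative values only; none of this comes from a real analysis.
report = EnrichmentReport(
    tool_name="example-tool",
    tool_version="0.1",
    algorithm="hypergeometric test",
    parameters={"p_value_cutoff": 0.05},
    ontology_id="GO",
    ontology_version="example-release",
    association_set_id="example-annotations",
    association_set_version="example-release",
    input_tokens=["tokA", "tokB", "tokC"],
    input_token_type="symbol",
    token_to_gene={"tokA": "EX:g1", "tokB": "EX:g2"},
    unmatched_tokens=["tokC"],
    results=[
        TermResult("GO:0006006", ["EX:g1", "EX:g2"], {"p_value": 0.001, "rank": 1.0}),
    ],
    input_description="genes up-regulated in diabetes",
)

print(to_json(report))
print(to_tsv(report))
```

Note that the tabular form only carries the per-term results, so the provenance (tool, ontology and annotation versions, parameters) would need a header block or a companion file. That is one reason I think an abstract specification with several serializations makes more sense than fixing a single file format.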
Is this of general interest? If so, does the above sound like a good start, and what would be an appropriate forum for future discussions? Is there an existing tool whose output might be a good candidate for standardization?
Good point and interesting paper. Yes, my list is biased towards simple gene lists. I think we would probably want a fairly generic core and extensions for GSEA, genomic intervals, etc.
Interesting topic, and there is clearly a need for this. Another piece of metadata worth capturing is whether the analysis is done at the gene list or genomic interval level and, if the latter, whether any corrections for genomic structure are applied, e.g. http://www.ncbi.nlm.nih.gov/pubmed/16504139