There are a few different questions here, so:
How do I go about creating the test dataset?
One approach to create a test data set is to use the annotation corpus from a given time/date (t0), and see whether the tool predicts the annotations that have accrued between t0 and a later time/date t1.
If you are trying to obtain older annotations to create this t0 corpus, there is no need to go into any XML; the standard format for annotations is GAF, or the companion files GPAD/GPI. The GAF is slightly easier to work with, as it is human readable and does not need to be paired with a second file. You can find the releases from 2004-03-01 to current at http://release.geneontology.org. If you let us know your organism and/or area of interest, we may be able to give you more suggestions. You are welcome to contact us if you don't wish to share this information publicly.
Also, propagating to a top GO term is not recommended. The highest GO terms are "root terms": 'GO:0008150 biological_process', 'GO:0003674 molecular_function', and 'GO:0005575 cellular_component'. Direct annotations to these root terms indicate that no appropriate child term could be assigned, so these terms are intentionally uninformative. You can use a "GO slim" (also known as a "subset"; see our FAQ) to find higher-level terms that are still informative. Tools like https://go.princeton.edu/GOTermMapper/GOTermMapper_help.shtml and https://www.yeastgenome.org/goSlimMapper (either will work for most organisms) will do this mapping for you if you supply a gene list.
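If you would rather script the slim mapping yourself, the goatools Python package provides a mapslim function for this. A minimal sketch, assuming you have downloaded go-basic.obo and goslim_generic.obo from the GO website:

```python
# Minimal sketch, assuming the goatools package: map a direct GO term
# to its GO-slim ancestors instead of the uninformative root terms.
# Requires go-basic.obo and goslim_generic.obo from http://geneontology.org.
from goatools.obo_parser import GODag
from goatools.mapslim import mapslim

go_dag = GODag("go-basic.obo")            # full ontology
goslim_dag = GODag("goslim_generic.obo")  # generic GO slim

# Example term 'GO:0006282' (regulation of DNA repair); mapslim returns
# the slim terms it maps to most directly, plus all slim ancestors.
direct_slim, all_slim = mapslim("GO:0006282", go_dag, goslim_dag)
print(direct_slim)
```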
Is there a standard in the field?
There is no community-accepted standard test set, if that is what you are looking for, but the GO annotations are made to high standards and should be assumed to be ground truth.
In the original CAFA experiment, some experimental annotations were newly created that the developers of prediction tools had no access to, and their predictions were tested against these annotations. The issue with this is that CAFA fails to take into account the Open World Assumption that is built into GO annotations (PMID: 24138813). CAFA assumes that if a predicted annotation did not materialize, then the prediction was wrong; in reality, the absence of an annotation is not evidence of the absence of that function. The second paper briefly mentions that NOT annotations (annotations with the qualifier NOT, which are meant to counter erroneous annotations) can be used to evaluate absence of function, but with the caveat that there are very few of these NOT annotations in comparison to all the functions that proteins do not have.
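To make the open-world point concrete, here is a toy sketch (all identifiers and data are invented for illustration) contrasting closed-world scoring, where any prediction without a matching annotation counts as a false positive, with open-world scoring, where only a NOT annotation can contradict a prediction:

```python
# Toy illustration of closed-world vs. open-world scoring of a predicted
# (protein, GO term) pair. All data here are invented for illustration.
known_positive = {("P12345", "GO:0006281")}   # experimental annotations
known_negative = {("P12345", "GO:0005634")}   # NOT-qualified annotations

def score(prediction):
    """Return 'TP', 'FP', or 'unknown' under the open-world assumption."""
    if prediction in known_positive:
        return "TP"
    if prediction in known_negative:
        return "FP"   # explicitly contradicted by a NOT annotation
    # Closed-world scoring (as in CAFA) would call this a false positive;
    # open-world scoring leaves it undetermined, because the absence of
    # an annotation is not evidence of absence of the function.
    return "unknown"

print(score(("P12345", "GO:0008152")))   # -> 'unknown'
```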
GO terms from UniProt vs GOA GO terms?
The data shown in UniProt is derived from the complete annotation set in GOA, filtered as described here: https://www.uniprot.org/help/complete_go_annotation
In terms of ground truth, these two sets of annotations should be equivalent.
Which one is better to use for ground truth, and why?
GO has high standards and expects both manual and computational annotations to be accurate and precise. Therefore, as mentioned above, all of the annotations are considered to be the "ground truth".
If the question is about including IEAs: evidence codes only describe the evidence each statement is based on and are not statements of the quality of the annotation. Within each evidence code classification, some methods produce annotations of higher confidence or greater specificity than others; in addition, the way in which a technique has been applied or interpreted in a paper will also affect the quality of the resulting annotation. Thus, evidence codes cannot be used as a measure of annotation quality.
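That said, it is easy to inspect how your annotation set breaks down by evidence code (column 7 of a GAF file), for example to see what fraction is IEA. A minimal sketch, with a hypothetical file name:

```python
# Minimal sketch: tally GAF annotations by evidence code (GAF column 7).
# The file name is hypothetical; use any GAF release. Note the tally
# describes evidence types, not annotation quality.
from collections import Counter

counts = Counter()
with open("goa_human.gaf") as fh:   # hypothetical file name
    for line in fh:
        if line.startswith("!"):    # skip header/comment lines
            continue
        evidence_code = line.split("\t")[6]
        counts[evidence_code] += 1

for code, n in counts.most_common():
    print(f"{code}\t{n}")
```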
See also: