There are a few different questions here, so:
How do I go about creating the test dataset?
One approach to create a test data set is to use the annotation corpus from a given time/date (t0), and see whether the tool predicts the annotations that have accrued between t0 and a later time/date t1.
If you are trying to obtain older annotations to create this t0 corpus, there is no need to go into any XML; the standard format for annotations is GAF, or the companion files GPAD/GPI. The GAF is slightly easier to work with, as it is human readable and does not need to be paired with a second file. You can find the releases from 2004-03-01 to current at http://release.geneontology.org. If you let us know your organism and/or area of interest, we may be able to give you more suggestions. You are welcome to contact us if you don't wish to share this information publicly.
Also, propagating to a top GO term is not recommended. The highest GO terms are "root terms": 'GO:0008150 biological_process', 'GO:0003674 molecular_function', and 'GO:0005575 cellular_component'. Direct annotations to these root terms indicate that no appropriate child term could be assigned, so these terms are intentionally uninformative. You can use a "GO slim" (also known as a "subset"; see our FAQ) to find higher-level terms that are still informative. Tools like https://go.princeton.edu/GOTermMapper/GOTermMapper_help.shtml and https://www.yeastgenome.org/goSlimMapper (either will work for most organisms) will do this mapping for you if you supply a gene list.
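If you would rather script the slim mapping yourself, the goatools Python package provides a mapslim function for this. A minimal sketch, assuming you have downloaded go-basic.obo and goslim_generic.obo from the GO website:

```python
# Minimal sketch, assuming the goatools package: map a direct GO term
# to its GO-slim ancestors instead of the uninformative root terms.
# Requires go-basic.obo and goslim_generic.obo from http://geneontology.org.
from goatools.obo_parser import GODag
from goatools.mapslim import mapslim

go_dag = GODag("go-basic.obo")            # full ontology
goslim_dag = GODag("goslim_generic.obo")  # generic GO slim

# Example term 'GO:0006282' (regulation of DNA repair); mapslim returns
# the slim terms it maps to most directly, plus all slim ancestors.
direct_slim, all_slim = mapslim("GO:0006282", go_dag, goslim_dag)
print(direct_slim)
```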
Is there a standard in the field?
There is no community-accepted standard test set, if that is what you are looking for, but the GO annotations are made to high standards and should be assumed to be ground truth.
In the original CAFA experiment, some experimental annotations were newly created that the developers of prediction tools had no access to, and their predictions were tested against these annotations. The issue with this is that CAFA fails to take into account the Open World Assumption that is built into GO annotations (PMID: 24138813). CAFA assumes that if a predicted annotation did not materialize, then the prediction was wrong; in reality, the absence of an annotation is not evidence of the absence of that function. The second paper briefly mentions that NOT annotations (annotations with the qualifier NOT, which are meant to counter erroneous annotations) can be used to evaluate absence of function, but with the caveat that there are very few of these NOT annotations in comparison to all the functions that proteins do not have.
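To make the open-world point concrete, here is a toy sketch (all identifiers and data are invented for illustration) contrasting closed-world scoring, where any prediction without a matching annotation counts as a false positive, with open-world scoring, where only a NOT annotation can contradict a prediction:

```python
# Toy illustration of closed-world vs. open-world scoring of a predicted
# (protein, GO term) pair. All data here are invented for illustration.
known_positive = {("P12345", "GO:0006281")}   # experimental annotations
known_negative = {("P12345", "GO:0005634")}   # NOT-qualified annotations

def score(prediction):
    """Return 'TP', 'FP', or 'unknown' under the open-world assumption."""
    if prediction in known_positive:
        return "TP"
    if prediction in known_negative:
        return "FP"   # explicitly contradicted by a NOT annotation
    # Closed-world scoring (as in CAFA) would call this a false positive;
    # open-world scoring leaves it undetermined, because the absence of
    # an annotation is not evidence of absence of the function.
    return "unknown"

print(score(("P12345", "GO:0008152")))   # -> 'unknown'
```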
GO terms from UniProt vs GOA GO terms?
The data shown in UniProt is derived from the complete annotation set in GOA, filtered as described here: https://www.uniprot.org/help/complete_go_annotation
In terms of ground truth, these two sets of annotations should be equivalent.
Which one is better to use for ground truth, and why?
GO has high standards and expects both manual and computational annotations to be accurate and precise. Therefore, as mentioned above, all of the annotations are considered to be the "ground truth".
If the question is about including IEAs: evidence codes only describe the evidence each statement is based on and are not statements of the quality of the annotation. Within each evidence code classification, some methods produce annotations of higher confidence or greater specificity than others; in addition, the way in which a technique has been applied or interpreted in a paper will also affect the quality of the resulting annotation. Thus, evidence codes cannot be used as a measure of annotation quality.
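That said, it is easy to inspect how your annotation set breaks down by evidence code (column 7 of a GAF file), for example to see what fraction is IEA. A minimal sketch, with a hypothetical file name:

```python
# Minimal sketch: tally GAF annotations by evidence code (GAF column 7).
# The file name is hypothetical; use any GAF release. Note the tally
# describes evidence types, not annotation quality.
from collections import Counter

counts = Counter()
with open("goa_human.gaf") as fh:   # hypothetical file name
    for line in fh:
        if line.startswith("!"):    # skip header/comment lines
            continue
        evidence_code = line.split("\t")[6]
        counts[evidence_code] += 1

for code, n in counts.most_common():
    print(f"{code}\t{n}")
```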
See also: