Genome Annotation Quality Measure
14.0 years ago
Darked89 4.7k

I have the output of several gene prediction programs (using the term loosely):

  • de novo predictors (Augustus, GlimmerHMM, Geneid, SNAP, Genscan)
  • RNA-Seq mapped with Tophat and Cufflinks
  • EST sets mapped with PASA (same species)/GMAP (2M+ plant ESTs)
  • 10 protein sets mapped with exonerate

I also got:

  • a semi-curated set of 1000 proteins (= non-chimeric, non-truncated, with correct size and similarity to other plant proteins, but exon borders may at times be wrong or small introns retained), ca. 700 of them unique at the 50% protein similarity level (uclust)

  • 400+ CEGMA predictions based on HMM profiles of conserved set of genes

So far Augustus with RNA-Seq evidence support is way ahead at predicting sensible genes. I have been comparing the numbers of "exons" shared between these sets, and I am puzzled by the large number of exons unique to almost every method used. While this would be normal for de novo predictors, I was hoping that the homology-based methods (i.e. exonerate protein to genome, GMAP and Cufflinks) would overlap far more. I am going to work on improving the individual programs' results where possible (retraining, better filtering of ESTs/proteins, etc.).
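
A minimal sketch of the exon-sharing count described above, assuming each method's exons have already been parsed into `(seqid, start, end, strand)` tuples and that "shared" means identical coordinates. The function name and the toy data are hypothetical, not taken from any of the tools listed.

```python
from collections import Counter


def unique_exon_counts(exon_sets):
    """Count, per method, the exons that no other method predicts.

    exon_sets maps a method name to an iterable of
    (seqid, start, end, strand) tuples. Exact coordinate identity
    is assumed to define a shared exon.
    """
    # Number of methods supporting each exon.
    support = Counter()
    for exons in exon_sets.values():
        support.update(set(exons))

    return {
        method: sum(1 for exon in set(exons) if support[exon] == 1)
        for method, exons in exon_sets.items()
    }


# Toy data, invented for illustration only.
a = ("chr1", 100, 200, "+")
b = ("chr1", 300, 400, "+")
c = ("chr1", 500, 600, "+")
d = ("chr1", 700, 800, "-")
toy_sets = {
    "augustus": {a, b, c},
    "exonerate": {a, d},
    "cufflinks": {a, b},
}
counts = unique_exon_counts(toy_sets)
```

Requiring identical coordinates is strict: an exon whose boundary is off by a single base counts as unique, which may explain part of the low overlap between evidence-based methods.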

I am looking for some genome-wide measure that tells me how well I am doing, be it for an individual gene prediction program or some prediction combiner, compared to, say, Arabidopsis and two or three other recently annotated plant genomes. Any ideas?

genome gene • 6.4k views
14.0 years ago

Annotation Edit Distance devised by Eilbeck et al. might suit your needs, or be a place from which to start. From the paper: "AED is similar to performance measures employed by the gene-prediction community, but takes into account aspects of annotations not well addressed by conventional sensitivity/specificity measures such as alternative splicing."
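
As I understand the paper, AED is one minus the mean of nucleotide-level sensitivity and specificity between two annotations of the same locus. Here is a rough sketch under that reading, assuming each annotation is a list of 1-based inclusive `(start, end)` exon intervals on the same sequence and strand. The function names are mine, not from the paper or any tool.

```python
def covered_bases(exons):
    """Return the set of positions covered by (start, end) inclusive exons."""
    bases = set()
    for start, end in exons:
        bases.update(range(start, end + 1))
    return bases


def aed(annotation_a, annotation_b):
    """Nucleotide-level distance between two annotations.

    0.0 means the two annotations cover exactly the same bases,
    1.0 means they share no bases at all (or one of them is empty).
    """
    a = covered_bases(annotation_a)
    b = covered_bases(annotation_b)
    if not a or not b:
        return 1.0
    shared = len(a & b)
    sensitivity = shared / len(b)  # fraction of b recovered by a
    specificity = shared / len(a)  # fraction of a supported by b
    return 1.0 - (sensitivity + specificity) / 2.0
```

Storing every base in a set is fine for single gene models but wasteful genome-wide; interval arithmetic would be the obvious replacement there.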


Great link. I've never seen this paper. I'll need to read it in detail. Probably has some applicability to what I'm currently working on!


...However, AED also looks at individual annotations rather than giving a global measure, which I think is what is being asked here.


Well, a global measure is a matter of aggregating the individual measurements. The paper plots cumulative AED for some genome releases over time. Or one might restrict the calculation to a subset of particularly important features for that organism, YMMV.
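
One way to do that aggregation, as a sketch: given one AED-like value per gene model, report the fraction of models at or below each threshold, which is what a cumulative plot shows. The function name, thresholds and values below are made up for illustration.

```python
def cumulative_fraction(aed_values, thresholds):
    """Fraction of annotations whose value is <= each threshold."""
    if not aed_values:
        raise ValueError("need at least one per-annotation value")
    n = len(aed_values)
    return {
        t: sum(1 for v in aed_values if v <= t) / n
        for t in thresholds
    }


# Invented per-gene values, for illustration only.
toy_values = [0.0, 0.2, 0.5, 1.0]
summary = cumulative_fraction(toy_values, [0.25, 0.5, 1.0])
```

A curve that rises faster at low thresholds means more gene models agree closely with the evidence they are compared against, so two annotation sets can be compared by their curves rather than by a single number.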


Thanks a lot, I will need some time to digest it.

14.0 years ago

There was a thread that talked about this a while back with regard to individual gene models... indeed, you responded to it! (How to compare gene models) So if I understand correctly, you now want to know how to get a higher-level view rather than a per-gene-model comparison?

I spent a bit of time recently looking for software to do this and found little. Consequently, I've been working on a Perl application to compare two sets of annotations. One set is treated as a reference, the other is treated as predictions, and it compares exon structure and coding nucleotide agreement.

It's not ready for prime time yet (there are a few small bugs and it still doesn't handle alternative splicing very well), but I've used it to do some comparisons and it has been very helpful. By default it provides a separate comparison for each gene model, but I should be able to force it to compare the whole sequence all at once (alternative splicing might complicate that, but I may be able to get something to work).
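
To give an idea of the reference-versus-prediction comparison at the exon level, here is a small illustration in Python. It is not the Perl application itself, just a sketch that assumes exons are `(seqid, start, end, strand)` tuples and that only exact matches count. "Specificity" is used in the gene-prediction sense (fraction of predicted exons that are correct), and all names are hypothetical.

```python
def exon_level_stats(reference_exons, predicted_exons):
    """Exact-match exon sensitivity and specificity.

    sensitivity = matching exons / reference exons
    specificity = matching exons / predicted exons
    """
    ref = set(reference_exons)
    pred = set(predicted_exons)
    matching = len(ref & pred)
    return {
        "matching": matching,
        "sensitivity": matching / len(ref) if ref else 0.0,
        "specificity": matching / len(pred) if pred else 0.0,
    }


# Invented coordinates, for illustration only.
toy_reference = [("chr1", 100, 200, "+"), ("chr1", 300, 400, "+")]
toy_predicted = [
    ("chr1", 100, 200, "+"),
    ("chr1", 310, 400, "+"),  # boundary off by 10 bp, so no match
    ("chr1", 500, 600, "+"),
    ("chr1", 700, 800, "-"),
]
stats = exon_level_stats(toy_reference, toy_predicted)
```

Pooling all exons from the whole sequence into the two sets gives a single whole-sequence figure, while calling the function once per locus gives the per-gene-model view.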

Let me know if you would like to talk details.


Re GFF comparison tools: in a less hectic moment I am going to list everything I have found (with some comments) on one page. It seems that some people use eval from Michael R. Brent's lab: http://mblab.wustl.edu/software/eval/
