I have the output of several gene prediction programs (using the term loosely):
- de novo predictors (Augustus, GlimmerHMM, Geneid, SNAP, Genscan)
- RNA-Seq mapped with Tophat and Cufflinks
- EST sets mapped with PASA (same species)/GMAP (2M+ plant ESTs)
- 10 protein sets mapped with exonerate
I also have:
- a semi-curated set of 1000 proteins (non-chimeric, non-truncated, with correct size and similarity to other plant proteins, though exon borders may occasionally be wrong or small introns retained), ca. 700 of them unique at the 50% protein similarity level (uclust)
- 400+ CEGMA predictions based on HMM profiles of a conserved set of genes
So far, Augustus with RNA-Seq evidence support is well ahead at predicting sensible genes. I have been comparing the numbers of "exons" shared between these sets, and I am puzzled by the large number of exons unique to almost every method used. While this would be normal for de novo predictors, I was hoping that the homology-based methods (i.e. exonerate protein-to-genome, GMAP and Cufflinks) would overlap much more. I am going to work on improving the individual programs' results where possible (retraining, better filtering of ESTs/proteins, etc.).
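To make the exon comparison concrete, here is a minimal Python sketch of counting exons unique to each method and the pairwise overlap between methods. It assumes exons are compared by exact coordinates (sequence, start, end, strand), which is only a guess at how the comparison was done. The function names and the toy coordinates are hypothetical, not taken from any of the tools listed.

```python
from itertools import combinations


def unique_exons(sets):
    """For each method, return the exons that no other method reports."""
    unique = {}
    for name, exons in sets.items():
        # Union of the exons reported by every other method.
        others = set().union(*(e for n, e in sets.items() if n != name))
        unique[name] = exons - others
    return unique


def pairwise_jaccard(sets):
    """Jaccard index (shared / union) for every pair of methods."""
    scores = {}
    for a, b in combinations(sorted(sets), 2):
        union = sets[a] | sets[b]
        scores[(a, b)] = len(sets[a] & sets[b]) / len(union) if union else 0.0
    return scores


# Toy data: exons as (seqid, start, end, strand); values are made up.
predictions = {
    "augustus": {("chr1", 100, 200, "+"), ("chr1", 300, 400, "+"),
                 ("chr1", 500, 650, "+")},
    "exonerate": {("chr1", 100, 200, "+"), ("chr1", 300, 410, "+")},
    "cufflinks": {("chr1", 100, 200, "+"), ("chr1", 300, 400, "+"),
                  ("chr1", 900, 990, "-")},
}

for method, exons in unique_exons(predictions).items():
    print(method, len(exons), "unique exon(s)")
for pair, score in pairwise_jaccard(predictions).items():
    print(pair, round(score, 2))
```

Note that in the toy data the exonerate exon ending at 410 counts as unique even though it nearly matches the one ending at 400. With exact matching, small boundary disagreements are counted as unique exons, so it may be worth checking how much of the apparent disagreement is of that kind.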
I am looking for some genome-wide measure that tells me how well I am doing, be it for an individual gene prediction program or for some prediction combiner, compared with, say, Arabidopsis and two or three other recently annotated plant genomes. Any ideas?
Great link. I've never seen this paper. I'll need to read it in detail. Probably has some applicability to what I'm currently working on!
...However, AED also looks at individual annotations rather than giving a global measure, which I think is what is being asked here.
Well, a global measure is a matter of aggregating the individual measurements. The paper plots cumulative AED for some genome releases over time. Or one might restrict the calculation to a subset of particularly important features for that organism, YMMV.
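As a rough illustration of that aggregation, here is a minimal Python sketch. It assumes the usual definition of AED as 1 minus the mean of nucleotide-level sensitivity and specificity between an annotation and its supporting evidence, and it assumes 1-based inclusive exon coordinates as in GFF. The function names and example values are hypothetical, and a real pipeline would take the evidence from actual alignments.

```python
def bases(exons):
    """Set of genomic positions covered by (start, end) exons, 1-based inclusive."""
    covered = set()
    for start, end in exons:
        covered.update(range(start, end + 1))
    return covered


def aed(annotation_exons, evidence_exons):
    """Annotation edit distance: 0 = perfect agreement, 1 = no shared bases."""
    ann = bases(annotation_exons)
    evi = bases(evidence_exons)
    if not ann or not evi:
        return 1.0  # assumption: no annotation or no evidence scores as worst case
    shared = len(ann & evi)
    sensitivity = shared / len(evi)  # fraction of evidence bases annotated
    specificity = shared / len(ann)  # fraction of annotated bases supported
    return 1.0 - (sensitivity + specificity) / 2.0


def cumulative_fraction(aed_values, thresholds):
    """Fraction of annotations with AED <= each threshold (the global summary)."""
    n = len(aed_values)
    return [(t, sum(v <= t for v in aed_values) / n) for t in thresholds]


# Toy genome with three gene models; coordinates are made up.
genes = [
    ([(1, 100)], [(1, 100)]),                 # full agreement
    ([(1, 100)], [(51, 150)]),                # half overlap
    ([(1, 100), (201, 300)], [(401, 500)]),   # no overlap
]
scores = [aed(annotation, evidence) for annotation, evidence in genes]
print(scores)
print(cumulative_fraction(scores, [0.0, 0.25, 0.5, 0.75, 1.0]))
```

Plotting the cumulative fractions for each predictor or combiner on the same axes gives one curve per method, and a curve that rises faster means more gene models agree closely with their evidence.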
Thanks a lot, I will need some time to digest it.