In my recent research, I tried to use 'EVAL' to compute statistics for a genome annotation file produced by a cufflinks -> TransDecoder pipeline. However, I found this package disappointing.
My high expectations:
The Eval package has very detailed documentation and promises to generate statistics on transcript length, CDS length, UTR length, and genome coverage. My goal was to evaluate my new assembly, especially the increase in 3' UTR length. After reading the instructions for this tool, I thought I had found a great solution.
My disappointments:
The Eval package only supports the GTF2 format and requires start_codon/stop_codon entries:
This is reasonable. So I used gffread -T to convert the GFF3 file I got from TransDecoder, then spent half a day writing scripts to add start and stop codons to the GTF file. I thought I was close to being rewarded with a result.
Eval generates wrong 3' UTR length information:
Using the GTF2 file I generated, I got very complete 3' UTR statistics. I was excited... until I checked the result. Here is a list of the annoying facts:
a. It generates 3' UTR information for sequences without a 3' UTR.
b. It warns about an overlapping exon region between the stop codon and the 3' UTR.
c. The transcript length reported by EVAL includes intron length, which makes the result completely useless for me.
d. *validate_gtf.pl -f* is supposed to fix formatting issues in any GTF2 file and infer UTR lines from the 'Exon' and 'CDS' information. However, it also adds UTR information to GTF2 files that provide no CDS information.
My painful lesson learned:
The EVAL package has serious bugs, even though the documentation looks so detailed and nice.
(#EVAL Package: Keibler, E., & Brent, M. R. (2003). Eval: a software package for analysis of genome annotations. BMC Bioinformatics, 4(1), 50. doi:10.1186/1471-2105-4-50)
I don't find it very surprising that 13-year-old software does not support GFF3. I think the format was defined after that, although it is a little hard to find a source on when exactly it was conceived. Most of the statistics you mention could potentially be trivial to implement using today's Bio* parsers, or Bioconductor. If you think it is still relevant for your project to calculate statistics for a genome annotation, you could provide a list of requirements, and we could check them against existing libraries and questions on Biostars.
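To illustrate how little code some of these statistics need, here is a minimal sketch in plain Python (no Bio* library, only the standard library). It is not a replacement for EVAL and has not been checked against it. The function name `transcript_stats` and the output keys are made up for this example. It assumes a GTF file with `exon` and `CDS` lines that carry a `transcript_id` attribute, as gffread -T should produce. Transcript length is the sum of exon lengths, so introns are excluded, and the 3' UTR is taken as the exonic bases downstream of the last CDS base. If the CDS lines exclude the stop codon (the GTF2 convention), those three bases end up counted in the 3' UTR. Transcripts without CDS lines get no 3' UTR value at all.

```python
import re
from collections import defaultdict

# Pulls the transcript_id out of the GTF attribute column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_stats(gtf_lines):
    """Return spliced, CDS and 3' UTR lengths per transcript.

    Hypothetical helper for illustration; coordinates are treated as
    1-based and inclusive, as in GTF.
    """
    exons = defaultdict(list)
    cds = defaultdict(list)
    strand = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] not in ("exon", "CDS"):
            continue
        match = TRANSCRIPT_ID.search(cols[8])
        if match is None:
            continue
        tid = match.group(1)
        interval = (int(cols[3]), int(cols[4]))
        (exons if cols[2] == "exon" else cds)[tid].append(interval)
        strand[tid] = cols[6]

    stats = {}
    for tid, tx_exons in exons.items():
        # Sum of exon lengths only, so introns never count.
        spliced = sum(end - start + 1 for start, end in tx_exons)
        coding = cds.get(tid, [])
        if not coding:
            # No CDS: do not invent a UTR for a non-coding transcript.
            stats[tid] = {"spliced_length": spliced,
                          "cds_length": 0,
                          "utr3_length": None}
            continue
        cds_length = sum(end - start + 1 for start, end in coding)
        cds_low = min(start for start, _ in coding)
        cds_high = max(end for _, end in coding)
        if strand[tid] == "-":
            # Minus strand: the 3' end lies at lower coordinates.
            utr3 = sum(max(0, min(end, cds_low - 1) - start + 1)
                       for start, end in tx_exons)
        else:
            utr3 = sum(max(0, end - max(start, cds_high + 1) + 1)
                       for start, end in tx_exons)
        stats[tid] = {"spliced_length": spliced,
                      "cds_length": cds_length,
                      "utr3_length": utr3}
    return stats
```

Because the function only iterates over lines, an open file handle can be passed directly, for example `transcript_stats(open("annotation.gtf"))`, where `annotation.gtf` is a placeholder file name. Comparing the `utr3_length` values between the old and new assembly would then be a matter of running it on both files.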
Useful! ...so that I don't fall into wasting time on it.