Hi, does anyone know of any software that can (more or less) easily do the following two tasks, ideally something that gets updated from time to time:
1) model splice sites from ESTs aligned to an assembled genome, instead of assuming the usual GT-AG, and...
2) use a set of protein queries to predict full-length genes in the genome as accurately as possible.
I am looking at published genomes, so there are usually ESTs and scaffolds available.
A putative gene that covers only 60% of the aligned length is not a gene, but rather a waste of my time. If animals, plants, and fungi all have a protein of 400 aa and I don't find one of similar length in a new genome, it seems more likely to me that an exon is missing due to bad gene prediction than that a domain was moved, especially for a very conserved protein. In particular, if I can look at a multiple alignment and immediately tell that the N-terminus is missing, I can be sure that the program didn't work hard enough to find that missing exon.
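To make the length criterion concrete, here is a rough sketch in Python of the kind of check I mean. The function names, the 0.9 cutoff, and the example lengths are all my own assumptions for illustration; they don't come from any existing tool.

```python
def coverage(predicted_len, reference_len):
    """Fraction of the reference protein length covered by the prediction."""
    if reference_len <= 0:
        raise ValueError("reference_len must be positive")
    return predicted_len / reference_len


def flag_incomplete(predictions, min_coverage=0.9):
    """Return (gene_id, coverage) pairs for predictions below the cutoff.

    `predictions` maps a gene id to (predicted_length, homolog_length),
    both in amino acids. The 0.9 default is an arbitrary assumption.
    """
    flagged = []
    for gene_id, (pred_len, ref_len) in predictions.items():
        cov = coverage(pred_len, ref_len)
        if cov < min_coverage:
            flagged.append((gene_id, round(cov, 2)))
    return flagged


# Hypothetical example: a conserved 400 aa protein and two predictions.
example = {
    "geneA": (240, 400),  # 60% of the homolog length, probably missing an exon
    "geneB": (395, 400),  # close to full length
}
print(flag_incomplete(example))  # [('geneA', 0.6)]
```

A length ratio like this only tells me that something is missing, not where; for that I would still look at the multiple alignment.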
I was hoping that this wouldn't be necessary, but a number of published genomes relied too heavily on ab initio gene prediction and consequently ended up with a lot of incomplete genes or, in some cases, total nonsense. To protect those involved I won't name names, but I think that after the Nature papers were published, the genomes were basically forgotten and left in an unusable 'draft' state.