Question

Repairing Old Genomes With Homology Based Gene Prediction


12.1 years ago

Wrf ▴ 210

Hi, does anyone know of software that can (more or less) easily do the following two tasks, ideally something that gets updated from time to time:

1) model splice sites from ESTs aligned to an assembled genome, instead of assuming the usual AG-GT, and...

2) use a set of protein queries to predict full-length genes in the genome.

I am looking at published genomes, so ESTs and scaffolds are usually available.
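As a rough illustration of task 1, here is a minimal Python sketch that tallies the dinucleotides at the ends of introns, assuming the intron coordinates have already been extracted from EST-to-genome alignments by some other tool. The function name, the coordinate convention (0-based, end-exclusive, forward strand only), and the toy sequences are all assumptions made for the example, not part of any existing program.

```python
from collections import Counter

def splice_site_counts(genome, introns):
    """Tally (donor, acceptor) dinucleotide pairs for forward-strand introns.

    genome  : dict mapping scaffold name -> sequence string
    introns : iterable of (scaffold, start, end), 0-based, end-exclusive
              (hypothetical format; reverse-strand introns are not handled)
    """
    counts = Counter()
    for scaffold, start, end in introns:
        seq = genome[scaffold][start:end].upper()
        if len(seq) < 4:
            continue  # too short to hold both dinucleotides
        counts[(seq[:2], seq[-2:])] += 1
    return counts

# Toy scaffold built from exon/intron pieces so the coordinates are explicit.
exon1, intron1 = "ATGGCC", "GTAAGTTTTCAG"
exon2, intron2 = "GGATTT", "GCAAGTTTCCAG"
exon3 = "CCCTGA"
genome = {"scaffold_1": exon1 + intron1 + exon2 + intron2 + exon3}

start1 = len(exon1)
end1 = start1 + len(intron1)
start2 = end1 + len(exon2)
end2 = start2 + len(intron2)
introns = [("scaffold_1", start1, end1), ("scaffold_1", start2, end2)]

counts = splice_site_counts(genome, introns)
for (donor, acceptor), n in counts.most_common():
    print(donor, acceptor, n)
```

A tally like this only shows which boundary dinucleotides the ESTs support; a real splice site model would also use the flanking positions.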

A putative gene that covers only 60% of the aligned length is not a gene, but rather a waste of my time. If animals, plants, and fungi all have a protein of 400 AA and I don't find one of similar length in a new genome, it seems more likely to me that an exon is missing due to bad gene prediction than that a domain got moved, especially for a very conserved protein. In particular, if I can look at a multiple alignment and immediately tell that the N-terminus is missing, I can be sure that the program didn't work hard enough to find that last exon.
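The length check described above can be sketched in a few lines of Python. This is only an illustration: the function name, the input tuples, and the 0.9 cutoff are assumptions chosen for the example, and the lengths would in practice come from whatever alignment tool is used.

```python
def flag_truncated(models, min_fraction=0.9):
    """Return gene models that are much shorter than their reference protein.

    models       : list of (gene_id, predicted_len_aa, reference_len_aa)
    min_fraction : assumed cutoff; predictions below this fraction of the
                   reference length are reported as likely incomplete
    """
    flagged = []
    for gene_id, predicted_len, reference_len in models:
        fraction = predicted_len / reference_len
        if fraction < min_fraction:
            flagged.append((gene_id, round(fraction, 2)))
    return flagged

# Toy data: a 400 AA conserved protein and two hypothetical predictions.
models = [("gene_a", 395, 400), ("gene_b", 240, 400)]
flagged = flag_truncated(models)
print(flagged)
```

Here `gene_b` covers 60% of the reference and is flagged, while `gene_a` passes.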

I was hoping that this wouldn't be necessary, but a number of published genomes were too reliant on ab initio gene prediction and consequently ended up with a lot of incomplete genes or, in some cases, total nonsense. I won't name any names to protect those involved, but I think that after the Nature papers were published, the genomes were basically forgotten and left in an unusable 'draft' state.

genome hmm homology • 2.9k views

updated 9.4 years ago by jens.keilwagen ▴ 10 • written 12.1 years ago by Wrf ▴ 210

score 0 · Answer 1 · 2013-09-02

This is a widespread problem, and, as far as I know, there is no general solution.

Unfortunately, the problem is more widespread than you outline, because the protein databases have become contaminated. We have recently been looking at InterPro/Pfam domain diagrams on proteins, and found problems like the one you describe. Suppose you look at Pfam topologies and find that 90+% of the proteins are domainA-domainB. If, in a small number of examples, a domain is missing because the protein is truncated, or a domainC is added because additional sequence is present, these "proteins" may be the product of either genome mis-assembly or bad gene prediction.
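To make the idea concrete, here is a minimal Python sketch of that kind of check, assuming the domain architecture of each protein is already available as an ordered tuple of domain names (for example, parsed from a Pfam scan). The function name, the input format, and the 0.9 majority cutoff are assumptions for illustration only.

```python
from collections import Counter

def architecture_outliers(architectures, min_majority=0.9):
    """Find proteins whose domain order differs from a dominant architecture.

    architectures : dict mapping protein id -> tuple of domain names in order
    min_majority  : assumed fraction of proteins that must share one
                    architecture before anything is called an outlier
    """
    if not architectures:
        return None, []
    counts = Counter(architectures.values())
    top_arch, top_n = counts.most_common(1)[0]
    if top_n / len(architectures) < min_majority:
        return top_arch, []  # no clear consensus, so nothing is flagged
    outliers = sorted(p for p, a in architectures.items() if a != top_arch)
    return top_arch, outliers

# Toy family: 18 proteins with domainA-domainB, one truncated, one extended.
architectures = {f"prot_{i}": ("domainA", "domainB") for i in range(18)}
architectures["prot_trunc"] = ("domainA",)
architectures["prot_extra"] = ("domainA", "domainB", "domainC")

consensus, outliers = architecture_outliers(architectures)
print(consensus, outliers)
```

An outlier flagged this way is only a candidate for a closer look; a real domain gain or loss looks exactly the same at this level.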

Unfortunately, once the protein has made it into UniProtKB or RefSeq, few people have the time (or data access) to understand what went wrong or how to fix it.

score 0 · Answer 2 · 2016-04-20

Hi, this question was raised more than two years ago. However, if anyone is still interested in this topic, here is another answer: tools like exonerate and genBlastG can be used. These tools look for genomic regions encoding a given protein. An additional source of information is intron position conservation, i.e., conservation of the sites in the amino acid sequence where exons are joined. Tools like GeneMapper and Projector use intron position conservation, but to my knowledge they are no longer available.
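For readers unfamiliar with the term, the following Python sketch shows one simple way to express intron positions in protein coordinates and compare them between two gene models. It is not how any of the tools above are implemented; the function names and the toy exon lengths are assumptions, and the comparison assumes the two proteins align without gaps, which a real method would have to handle through the alignment.

```python
def intron_positions(cds_exon_lengths):
    """Return (amino_acid_index, phase) for each intron of a gene model.

    cds_exon_lengths : coding lengths of the exons in nucleotides, in order
    amino_acid_index : number of complete codons before the intron
    phase            : 0, 1 or 2 bases of a split codon before the intron
    """
    positions = []
    coding_bases = 0
    for length in cds_exon_lengths[:-1]:  # no intron after the last exon
        coding_bases += length
        positions.append((coding_bases // 3, coding_bases % 3))
    return positions

def shared_introns(reference, target):
    """Intron positions present in both models (assumes ungapped alignment)."""
    return sorted(set(reference) & set(target))

# Toy models: the hypothetical target has lost the second intron.
reference = intron_positions([90, 121, 89])
target = intron_positions([90, 210])
print(reference, target, shared_introns(reference, target))
```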

Recently, we published an alternative approach called GeMoMa (https://nar.oxfordjournals.org/content/early/2016/02/17/nar.gkw092). GeMoMa uses intron position conservation and, in its first version, GT/GC-AG splice sites. However, we are currently working on an extension that uses RNA-seq data (and splice site models) to further improve the predictions.

Nevertheless, keep in mind that, besides the choice of algorithm, the success of homology-based gene prediction depends heavily on the quality of the reference annotation, the quality of the target genome (assembly), and the evolutionary distance between the reference organism (the annotated organism) and the target organism (the organism to be annotated).

Best regards, Jens