I ran exonerate 2.2.0 in client-server mode as follows::
nohup exonerate my_proteome.pep.frag001 localhost:12901 --model p2g --geneseed 250 --showtargetgff yes \
--ryo ">%qi length=%ql alnlen=%qal\n>%ti length=%tl alnlen=%tal\n" \
--showvulgar no --showalignment no 2> nohup.exonerate.my_proteome.pep.frag001 > exonerate_my_proteome.pep.frag001.out &
I also ran it in a more sensitive mode, without the "--geneseed 250" option, and then converted the output to GFF3 using the process_exonerate_gff3.pl script.
In both cases the result files are highly redundant (multiple matches of similar proteins to one genome fragment). Some matches are most likely artifacts (e.g. a protein match jumping over 100 kb full of other genes). Also, since neither the draft genome file nor the protein library (e.g. A.thaliana) has been masked or cleaned of repetitive sequences, I am at times getting thousands of hits (= one protein -> multiple genome segments). The last problem can be partially fixed (I have an incomplete DNA repeat library, and the A.thaliana proteins can be cleaned up based on their descriptions and a hmmer search with the pfam_07727 domain), but even after that there are a number of proteins (e.g. those with a pentatricopeptide repeat) mapping almost everywhere.
Also, it seems that running without the --geneseed 250 option, which increases the running time sevenfold, generates more spurious repetitive matches.
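To make the protein clean-up step concrete, here is a minimal Python sketch of how it could be done. It assumes the domain search was run with hmmsearch and its --tblout table was saved, so that the first column is the protein name and the fifth is the full-sequence E-value. The function names and the 1e-5 cutoff are my own placeholders, not anything prescribed by hmmer or exonerate.

```python
def ids_from_tblout(lines, max_evalue=1e-5):
    """Collect protein IDs that hit the repeat domain in an hmmsearch --tblout table.

    max_evalue is an assumed cutoff; tune it for your data.
    """
    hits = set()
    for line in lines:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        # Column 1 = target (protein) name, column 5 = full-sequence E-value.
        if float(fields[4]) <= max_evalue:
            hits.add(fields[0])
    return hits


def filter_fasta(lines, drop_ids):
    """Yield FASTA lines, skipping every record whose ID is in drop_ids."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            keep = name not in drop_ids
        if keep:
            yield line
```

Typical use would be to open the tblout file and the protein FASTA, pass the file handles to these two functions, and write the surviving lines to a new FASTA before splitting it into fragments for exonerate.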
Hence my questions:
- what are the recommended ways of running exonerate in p2g mode?
- how hard should the genome be masked? (RepeatMasker mode)
- are there other PFAM domains that can be used to get rid of repeat proteins?
- is there a great advantage to the "--refine region" switch?
- last but not least: do you use any gff/exonerate output "cleaners" to get rid of suspicious or simply redundant matches?
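To illustrate the kind of "cleaner" the last question is about, here is a rough Python sketch. It assumes tab-separated, 9-column GFF lines in which each match has one "gene" feature (as in exonerate's target GFF output). It drops matches spanning more than an assumed 100 kb and then, per sequence and strand, keeps the best-scoring match and discards lower-scoring ones that overlap an already kept match by at least 80% of their own length. Both thresholds and all function names are hypothetical choices of mine, and this greedy scheme is only one possible approach.

```python
from collections import defaultdict


def parse_gene_lines(lines):
    """Yield (seqid, start, end, strand, score) for 'gene' features in GFF lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        # A missing score (".") is treated as 0.0.
        score = float(cols[5]) if cols[5] != "." else 0.0
        yield cols[0], int(cols[3]), int(cols[4]), cols[6], score


def filter_matches(lines, max_span=100_000, max_overlap=0.8):
    """Drop over-long matches, then greedily remove redundant overlapping ones.

    max_span and max_overlap are assumed values, not exonerate defaults.
    """
    by_locus = defaultdict(list)
    for seqid, start, end, strand, score in parse_gene_lines(lines):
        if end - start + 1 > max_span:
            continue  # likely artifact jumping over other genes
        by_locus[(seqid, strand)].append((start, end, score))

    kept = []
    for (seqid, strand), hits in by_locus.items():
        accepted = []
        # Visit the best-scoring matches first.
        for start, end, score in sorted(hits, key=lambda h: -h[2]):
            length = end - start + 1
            redundant = False
            for a_start, a_end, _ in accepted:
                overlap = min(end, a_end) - max(start, a_start) + 1
                if overlap > 0 and overlap / length >= max_overlap:
                    redundant = True
                    break
            if not redundant:
                accepted.append((start, end, score))
        kept.extend((seqid, s, e, strand, sc) for s, e, sc in accepted)
    return sorted(kept)
```

The function returns the coordinates of the retained matches; mapping them back to the full GFF records (exons, introns and so on) is left out of the sketch.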
I am using it for novel plant genome annotation. Plant protein data sets are either fishy (lots of repeats, bad predictions etc.) or curated but limited, so one cannot expect to take the proteome of A and map it 1:1 to B. Moreover, there seem to be whole (functional, I presume) protein families with protein repeats (e.g. the pentatricopeptide repeat), with hundreds of members in A.thaliana. Add to that tandemly repeated, nearly identical genes. While it is not a total mess (the results look mostly sensible), there are a lot of cases where exonerate gene models simply fail.
I understand; it must be a problem with the dataset. An alternative may be blat, which is also designed to align proteins and mRNA to a genome, but you will have to install it locally. The problem is that plants have a lot of duplications :-(