I am trying to use CONRAD for gene prediction on a novel plant genome sequence. As a training set it asks for at least 200 correct genes, but using more of them should give me more accurate predictions.
Here I run into a chicken-and-egg problem: it is hard to get even 200 reliable genes for a novel plant (NCBI has 500+ protein entries, but these are mostly mitochondrial / chloroplast genes). I already have AUGUSTUS-predicted genes, some with 100% RNA-Seq support, but there are at least a few problems with them:
- due to the lack of a training set, AUGUSTUS was run with Arabidopsis gene models
- it is a stretch to call an output of a gene predictor "reliable"
- spliced RNA-Seq mappers and gene predictors cut corners by assuming canonical splice sites
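To gauge how much the canonical-splice-site shortcut matters for a given set of gene models, the intron boundaries can be classified directly from the genomic sequence. The following is only a minimal sketch: the function name `classify_intron` and the coordinate convention (0-based, end-exclusive, intron given on the forward strand) are assumptions made for illustration, not part of any tool mentioned here.

```python
def classify_intron(genome: str, start: int, end: int) -> str:
    """Classify the intron genome[start:end] by its terminal dinucleotides.

    Assumes 0-based, end-exclusive coordinates and an intron that is
    already given on the forward strand (hypothetical convention).
    """
    intron = genome[start:end].upper()
    if len(intron) < 4:
        return "too_short"
    boundary = (intron[:2], intron[-2:])
    known = {
        ("GT", "AG"): "GT-AG",  # major canonical class
        ("GC", "AG"): "GC-AG",  # minor, but well documented
        ("AT", "AC"): "AT-AC",  # minor class
    }
    return known.get(boundary, "non-canonical")


if __name__ == "__main__":
    toy_genome = "AAAGTCCCCAGTTT"
    print(classify_intron(toy_genome, 3, 11))
```

Counting the classes over all introns of a candidate training set shows whether non-canonical introns are absent simply because the tools never report them.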
I have ca. 30k ESTs from the plant, but from various strains, plus several lanes of Illumina RNA-Seq, also from various strains. The genomic sequence (mostly 454) is at draft stage, with multiple gaps likely swallowing some exons. There are also a few Sanger-sequenced BACs.
My idea would be to:
- start with the 1000 (2000?) (non-mitochondrial and non-chloroplast) proteins most highly conserved among 5(?) plant species
- filter those against a repeat library (transposons etc.)
- map these to the genome using exonerate
- map all RNA-Seq and all ESTs to regions identified above
- assemble the RNA-Seq reads and ESTs identified in the previous step => cDNAs
- check whether the cDNA translations are sane by comparing them to protein sequences from other species
- align the reliable cDNAs to the genome
- manually check as many genes from this final set as possible.
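The translation sanity check in the list above could be pre-screened automatically before any manual inspection. The sketch below is one possible way to do it, under stated assumptions: the cDNA is already in sense orientation, the standard genetic code applies, and the 0.8 to 1.2 length-ratio window is an arbitrary illustrative threshold. The helper names (`translate`, `longest_orf`, `looks_sane`) are hypothetical, and a real check would also align the peptide to the reference protein rather than compare lengths only.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str) -> str:
    """Translate frame 0 of seq; unknown codons (e.g. with N) become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def longest_orf(cdna: str) -> str:
    """Longest Met-to-stop peptide over the three forward frames."""
    best = ""
    for frame in range(3):
        protein = translate(cdna[frame:])
        # Drop the last piece: it is not terminated by a stop codon.
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best):
                best = segment[start:]
    return best


def looks_sane(cdna: str, ref_len: int, min_cov: float = 0.8,
               max_cov: float = 1.2) -> bool:
    """True if the longest ORF length is within the window around ref_len."""
    orf = longest_orf(cdna)
    return bool(orf) and min_cov <= len(orf) / ref_len <= max_cov


if __name__ == "__main__":
    toy_cdna = "CCATGGCTAAATGACC"
    print(longest_orf(toy_cdna), looks_sane(toy_cdna, ref_len=3))
```

cDNAs that fail such a filter are not necessarily wrong (the draft assembly has gaps), but they are poor candidates for a training set.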
My questions:
- is it possible to speed up the whole procedure?
- going in the opposite direction: how to improve it/make it better?
Novel plant genome? How novel are we talking about? The odds are there are some closely related species whose cDNAs you can map with GMAP. Also try a couple of ab initio prediction programs and use a combiner to get the consensus.
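As a rough illustration of the consensus idea, the sketch below keeps only gene models that at least two predictors report with exactly the same exon chain. This is a deliberately naive stand-in: actual combiner programs weight the different evidence sources rather than count exact matches. The model representation `(seqid, strand, exons)`, the function name `consensus_models`, and the predictor labels other than AUGUSTUS are hypothetical.

```python
from collections import Counter


def consensus_models(predictions, min_votes=2):
    """Return gene models reported identically by >= min_votes predictors.

    predictions maps a predictor name to a list of models; each model is
    a hashable tuple (seqid, strand, ((exon_start, exon_end), ...)).
    """
    votes = Counter()
    for models in predictions.values():
        for model in set(models):  # each predictor votes once per model
            votes[model] += 1
    return sorted(m for m, n in votes.items() if n >= min_votes)


if __name__ == "__main__":
    a = ("scaf1", "+", ((100, 200), (300, 400)))
    b = ("scaf1", "+", ((100, 200), (310, 400)))  # differs at one boundary
    c = ("scaf2", "-", ((50, 90),))
    predictions = {
        "augustus": [a, c],
        "predictor_B": [a],
        "predictor_C": [b, c],
    }
    for model in consensus_models(predictions):
        print(model)
```

Requiring exact agreement is strict, which suits the goal of a small but reliable training set better than a goal of complete annotation.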
I think your approach looks very well thought out already. Mapping transcripts (RNA-Seq) is considered state-of-the-art for gene-structure prediction. Just remember that full-length(?) cDNA was used for training in the Medicago truncatula gene annotation. Possible improvement: there are many more eukaryotic gene prediction tools available (e.g. EUGENE); I can post a list if you like.
It is from the amaranth family: not that novel/strange. Tblastn of the AUGUSTUS predictions picks up sensible ESTs from multiple species for parts of proteins not recognized by blastp. I am a bit concerned that almost everything that can be spotted by GMAP at the nucleotide level will already have been detected by exonerate using protein2genome. But I will check GMAP with ESTs from other species.