Hi
I am annotating a de novo genome using MAKER.
I first ran MAKER with EST and protein evidence from a closely related species, with est2genome and protein2genome turned on.
I then ran MAKER with SNAP switched on, using the output of the previous step to train SNAP.
I then repeated the MAKER run with SNAP switched on three more times, retraining SNAP from each new round's output.
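For each round the SNAP retraining followed the usual maker2zff/fathom/forge recipe, roughly (species and directory names are placeholders):

    cd genome.maker.output
    gff3_merge -d genome_master_datastore_index.log   # collect the round's GFF3
    maker2zff genome.all.gff                          # convert MAKER gene models to ZFF
    fathom genome.ann genome.dna -categorize 1000     # pull out usable gene models
    fathom uni.ann uni.dna -export 1000 -plus         # export unique genes plus 1 kb flanks
    forge export.ann export.dna                       # estimate SNAP parameters
    hmm-assembler.pl my_species . > my_species.hmm
    # next round in maker_opts.ctl: snaphmm=my_species.hmm, est2genome=0, protein2genome=0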
Each time the number of predicted genes decreased: the first run predicted ~40,000 genes, the second ~12,000, and the third ~1,200...
Surely this can't be correct? I am expecting around 10,000-20,000 genes for my organism.
Sorry, I am new to gene prediction and annotation. What I am asking is: why is SNAP drastically reducing the number of predicted genes with each iteration?
Should I just take the run whose gene count is nearest to what I expected and proceed to Augustus?
Thanks
You should visualise all the annotations along with the protein2genome and est2genome tracks in a genome browser, e.g. JBrowse. You will probably then understand what is going on.
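With JBrowse 1, loading the genome and the MAKER GFF3 (which carries the est2genome, protein2genome and gene-prediction features) looks roughly like this (paths and track labels are placeholders):

    prepare-refseqs.pl --fasta genome.fa --out data
    flatfile-to-json.pl --gff genome.all.gff --trackLabel maker_genes \
                        --type mRNA --out data
    # repeat flatfile-to-json.pl with --type expressed_sequence_match or
    # protein_match to add the est2genome and protein2genome alignment tracks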
Unfortunately, you can sometimes "overtrain" your ab initio gene predictors. More information can be found by searching the MAKER developers Google group https://groups.google.com/forum/#!forum/maker-devel. I really don't have much experience with SNAP, rather with Augustus. Training Augustus well is actually very difficult. Sometimes BUSCO does a better job of the initial training of Augustus, and retraining with MAKER-derived evidence actually makes the subsequent Augustus ab initio models worse. I don't know how to test the sensitivity and specificity of SNAP ab initio models, but I wrote about how to do this for Augustus ab initio models, and how I trained MAKER for a dromedary camel, in the "analysis-steps-for-manuscript.txt" available from the following Dryad repository:
https://datadryad.org/stash/dataset/doi:10.5061/dryad.6rp36b6
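For reference, BUSCO's Augustus retraining is switched on with the --long option; roughly (BUSCO3 syntax; the lineage path and names are placeholders):

    run_BUSCO.py -i genome.fa -o my_species_busco -l eukaryota_odb9 \
                 -m genome -c 8 --long
    # --long runs Augustus parameter optimisation as part of the assessment;
    # the retrained parameter set ends up under the run's augustus_output directory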
I don't completely agree with you about training Augustus with BUSCO. I have tested this several times and have always found that training Augustus with BUSCO, rather than with results from MAKER evidence-based annotation, gives worse results. I tested it with BUSCO3 and tested again recently with BUSCO4, thinking the results might now be similar to or even better than the MAKER approach, but it is still not the case.
I wrote an explanation of the workflow for selecting the best gene models from a MAKER evidence-based annotation here: gene set filter/selection for training ab initio annotation tools. We automated the workflow into a pipeline (recently converted from bpipe to Nextflow) to train specifically Augustus (and SNAP, using the same selected gene models). You can find it here: https://github.com/NBISweden/pipelines-nextflow.
The difference between training Augustus within BUSCO4 and training it from MAKER is smaller, but in my view it is still worse. Here is an example of a result on an insect:
In this result I even ran MAKER using only proteins... when I use species-specific transcriptomes, the Augustus training on the MAKER result is even better.
@Juke-34 Thank you for the links. You are probably doing a much better job than I have done with training Augustus on MAKER predictions. I have always found the opposite between BUSCO and MAKER for training Augustus, but that is at least for mammals, a bird, and one turtle (Kemp's ridley sea turtle, marsh rice rat, garden warbler, different camel species): training Augustus with BUSCO using the more comprehensive odb9 databases did not work well, and I found the best results when training Augustus with BUSCO using eukaryota_odb9. I am judging "better" by the specificity and sensitivity results reported by Augustus with a command like the following:
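    # (sketch -- species name and held-out GenBank test set are placeholders)
    augustus --species=my_species test.gb > test.out
    # the accuracy table near the end of test.out reports nucleotide-, exon-
    # and gene-level sensitivity/specificity
    grep -A 30 Evaluation test.out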
The training sets come from running the MAKER predictions through autoAug.pl; see Step 25 of the above-mentioned analysis-steps-for-manuscript.txt for more details.
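A typical autoAug.pl call is roughly of this form (file names are placeholders; check autoAug.pl --help for the options available in your Augustus version):

    autoAug.pl --species=my_species --genome=genome.fa \
               --trainingset=maker_training.gb --singleCPU
    # --trainingset takes the gene structures derived from the MAKER models;
    # autoAug.pl then trains the Augustus parameters for the new species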
Interesting, thank you. Great work by the way.
There are many different ways to select gene models for training purposes. It is true that what you do in the work you mention is quite light (filtering only by AED score, from what I understand), so I understand that BUSCO training was better in this case. In our protocol we try to follow the recommendations made by the Augustus group to select the best possible gene models... so there are a few more steps.
Well, I had also tried many different things: some steps similar to your pipeline (e.g. redundancy removal) and other things (e.g. AED filtering, redundancy removal, and randomization), etc. Still to no avail.
OK, I had not seen the redundancy removal; this is one of the most important steps.
No, you were correct that I only showed AED-based filtering in the analysis steps, but there were also attempts at combining AED filtering, redundancy removal, and randomization that I tried but did not document.
Apologies for necroing this thread.
What are your thoughts on using the BRAKER2 pipeline to train Augustus? Secondly, I assume your use of MAKER evidence means RNA-supported MAKER models and not SNAP-based support, since you use SNAP later on in your ab initio pipeline?
I am not sure about comparing BRAKER2 with just proteins to train Augustus, but in my experience with a dipteran fly, using BRAKER2 to train Augustus with Arthropoda OrthoDB 10 proteins and species-specific RNA-Seq reads, the BRAKER2-trained Augustus was comparable to taking the BRAKER2 output, processing it with MAKER, and using that to train Augustus with the pipeline from https://github.com/NBISweden/pipelines-nextflow. I did not try running MAKER alone and then the above-mentioned pipeline to compare to the BRAKER2-trained Augustus.
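A BRAKER2 run of that sort looks roughly like this (file names are placeholders; --etpmode is needed in some BRAKER2 versions when combining protein and RNA-Seq evidence, so check braker.pl --help for your version):

    braker.pl --genome=genome.softmasked.fa \
              --prot_seq=arthropoda_odb10_proteins.fa \
              --bam=rnaseq_sorted.bam \
              --etpmode --softmasking --species=my_fly --cores=16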
I've had a similar problem. My de novo genome and transcriptome assemblies recover ~98% of BUSCOs, but my annotations retrieve only ~2% after 3 rounds of MAKER. For the protein input, I tried both the proteome of a close relative and the UniProt/Swiss-Prot database, and ended up sticking with the latter since it gave marginally better results.
I used both Augustus and SNAP training. For Augustus, I tried training with the initial MAKER round (and subsequent rounds), and when that didn't improve results I tried a predefined Augustus model based on a closely related model organism. But I am still only recovering < 3% of BUSCOs.
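For context, the recovery numbers above come from running BUSCO in proteins mode on the MAKER protein output, roughly like this (lineage and file names are placeholders):

    busco -i round3.all.maker.proteins.fasta -l your_lineage_odb10 \
          -m proteins -o busco_annotation_round3 -c 8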
I'm looking into the Nextflow pipeline, but am having trouble installing all the dependencies (e.g. tcl, go, modules, singularity) and setting up a valid project path (I've tried various paths/structures). However, it seems like the Nextflow ab initio pipeline simply trains Augustus on a previous MAKER GFF3, so is this fundamentally any different from what I've already tried?
For people who have this problem in the future: I seem to have resolved it by creating my own custom repeat library to use with MAKER. I'm still in the early rounds of training with Augustus, but BUSCO recovery has already increased to nearly 80%. To produce the de novo repeat library, I used the EDTA workflow, which combines several packages, including RepeatModeler and LTRharvest.
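Roughly, the repeat-library step looked like this (file names are placeholders; check the EDTA documentation for the options in your version):

    EDTA.pl --genome genome.fa --anno 1 --threads 16
    # EDTA writes a combined TE library (named along the lines of
    # genome.fa.mod.EDTA.TElib.fa); point MAKER at it in maker_opts.ctl:
    #   rmlib=genome.fa.mod.EDTA.TElib.fa
    #   model_org=simple   # commonly recommended so RepeatMasker still masks simple repeats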
Does overtraining result in fewer predictions then?
Does this mean I should take the results of an earlier iteration, one nearer to what would be expected?
As mentioned by @jean.elbers, probably yes. I really advise you to visualise the results to make sense of them. You will probably see in your case that the SNAP predictions tend to merge loci.
I'm installing JBrowse as I type.
Thanks again Juke