Hello everyone,
I trained an Augustus model for Drosophila mojavensis using its genome and GFF. However, when I use this model on the Drosophila rucux genome, it generates gene overexpression compared to the annotation obtained with the default Drosophila melanogaster model (species = fly).
Is it normal for Augustus to inflate genes in the annotation of new species with limited data?
Could the excessive number of genes be false positives?
I think your question could do with some more information. By gene overexpression do you mean just inflated gene counts in the annotation? Some variability in the number of genes is expected. Large differences not so much.
The more evidence you use in your annotations, the better they will be. I recommend trying a variety of tools with as much evidence as you can gather (RNA-Seq in particular) and then assessing your annotation with BUSCO.
Consider using BRAKER3 and, for something different, Helixer.
Yes, it is quite possible that those are over-predictions or false positives (not over-expression, that's another kind of analysis ;) ).
However, without looking into the details of those genes it's hard to say. You can, for instance, run a BLAST analysis on the suspicious genes and see what the hits are (if any). If no hits are found, they are indeed likely over-predictions.
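As a rough illustration of the BLAST check described above, here is a minimal Python sketch that lists predicted genes with no convincing hit. It assumes you ran BLAST with the default tabular output (`-outfmt 6`, where the query ID is column 1 and the e-value is column 11). The function name, the gene IDs and the 1e-5 e-value cutoff are my own placeholders, not something from Augustus or BLAST, so adjust them to your data.

```python
def queries_without_hits(query_ids, blast_lines, max_evalue=1e-5):
    """Return the query IDs that have no hit with e-value <= max_evalue.

    blast_lines: lines of BLAST tabular output (default -outfmt 6 columns).
    max_evalue: hypothetical cutoff, chosen here only for illustration.
    """
    with_hits = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Default -outfmt 6: qseqid is column 1, evalue is column 11.
        qseqid, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            with_hits.add(qseqid)
    return [q for q in query_ids if q not in with_hits]


# Made-up example: g1.t1 has a strong hit, g3.t1 only a weak one, g2.t1 none.
example_hits = [
    "g1.t1\tsp|P12345|X\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300",
    "g3.t1\tsp|Q99999|Y\t30.0\t40\t28\t0\t1\t40\t5\t44\t2.5\t25",
]
no_hit = queries_without_hits(["g1.t1", "g2.t1", "g3.t1"], example_hits)
print(no_hit)
```

Genes that end up in this list are only candidates for over-prediction. Lineage-specific or fast-evolving genes can also lack hits, so it is worth looking at a few of them by hand.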
How do you determine (or think) that there is over-prediction? Just based on the numbers, or did you do a gene-by-gene comparison of the two predictions? You could also be seeing gene splits, where a single gene from the default-model prediction is split in two in your prediction. There are quite a number of factors and outcomes in play here.
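To go beyond raw counts, a comparison along these lines could flag candidate gene splits. This is only a sketch under a few assumptions: both predictions are in GFF with `gene` features in column 3, a "split" is taken to mean one gene from the default-model run overlapping two or more genes from the custom-model run on the same sequence and strand, and the function names and example coordinates are invented for illustration.

```python
from collections import defaultdict


def read_genes(gff_lines):
    """Collect gene features as (start, end, attributes) per (seqid, strand)."""
    genes = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "gene":
            continue
        genes[(f[0], f[6])].append((int(f[3]), int(f[4]), f[8]))
    return genes


def candidate_splits(reference, prediction):
    """Map each reference gene to the >= 2 predicted genes overlapping it."""
    splits = {}
    for key, ref_genes in reference.items():
        for r_start, r_end, r_id in ref_genes:
            overlapping = [
                p_id
                for p_start, p_end, p_id in prediction.get(key, [])
                if p_start <= r_end and p_end >= r_start
            ]
            if len(overlapping) >= 2:
                splits[r_id] = overlapping
    return splits


# Made-up example: the custom model splits the first default-model gene in two.
default_model = read_genes([
    "chr1\tAUGUSTUS\tgene\t100\t5000\t.\t+\t.\tg1",
    "chr1\tAUGUSTUS\tgene\t8000\t9000\t.\t-\t.\tg2",
])
custom_model = read_genes([
    "chr1\tAUGUSTUS\tgene\t100\t2000\t.\t+\t.\tg1",
    "chr1\tAUGUSTUS\tgene\t2500\t5000\t.\t+\t.\tg2",
    "chr1\tAUGUSTUS\tgene\t8000\t9000\t.\t-\t.\tg3",
])
splits = candidate_splits(default_model, custom_model)
print(splits)
```

The all-against-all overlap check is fine for a quick look but slow on a whole genome. For real data an interval tool such as bedtools would be the more practical choice.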
While I'm generally in favor of building your own models, you should realize that this needs to be done thoroughly and with the correct insights & data. Failing to do so could result in a model that performs poorly.
Do you have any a priori indication that the default model is performing sub-optimally? (== why did you decide to build your own model for this species?)