Question

False-positive annotations from Augustus?

0

5 weeks ago

Joseph • 0

Hello everyone, I trained an Augustus model for Drosophila mojavensis using its genome and GFF annotation. However, when I apply this model to the Drosophila rucux genome, it generates gene overexpression compared with the annotation produced by the default Drosophila melanogaster model (species = fly).

Is it normal for Augustus to inflate genes in the annotation of new species with limited data?

Could the excessive number of genes be false positives?

Augustus training • 409 views

ADD COMMENT • link updated 5 weeks ago by lieven.sterck 15k • written 5 weeks ago by Joseph • 0

1

I think your question could do with some more information. By gene overexpression do you mean just inflated gene counts in the annotation? Some variability in the number of genes is expected. Large differences not so much.

The more evidence you use in your annotations, the better they will be. I recommend trying a variety of tools with as much evidence as you can gather (RNA-Seq in particular) and then assessing your annotation with BUSCO.

Consider using BRAKER3 and, for something different, Helixer.

ADD REPLY • link 5 weeks ago by Jack Tierney ▴ 420

score 1 · Answer 1 · 2025-04-30

Hi,

Yes, it is quite possible that those are over-predictions or false positives (not over-expression, that's another kind of analysis ;) ). However, without looking into the details of those genes it's hard to say. You can, for instance, run a BLAST analysis on the suspicious genes and see what the hits are (if any). If no hits are found, they are indeed likely over-predictions.
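To make the BLAST check concrete, here is a minimal sketch of how you could list the predicted genes that got no convincing hit. It assumes you ran BLAST with the 12-column tabular output (`-outfmt 6`), where column 1 is the query ID and column 11 is the e-value. The function name, the gene IDs, and the `1e-5` cutoff are my own illustrative choices, not anything Augustus or BLAST prescribes.

```python
import csv
import io


def queries_without_hits(query_ids, blast_tabular, max_evalue=1e-5):
    """Return query IDs with no BLAST hit at or below max_evalue.

    blast_tabular is text in BLAST's 12-column tabular format (-outfmt 6):
    column 1 is the query ID, column 11 is the e-value.
    The e-value cutoff is an arbitrary example threshold.
    """
    with_hits = set()
    for row in csv.reader(io.StringIO(blast_tabular), delimiter="\t"):
        # Skip blank, truncated, or comment lines.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        if float(row[10]) <= max_evalue:
            with_hits.add(row[0])
    # Preserve the input order of the query IDs.
    return [q for q in query_ids if q not in with_hits]


# Toy example with made-up gene IDs and hits.
example_hits = (
    "g1.t1\tsp|P1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300\n"
    "g3.t1\tsp|P2\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25\n"
)
print(queries_without_hits(["g1.t1", "g2.t1", "g3.t1"], example_hits))
```

In this toy input, `g2.t1` has no hit at all and `g3.t1` only has a weak one, so both end up on the list of candidates to inspect more closely. A missing hit is a hint rather than proof, since lineage-specific genes can also lack hits.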

How did you determine (or why do you think) that there is over-prediction? Just based on the numbers, or did you do a gene-by-gene comparison of the two predictions? You could also be seeing gene splits, where a single gene in the prediction from the default model is split in two in your prediction. There are quite a number of factors and outcomes in play here.
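A gene-by-gene comparison could look roughly like the sketch below. It assumes both predictions are GFF files on the same assembly with `gene` features in column 3 (as in Augustus output), and it treats any same-strand overlap as "the same locus", which is a simplification. All function names and the toy coordinates are hypothetical.

```python
def read_genes(gff_text):
    """Collect (seqid, start, end, strand) for every 'gene' feature in GFF text."""
    genes = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) >= 9 and cols[2] == "gene":
            genes.append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return genes


def overlaps(a, b):
    """True if two genes share sequence, strand, and at least one base."""
    return a[0] == b[0] and a[3] == b[3] and a[1] <= b[2] and b[1] <= a[2]


def compare(custom, default):
    """Return (novel, splits).

    novel:  custom-model genes with no overlapping default-model gene.
    splits: default-model genes overlapped by two or more custom-model genes.
    """
    novel = [g for g in custom if not any(overlaps(g, d) for d in default)]
    splits = [d for d in default if sum(overlaps(g, d) for g in custom) >= 2]
    return novel, splits


# Toy data: one default gene, split in two by the custom model,
# plus one custom-only gene on the opposite strand.
default_gff = "chr1\tAUGUSTUS\tgene\t100\t1000\t.\t+\t.\tg1\n"
custom_gff = (
    "# toy example\n"
    "chr1\tAUGUSTUS\tgene\t100\t400\t.\t+\t.\tg1\n"
    "chr1\tAUGUSTUS\tgene\t600\t1000\t.\t+\t.\tg2\n"
    "chr1\tAUGUSTUS\tgene\t5000\t6000\t.\t-\t.\tg3\n"
)
novel, splits = compare(read_genes(custom_gff), read_genes(default_gff))
print(len(novel), "custom-only genes;", len(splits), "possible split genes")
```

The all-against-all loop is fine for a quick look at a few thousand genes. For a whole genome you would probably want an interval-based tool such as bedtools instead.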

While I'm usually in favour of building your own models, you should realize that this needs to be done thoroughly and with the right insights and data. Failing to do so could result in a model that performs poorly.

Do you have any a priori indication that the default model is performing sub-optimally? (In other words, why did you decide to build your own model for this species?)