I am annotating a plant genome using MAKER-P, with EST and transcriptome data as evidence. I reduced the redundancy in the ESTs using CD-HIT. After three rounds of MAKER (est2genome and protein2genome, followed by training SNAP twice and training Augustus twice), I now have a total gene set. I expected more genes than I now have, although this is a novel genome with no reference.
How can I tell if my annotation is complete?
Thanks
What is your expectation based on? You could compare with related species.
Closely related species have gene counts of about 26,857, 23,197, and 22,427, but the paper that reported these gives a completeness of 86.29% against CEGs using the CEGMA pipeline.
And how many do you have?
I have 17,973 genes, with a BUSCO score of C:68.4%[S:64.5%,D:3.9%],F:6.0%,M:25.6%,n:1440
The BUSCO score for the genome assembly is 93.7%
I ran BUSCO with this command line
C:68.4%[S:64.5%,D:3.9%],F:6.0%,M:25.6%,n:1440
The BUSCO score for the genome assembly is 93.7%
You lost about 25% of the BUSCO genes during the annotation process. This is not good.
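To make the arithmetic behind that 25% explicit, here is a minimal Python sketch. It assumes the one-line BUSCO summary format quoted above and takes the 93.7% genome-assembly figure as the "complete" value; the function name `parse_busco` is just an illustrative helper, not part of BUSCO.

```python
import re

def parse_busco(summary):
    """Parse a one-line BUSCO summary into a dict of floats keyed by C, S, D, F, M, n."""
    values = {}
    for key, number in re.findall(r"([CSDFMn]):(\d+(?:\.\d+)?)", summary):
        values[key] = float(number)
    return values

# Annotation-level summary quoted in the thread
annotation = parse_busco("C:68.4%[S:64.5%,D:3.9%],F:6.0%,M:25.6%,n:1440")

# Genome-assembly completeness quoted in the thread (assumed to be the "C" value)
genome_complete = 93.7

# Difference in complete BUSCOs between assembly and annotation, in percentage points
lost = round(genome_complete - annotation["C"], 1)

# Approximate number of BUSCO genes that the difference corresponds to
lost_count = round(lost / 100 * annotation["n"])

print(f"Complete BUSCOs lost during annotation: {lost} points (~{lost_count} of {int(annotation['n'])})")
```

With the numbers in this thread the difference is 25.3 percentage points, or roughly 364 of the 1440 BUSCOs that are found in the assembly but not in the gene set.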
I am trying to use BRAKER for re-annotation and evaluation, but BRAKER has been very difficult to use. It keeps dying without any error message.
Do you have any suggestions on how to recover the lost 25% of BUSCOs?
Did you activate the keep_preds parameter?
No, I did not activate keep_preds. When I set keep_preds=1, it gives proteins with an AED of 1. See this example:
mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|6|0|661
That is normal: it adds predictions that do not have any support from the evidence (EST or protein).
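For anyone who wants to check this on their own headers, here is a rough Python sketch that pulls the AED and QI values out of a header like the one above. The meaning given to the nine QI fields follows my reading of the MAKER documentation, so treat the field names as assumptions; `parse_maker_header` is a hypothetical helper.

```python
import re

# Assumed meaning of the nine QI fields, in order (per my reading of the MAKER docs)
QI_FIELDS = (
    "five_prime_utr_length",
    "splice_sites_confirmed_by_est",
    "exons_overlapping_est",
    "exons_overlapping_est_or_protein",
    "splice_sites_confirmed_by_ab_initio",
    "exons_overlapping_ab_initio",
    "exon_count",
    "three_prime_utr_length",
    "protein_length",
)

def parse_maker_header(header):
    """Extract AED and the QI fields from a MAKER protein/transcript header."""
    # \b keeps this from matching the "AED" inside "eAED"
    aed = float(re.search(r"\bAED:([\d.]+)", header).group(1))
    qi_values = re.search(r"QI:([\d.|]+)", header).group(1).split("|")
    qi = dict(zip(QI_FIELDS, (float(v) for v in qi_values)))
    # A model counts as evidence-supported if any EST/protein overlap field is non-zero
    has_evidence = any(
        qi[name] > 0
        for name in (
            "splice_sites_confirmed_by_est",
            "exons_overlapping_est",
            "exons_overlapping_est_or_protein",
        )
    )
    return aed, qi, has_evidence

aed, qi, has_evidence = parse_maker_header(
    "mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|6|0|661"
)
print(aed, int(qi["exon_count"]), int(qi["protein_length"]), has_evidence)
```

Read this way, the example is a 6-exon model encoding a 661-residue protein that fully matches an ab initio prediction but overlaps no EST or protein alignment, which is why its AED is 1.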
Can one proceed with these unsupported predictions?
So run with keep_preds. If you end up with between 25,000 and 30,000 genes, that is fine, and your BUSCO score will be much better. Then you can also try a run without SNAP and check the BUSCO score. Deactivating SNAP can give better results.
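To compare runs (with and without keep_preds, with and without SNAP), a small Python sketch like the following can count genes and split transcripts by AED. It assumes MAKER's GFF3 output with `maker` in the source column and an `_AED=` attribute on mRNA features; the sample lines and the function name `summarize_maker_gff` are made up for illustration, and the 25,000 to 30,000 range is the one suggested above.

```python
import re

def summarize_maker_gff(lines, expected=(25000, 30000)):
    """Count MAKER genes and split mRNAs into evidence-supported (AED < 1) and unsupported."""
    genes = supported = unsupported = 0
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Only consider well-formed feature lines produced by MAKER itself
        if len(cols) != 9 or cols[1] != "maker":
            continue
        if cols[2] == "gene":
            genes += 1
        elif cols[2] == "mRNA":
            match = re.search(r"(?:^|;)_AED=([\d.]+)", cols[8])
            if match is None:
                continue
            if float(match.group(1)) < 1.0:
                supported += 1
            else:
                unsupported += 1
    return {
        "genes": genes,
        "supported_mrnas": supported,
        "unsupported_mrnas": unsupported,
        "in_expected_range": expected[0] <= genes <= expected[1],
    }

# Hypothetical GFF3 lines, for illustration only
sample = [
    "##gff-version 3",
    "\t".join(["scaffold_1", "maker", "gene", "100", "900", ".", "+", ".", "ID=gene1"]),
    "\t".join(["scaffold_1", "maker", "mRNA", "100", "900", ".", "+", ".",
               "ID=gene1-mRNA-1;Parent=gene1;_AED=0.12;_eAED=0.12;_QI=0|1|1|1|1|1|3|0|250"]),
    "\t".join(["scaffold_1", "maker", "gene", "2000", "5000", ".", "-", ".", "ID=gene2"]),
    "\t".join(["scaffold_1", "maker", "mRNA", "2000", "5000", ".", "-", ".",
               "ID=gene2-mRNA-1;Parent=gene2;_AED=1.00;_eAED=1.00;_QI=0|0|0|0|1|1|6|0|661"]),
]

summary = summarize_maker_gff(sample)
print(summary)
```

On a real run you would pass an open file handle for the merged MAKER GFF3 instead of `sample`, then rerun BUSCO on the resulting protein set to see whether the extra predictions actually recover the missing BUSCOs.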