9 weeks ago
mut
I performed the first round of MAKER annotation for my genome using RNA-seq and homologous proteins. The BUSCO assessment of the resulting proteins showed a completeness of only 64%. I then selected genes with AED less than 0.1 to train Augustus and SNAP, and performed a second round of MAKER annotation. However, the BUSCO assessment of the resulting proteins is still very low, only 53%. Why is this happening?
What did you do before the MAKER annotation? Did you do repeat masking?
I ran RepeatMasker independently and generated a GFF3 file, which I used as input for MAKER.
"I would like to know if it is normal that the BUSCO evaluation of the proteins and transcripts output from the first round, where I only used EST and protein for annotation, is only around 60%. When I use the output from the first round to train de novo prediction tools such as Augustus and SNAP and run the second round of MAKER, the BUSCO evaluation of the proteins is still very poor. Why is that?"
Please do not paste screenshots of plain-text content; it is counterproductive. You can copy and paste the content directly here (using the code formatting option shown below), or use a GitHub Gist if the content exceeds the length allowed here.
I think 60% is way too low
What's the completeness of the genome alone?
BUSCO assesses the completeness of the genome as 96%.
Then clearly something is wrong with the MAKER annotation (this number indicates that BUSCO's built-in annotation process is outperforming MAKER).
I have not tried MAKER myself, but my colleagues have. I am not sure about the BUSCO score, but the number of predicted genes was around 50,000 in tilapia. We could not tell whether the problem was MAKER itself or the repeat-masking step (my colleague did not first build a custom repeat database with RepeatModeler), and since other pipeline tools are available, we have stopped using it. You can try:
Can you try without the RNA-seq data?
How did you run BUSCO, and which BUSCO database did you use? Which version of MAKER?
I used the BUSCO embryophyta_odb10 database and MAKER v3.01.03.
I think you should check the quality of your assembled genome again (completeness and fragmentation). The quality of the evidence you give MAKER also matters: poor evidence leads to missing gene predictions, which in turn lowers the BUSCO completeness score. Have you tried training SNAP with gene models that have AED < 0.5 (< 0.1 is quite strict)? I guess you are working with a plant genome? If so, I think you can train Augustus using the embryophyta_odb10 database in BUSCO.
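To make the AED cutoff suggestion concrete, here is a minimal Python sketch for picking training candidates from a MAKER GFF3 at a chosen threshold. It assumes the mRNA lines carry an `_AED=` attribute, as MAKER output normally does; the function name `select_mrna_by_aed` and the example records are made up for illustration and are not part of MAKER.

```python
import re

# Match the _AED attribute only (not _eAED), and the feature ID.
AED_PATTERN = re.compile(r"(?:^|;)_AED=([0-9.]+)")
ID_PATTERN = re.compile(r"(?:^|;)ID=([^;]+)")


def select_mrna_by_aed(gff_lines, max_aed=0.5):
    """Return IDs of mRNA features whose _AED is strictly below max_aed."""
    selected = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip GFF3 directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # only transcript-level features are considered here
        aed = AED_PATTERN.search(cols[8])
        feat_id = ID_PATTERN.search(cols[8])
        if aed and feat_id and float(aed.group(1)) < max_aed:
            selected.append(feat_id.group(1))
    return selected


# Hypothetical records, only to show the effect of the threshold.
example = [
    "##gff-version 3",
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=gene1-mRNA-1;Parent=gene1;_AED=0.05;_eAED=0.05",
    "chr1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=gene2-mRNA-1;Parent=gene2;_AED=0.32;_eAED=0.40",
    "chr1\tmaker\tmRNA\t5000\t5900\t.\t+\t.\tID=gene3-mRNA-1;Parent=gene3;_AED=0.80;_eAED=0.80",
]

strict = select_mrna_by_aed(example, max_aed=0.1)
relaxed = select_mrna_by_aed(example, max_aed=0.5)
print(len(strict), len(relaxed))
```

Comparing the two counts on the real first-round GFF3 would show how many gene models the 0.1 cutoff throws away before training.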
It sounds like you forgot to activate an option. Can you share your MAKER config file?
This is the MAKER config file for the first round, using transcript FASTA and protein data.
Below is the config file for the second round. Whether it is from the first or the second round, the protein FASTA I obtain has a very low BUSCO score, only around 60%. However, when I use other software (miniprot, TransDecoder) to predict genes from my transcript and protein data, the BUSCO completeness is much higher, around 95%.
With that configuration you cannot get more than you got from the purely evidence-based annotation, because the ab initio run reports a gene only if it is supported by evidence. Activate keep_preds to keep the pure ab initio gene predictions as well. Since your first run was better than the second, I advise merging the first and second annotation runs at the end (with the first as reference) using agat_sp_complement_annotation.pl from AGAT. You will probably get a higher BUSCO score than from the fixed second MAKER run alone.
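If you want to script the keep_preds change instead of editing the control file by hand, something like the following Python sketch could work. It assumes the usual `key=value #comment` layout of maker_opts.ctl; the helper `set_ctl_option` is hypothetical (not a MAKER tool), and the comment text in the example is a placeholder, not the real wording from the file.

```python
def set_ctl_option(ctl_text, key, value):
    """Set key=value in MAKER control-file text, keeping any trailing comment."""
    out = []
    found = False
    for line in ctl_text.splitlines():
        # Split off the trailing comment, then split the setting at '='.
        body, sep, comment = line.partition("#")
        name, eq, _old_value = body.partition("=")
        if eq and name.strip() == key:
            line = f"{key}={value} {sep}{comment}".rstrip()
            found = True
        out.append(line)
    if not found:
        raise KeyError(f"{key} not found in control file text")
    return "\n".join(out)


# Placeholder control-file excerpt; the comments are illustrative only.
ctl = "\n".join([
    "#-----MAKER Behavior Options",
    "keep_preds=0 #placeholder comment",
    "single_exon=0 #placeholder comment",
])

updated = set_ctl_option(ctl, "keep_preds", 1)
print(updated)
```

On a real run you would read maker_opts.ctl, pass its text through the function, and write the result back before starting the second round.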
Wouldn't it be worth trying to use the BUSCO predictions as the first step? I think BUSCO uses MetaEuk by default.