The BUSCO assessment of the MAKER annotation results shows a low completeness
8 weeks ago
mut • 0

I performed the first round of MAKER annotation for my genome using RNA-seq and homologous proteins. The BUSCO assessment of the resulting proteins showed a completeness of only 64%. I then selected genes with AED less than 0.1 to train Augustus and SNAP, and performed a second round of MAKER annotation. However, the BUSCO assessment of the resulting proteins is still very low, only 53%. Why is this happening?

BUSCO MAKER • 1.5k views

What did you do before the MAKER annotation? Did you do repeat masking?

I ran RepeatMasker separately and generated a GFF3 file, which I used as input for MAKER.

I would like to know whether it is normal that the BUSCO completeness of the proteins and transcripts from the first round, where I used only EST and protein evidence for annotation, is only around 60%. When I use the first-round output to train the ab initio predictors Augustus and SNAP and run a second round of MAKER, the BUSCO completeness of the proteins is still very poor. Why is that?

Please do not paste screenshots of plain-text content; it is counterproductive. You can copy and paste the content directly here (using the code formatting option), or use a GitHub Gist if the content exceeds the length allowed here.

I think 60% is way too low

What's the completeness of the genome alone?

BUSCO assesses the completeness of the genome as 96%.

Then clearly something is wrong with the MAKER annotation, as this number indicates that BUSCO's built-in annotation process is outperforming MAKER.
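
To put the genome-mode and protein-mode results side by side, the one-line score that BUSCO writes in its short summary can be parsed. This is only a sketch: the function name is my own, and the two score strings in the usage lines are made-up placeholders in which only the C values follow the numbers quoted in this thread.

```python
import re

# BUSCO's one-line score notation, e.g. "C:96.0%[S:90.0%,D:6.0%],F:1.0%,M:3.0%,n:100"
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)

def parse_busco_summary(text):
    """Return the C/S/D/F/M percentages and n from a BUSCO summary text."""
    match = BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("no BUSCO score line found")
    scores = {key: float(val) for key, val in match.groupdict().items() if key != "n"}
    scores["n"] = int(match.group("n"))
    return scores

# Placeholder strings (NOT real results from this thread apart from the C values).
genome = parse_busco_summary("C:96.0%[S:90.0%,D:6.0%],F:1.0%,M:3.0%,n:100")
proteins = parse_busco_summary("C:64.0%[S:60.0%,D:4.0%],F:10.0%,M:26.0%,n:100")
print(f"completeness lost in the annotation: {genome['C'] - proteins['C']:.1f} points")
```

A large rise in F (fragmented) rather than M (missing) would point to truncated or split gene models rather than genes that were never predicted.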

I have not tried MAKER myself, but my colleagues did. I am not sure about the BUSCO score, but the number of predicted genes was around 50,000 in tilapia. We could not tell whether the problem was MAKER itself or the repeat masking (they did not first build a custom repeat library with RepeatModeler), and since other pipelines are available, we stopped using it. You can try:

  1. Give up on MAKER and use BRAKER or GINGER.
  2. Inspect the repeat-masking result, for example whether it masks too much.
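
One rough way to do the second check is to measure how much of the genome the repeat GFF3 covers. The sketch below assumes a standard 9-column GFF3 (such as the file passed to rm_gff) and that the sequence lengths are already known; the function names are mine and nothing here is part of MAKER or RepeatMasker.

```python
from collections import defaultdict

def masked_bases_per_sequence(gff_lines):
    """Sum the merged (non-overlapping) feature lengths per sequence ID."""
    intervals = defaultdict(list)
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue
        # GFF3 coordinates are 1-based and inclusive.
        intervals[cols[0]].append((int(cols[3]), int(cols[4])))

    covered = {}
    for seqid, spans in intervals.items():
        spans.sort()
        total = 0
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:  # overlapping or directly adjacent
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        covered[seqid] = total
    return covered

def masked_fraction(covered, seq_lengths):
    """Fraction of all bases covered by repeats, given {seqid: length}."""
    return sum(covered.values()) / sum(seq_lengths.values())
```

With a real file this would be called as `masked_bases_per_sequence(open("repeatmasker3.gff"))`. What counts as "too much" depends on the species, so the number is best compared with published repeat content for related genomes.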

Can you try without the RNA-seq data?

How did you run BUSCO, and which BUSCO database did you use? Which version of MAKER?

I used the BUSCO embryophyta_odb10 database and MAKER v3.01.03.

I think you should check the quality of your genome assembly again (completeness and fragmentation). The quality of the evidence you give MAKER also matters: poor evidence leads to missing gene models, which lowers the BUSCO completeness score. Have you tried training SNAP on gene models with AED < 0.5 (< 0.1 is quite strict)? I guess you are working with a plant genome? I also think you can train Augustus using BUSCO with the embryophyta_odb10 database.
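
To see how much training data each cutoff leaves, the mRNA features in the MAKER GFF3 can be filtered on their `_AED` attribute. This is a minimal sketch, assuming the usual MAKER output where mRNA lines have `maker` in column 2 and carry `ID=` and `_AED=` attributes; the function name and the default cutoff are my own choices.

```python
import re

AED_RE = re.compile(r"(?:^|;)_AED=([0-9.]+)")  # does not match _eAED
ID_RE = re.compile(r"(?:^|;)ID=([^;]+)")

def mrna_ids_below_aed(gff_lines, max_aed=0.5):
    """Return the IDs of MAKER mRNA features whose _AED is below max_aed."""
    keep = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[1] != "maker" or cols[2] != "mRNA":
            continue
        aed = AED_RE.search(cols[8])
        ident = ID_RE.search(cols[8])
        if aed and ident and float(aed.group(1)) < max_aed:
            keep.append(ident.group(1))
    return keep
```

Calling it once with `max_aed=0.1` and once with `max_aed=0.5` on the round-one GFF3 shows how many models each cutoff would give SNAP and Augustus for training.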

It sounds like you forgot to activate an option. Can you share your MAKER config file?

This is the MAKER config file for the first round, using transcript FASTA and protein data.

#-----Genome (these are always required) 
genome=my_genome.fa  
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3 
maker_gff= #MAKER derived GFF3 file 
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no 
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no 
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no 
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no 
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no 
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no 
other_pass=0 #passthrough anything else in maker_gff: 1 = yes, 0 = no 

#-----EST Evidence (for best results provide a file for at least one) 
est=transcripts.fasta  
altest= #EST/cDNA sequence file in fasta format from an alternate organism 
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file 
altest_gff= #aligned ESTs from a closely related species in GFF3 format 

#-----Protein Homology Evidence (for best results provide a file for at least one) 
protein=all.pep.fa   
protein_gff=  #aligned protein homology evidence from an external GFF3 file 

#-----Repeat Masking (leave values blank to skip repeat masking)  
model_org= #select a model organism for DFam masking in RepeatMasker 
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker 
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner 
rm_gff=repeatmasker3.gff  
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no 
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

#-----Gene Prediction 
snaphmm= #SNAP HMM file  
gmhmm= #GeneMark HMM file  
augustus_species= #Augustus gene prediction species model  
fgenesh_par_file= #FGENESH parameter file 
pred_gff= #ab-initio predictions from an external GFF3 file 
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through) 
run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no 
est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no 
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no 
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no 
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs 
snoscan_meth= #-O-methylation site file to have Snoscan find snoRNAs 
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no 
allow_overlap= #allowed gene overlap fraction (value from 0 to 1, blank for default)

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options 
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases 
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI) 

#-----MAKER Behavior Options 
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage) 
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors 
pred_stats=0 #report AED and QI statistics for all predictions as well as models 
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1) 
min_protein=0 #require at least this many amino acids in predicted proteins 
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no 
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no 
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no 
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)

split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
min_intron=20 #minimum intron length (used for alignment polishing)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon' is enabled
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes

tries=2 #number of times to try a contig if there is a failure for some reason 
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no 
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files

Below is the config file for the second round. Whether it comes from the first or the second round, the resulting protein FASTA has a very low BUSCO score, only around 60%. However, when I use other software (miniprot, TransDecoder) to predict genes from my transcript and protein data, the BUSCO completeness is much higher, around 95%.

#-----Genome (these are always required)
genome=my_genome.fa
organism_type=eukaryotic  # eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff=  # MAKER derived GFF3 file
est_pass=0  # use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0  # use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0  # use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0  # use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0  # use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0  # use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0  # passthrough anything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est=  # set of ESTs or assembled mRNA-seq in fasta format
altest=  # EST/cDNA sequence file in fasta format from an alternate organism
est_gff=round1_est.gff  
altest_gff=  # aligned ESTs from a closely related species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=  # protein sequence file in fasta format (i.e., from multiple organisms)
protein_gff=round1_protein.gff  

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=  # select a model organism for DFam masking in RepeatMasker
rmlib=  # provide an organism-specific repeat library in fasta format for RepeatMasker
repeat_protein=  # provide a fasta file of transposable element proteins for RepeatRunner

rm_gff=round1_repeat.gff  

prok_rm=0  # forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1  # use soft-masking rather than hard-masking in BLAST (i.e., seg and dust filtering)

#-----Gene Prediction
snaphmm=  # SNAP HMM file
gmhmm=  # GeneMark HMM file
augustus_species=BUSCO_result_train  # Augustus gene prediction species model from BUSCO
fgenesh_par_file=  # FGENESH parameter file
pred_gff=  # ab-initio predictions from an external GFF3 file
model_gff=  # annotated gene models from an external GFF3 file (annotation pass-through)
run_evm=0  # run EvidenceModeler, 1 = yes, 0 = no

est2genome=0  # infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0  # infer predictions from protein homology, 1 = yes, 0 = no

trna=0  # find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna=  # rRNA file to have Snoscan find snoRNAs
snoscan_meth=  # -O-methylation site file to have Snoscan find snoRNAs
unmask=0  # also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
allow_overlap=  # allowed gene overlap fraction (value from 0 to 1, blank for default)

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff=  # extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options
alt_peptide=C  # amino acid used to replace non-standard amino acids in BLAST databases
cpus=1  # max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000  # length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1  # skip genome contigs below this length (under 10kb are often useless)

pred_flank=200  # flank for extending evidence clusters sent to gene predictors
pred_stats=0  # report AED and QI statistics for all predictions as well as models
AED_threshold=1  # Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0  # require at least this many amino acids in predicted proteins
alt_splice=0  # Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0  # extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0  # map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0  # Concordance threshold to add unsupported gene prediction (bound by 0 and 1)

split_hit=10000  # length for the splitting of hits (expected max intron size for evidence alignments)
min_intron=20  # minimum intron length (used for alignment polishing)
single_exon=0  # consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250  # min length required for single exon ESTs if 'single_exon' is enabled
correct_est_fusion=0  # limits use of ESTs in annotation to avoid fusion genes

tries=2  # number of times to try a contig if there is a failure for some reason
clean_try=0  # remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0  # removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP=  # specify a directory other than the system default temporary directory for temporary files

With this configuration you cannot get more than what you got from the purely evidence-based annotation, because the ab initio run reports a gene only if it is supported by evidence. Activate keep_preds to also keep the purely ab initio gene predictions. Since your first run was better than the second, I advise you to finish by merging the first and second annotation runs (with the first as the reference) using agat_sp_complement_annotation.pl from AGAT. You will probably get a higher BUSCO score than with the fixed second MAKER run alone.
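
As an illustration of the keep_preds change, here is a small helper that rewrites one `key=value` line of a control file and leaves the trailing comment alone. It is a sketch only: the helper name and the file name are hypothetical, and the value 1 is my assumption based on the option's own comment ("bound by 0 and 1").

```python
import re

def set_ctl_option(ctl_text, key, value):
    """Return ctl_text with the `key=` line set to value, keeping its #comment."""
    pattern = re.compile(
        rf"^({re.escape(key)}=)[^#\n]*?([ \t]*(?:#.*)?)$", re.MULTILINE
    )
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{value}{m.group(2)}", ctl_text
    )
    if count != 1:
        raise KeyError(f"expected exactly one '{key}=' line, found {count}")
    return new_text

# Example on a two-line excerpt of the second-round config shown above.
excerpt = (
    "map_forward=0  # map names and attributes forward from old GFF3 genes\n"
    "keep_preds=0  # Concordance threshold to add unsupported gene prediction\n"
)
print(set_ctl_option(excerpt, "keep_preds", "1"))
```

For a real run the same call would be applied to the full text of the control file (for example a hypothetical `round2_opts.ctl`) before starting MAKER again.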

Wouldn't it be worth trying to use the BUSCO predictions as the first step? I think BUSCO uses MetaEuk by default.
