Hi everyone,
I am doing gene prediction and annotation for non-reference sequences, following the MAKER annotation pipeline.
However, when I train Augustus with BUSCO 5.7.1 (installed via conda), I encounter this error:
busco -i /opt/data/sony/thesis/pan_novoseq_maker/round1/novo_pan_seq_rnd1.maker.output/snap1/non-ref_rnd1.all.maker.transcripts1000.fasta -o non-ref_rnd1_maker -l embryophyta_odb10 -m genome -c 8 -f --long --augustus --augustus_species rice --augustus_parameters='--progress=true'
2024-07-09 15:27:09 INFO: [augustus] 13 of 64 task(s) completed
2024-07-09 15:27:10 INFO: [augustus] 20 of 64 task(s) completed
2024-07-09 15:27:11 INFO: [augustus] 26 of 64 task(s) completed
2024-07-09 15:27:12 INFO: [augustus] 33 of 64 task(s) completed
2024-07-09 15:27:14 INFO: [augustus] 39 of 64 task(s) completed
2024-07-09 15:27:15 INFO: [augustus] 45 of 64 task(s) completed
2024-07-09 15:27:16 INFO: [augustus] 52 of 64 task(s) completed
2024-07-09 15:27:18 INFO: [augustus] 58 of 64 task(s) completed
2024-07-09 15:27:22 INFO: [augustus] 64 of 64 task(s) completed
2024-07-09 15:27:22 INFO: Extracting predicted proteins...
2024-07-09 15:27:22 INFO: ***** Run HMMER on gene sequences *****
2024-07-09 15:27:22 INFO: Running 61 job(s) on hmmsearch, starting at 07/09/2024 15:27:22
2024-07-09 15:27:23 INFO: [hmmsearch] 7 of 61 task(s) completed
2024-07-09 15:27:23 INFO: [hmmsearch] 13 of 61 task(s) completed
2024-07-09 15:27:23 INFO: [hmmsearch] 19 of 61 task(s) completed
2024-07-09 15:27:23 INFO: [hmmsearch] 25 of 61 task(s) completed
2024-07-09 15:27:23 INFO: [hmmsearch] 31 of 61 task(s) completed
2024-07-09 15:27:23 INFO: [hmmsearch] 37 of 61 task(s) completed
2024-07-09 15:27:23 INFO: [hmmsearch] 43 of 61 task(s) completed
2024-07-09 15:27:23 INFO: [hmmsearch] 49 of 61 task(s) completed
2024-07-09 15:27:23 INFO: [hmmsearch] 55 of 61 task(s) completed
2024-07-09 15:27:23 INFO: [hmmsearch] 61 of 61 task(s) completed
2024-07-09 15:27:23 WARNING: BUSCO did not find any match. Make sure to check the log files if this is unexpected.
2024-07-09 15:27:23 INFO: Starting second step of analysis. The gene predictor Augustus is retrained using the results from the initial run to yield more accurate results.
2024-07-09 15:27:23 INFO: Extracting missing and fragmented buscos from the file ancestral_variants...
2024-07-09 15:27:25 INFO: Running a BLAST search for BUSCOs against created database
2024-07-09 15:27:25 INFO: Running 1 job(s) on tblastn, starting at 07/09/2024 15:27:25
2024-07-09 15:27:33 INFO: [tblastn] 1 of 1 task(s) completed
2024-07-09 15:27:33 INFO: Converting predicted genes to short genbank files
2024-07-09 15:27:33 WARNING: No jobs to run on gff2gbSmallDNA.pl
2024-07-09 15:27:33 INFO: All files converted to short genbank files, now training Augustus using Single-Copy Complete BUSCOs
2024-07-09 15:27:33 INFO: Running 1 job(s) on new_species.pl, starting at 07/09/2024 15:27:33
2024-07-09 15:27:33 INFO: [new_species.pl] 1 of 1 task(s) completed
2024-07-09 15:27:33 INFO: Running 1 job(s) on etraining, starting at 07/09/2024 15:27:33
2024-07-09 15:27:34 INFO: [etraining] 1 of 1 task(s) completed
2024-07-09 15:27:34 ERROR: Retraining did not complete correctly. Check your Augustus config path environment variable.
2024-07-09 15:27:34 ERROR: BUSCO analysis failed!
2024-07-09 15:27:34 ERROR: Check the logs, read the user guide (https://busco.ezlab.org/busco_userguide.html), and check the BUSCO issue board on https://gitlab.com/ezlab/busco/issues
I trained Augustus the same way for another rice accession and it worked well. However, I get the error above when I train Augustus with BUSCO on the non-reference sequences (the non-reference sequences are the sequences assembled from the reads that did not map to the reference genome).
Has anyone experienced this, and how can I troubleshoot this error? Thank you, everyone.
Have you considered the possibility that those reads are not really part of the genome and you are trying to stuff them in? Do those reads BLAST to something that makes sense? At this point there should be enough rice genomes available that big chunks of the genome are unlikely to be missing from the databases.
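One rough way to check this, assuming you have already run BLAST on the assembled sequences with tabular output (-outfmt 6, where column 1 is the query ID, column 3 the percent identity and column 11 the e-value), is to count which sequences have at least one convincing hit. The sketch below is only an illustration: the contig names, the subject names and the two thresholds are made up for the example, and you would need to pick cutoffs that suit your data.

```python
import csv
import io


def contigs_with_hits(blast_tsv, max_evalue=1e-10, min_identity=90.0):
    """Return the query IDs that have at least one hit passing both thresholds.

    blast_tsv is any iterable of lines in BLAST tabular format (-outfmt 6).
    The default thresholds are arbitrary assumptions, not recommendations.
    """
    hits = set()
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid = row[0]
        pident = float(row[2])   # column 3: percent identity
        evalue = float(row[10])  # column 11: e-value
        if evalue <= max_evalue and pident >= min_identity:
            hits.add(qseqid)
    return hits


# Hypothetical example data: two BLAST rows for three assembled contigs.
example = io.StringIO(
    "contig_1\tsubjA\t98.5\t1200\t18\t0\t1\t1200\t501\t1700\t0.0\t2100\n"
    "contig_2\tsubjB\t72.0\t150\t42\t0\t10\t159\t1\t150\t1e-03\t55.0\n"
)
all_contigs = {"contig_1", "contig_2", "contig_3"}

with_hits = contigs_with_hits(example)
without_hits = all_contigs - with_hits

print("contigs with a convincing hit:", sorted(with_hits))
print("contigs without one:", sorted(without_hits))
```

If most of the sequences end up in the second group, or their best hits are to something other than plants, that would be worth sorting out before trying to train a gene predictor on them.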
I already checked for and removed contaminant sequences from those non-reference sequences using the NCBI FCS-GX tool.
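Going back to the log in the original post: before the final error there are two warnings, "BUSCO did not find any match" and "No jobs to run on gff2gbSmallDNA.pl". One possible reading, and this is only a guess from the log, is that the retraining step had no complete BUSCO genes to train on, so the failure may not be about the Augustus config path at all. A minimal sketch for checking a saved log for this pattern is below; the marker strings are copied from the log above, while the function names and the interpretation are my own assumptions.

```python
# Marker strings copied verbatim from the BUSCO log in the original post.
MARKERS = {
    "no_busco_match": "BUSCO did not find any match",
    "no_training_genes": "No jobs to run on gff2gbSmallDNA.pl",
    "retraining_failed": "Retraining did not complete correctly",
}


def scan_busco_log(lines):
    """Report which of the marker messages appear anywhere in the log lines."""
    found = {key: False for key in MARKERS}
    for line in lines:
        for key, text in MARKERS.items():
            if text in line:
                found[key] = True
    return found


def likely_empty_training_set(found):
    """Heuristic (an assumption, not BUSCO's own logic): retraining failed
    after BUSCO reported no matches or nothing to convert for training."""
    return found["retraining_failed"] and (
        found["no_busco_match"] or found["no_training_genes"]
    )


# Three lines taken from the log in the original post.
log_excerpt = [
    "2024-07-09 15:27:23 WARNING: BUSCO did not find any match. Make sure to check the log files if this is unexpected.",
    "2024-07-09 15:27:33 WARNING: No jobs to run on gff2gbSmallDNA.pl",
    "2024-07-09 15:27:34 ERROR: Retraining did not complete correctly. Check your Augustus config path environment variable.",
]

found = scan_busco_log(log_excerpt)
print(found)
print("possibly no genes to train on:", likely_empty_training_set(found))
```

If the heuristic fires on your full log, it may be worth checking whether these sequences contain any complete embryophyta_odb10 genes at all before looking at the Augustus installation.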