7 months ago
Sony
Hello everyone,
I am currently working on the rice genome. I plan to use chromosome-level genome assemblies of two indica varieties, 93-11 and IR64, as reference genomes. After downloading the assembly FASTA files from the ENA database, I tried to run BUSCO to check the completeness of these reference genomes and choose the better one. However, BUSCO failed with the following error:
(busco) sony@hpz6:/opt/data/sony/thesis/dataset/93-11$ busco -i 93-11.fasta -l embryophyta_odb10 -o busco_output --mode genome
2024-04-04 23:37:47 INFO: ***** Start a BUSCO v5.7.1 analysis, current time: 04/04/2024 23:37:47 *****
2024-04-04 23:37:47 INFO: Configuring BUSCO with local environment
2024-04-04 23:37:47 INFO: Running genome mode
2024-04-04 23:37:47 INFO: Downloading information on latest versions of BUSCO data...
2024-04-04 23:37:56 INFO: Input file is /opt/data/sony/thesis/dataset/93-11/93-11.fasta
2024-04-04 23:37:59 ERROR: Duplicate of sequence >ENA|CM012053|CM012053.1 in input file
2024-04-04 23:37:59 ERROR: BUSCO analysis failed!
2024-04-04 23:37:59 ERROR: Check the logs, read the user guide (https://busco.ezlab.org/busco_userguide.html), and check the BUSCO issue board on https://gitlab.com/ezlab/busco/issues
Both the 93-11 and IR64 reference genomes are assembled at chromosome level. What should I do in this case? My goal is to construct a pangenome for indica rice. Thank you all for your suggestions.
Perhaps there is some character in the header (likely the | ) that annoys BUSCO.

Thank you for your suggestion. But earlier I ran BUSCO on the Nipponbare reference genome (japonica rice), and it completed successfully, although the FASTA file of Nipponbare has this structure:
Does it mean that you actually have more than one instance of >ENA|CM012053|CM012053.1 in your file?

I think there are duplicates in the genome FASTA files that I downloaded from the ENA database.
But I am not sure: if I remove the duplicate sequence above, will it cause any problems for downstream analysis? I thought these genomes were assembled at the chromosome level, so they should not contain duplicate sequences.
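One way to confirm whether the header really occurs more than once is to count sequence IDs in the FASTA file. The following is a minimal sketch, not BUSCO's own check: the function name find_duplicate_headers is hypothetical, and it assumes the sequence ID is the first whitespace-delimited token of each header line.

```python
from collections import Counter


def find_duplicate_headers(fasta_path):
    """Return {sequence_id: count} for every ID that appears more than once."""
    counts = Counter()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                # Assumption: the ID is the first whitespace-delimited token
                # of the header, e.g. "ENA|CM012053|CM012053.1".
                counts[fields[0] if fields else ""] += 1
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Example call (file name taken from the command in the question):
# print(find_duplicate_headers("93-11.fasta"))
```

An empty result would suggest the header is unique and the problem lies elsewhere; a non-empty result lists each repeated ID with its number of occurrences.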
Find out why there are duplicates. If you just rename the sequence header, BUSCO will overestimate content (and call those genes duplicated).
If the duplication of the sequence is unwarranted, then remove the duplicate sequence (and let the ENA help desk know).
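To help find out why there are duplicates, it can be useful to check whether the repeated records carry the same sequence before removing anything. The sketch below is one possible approach under stated assumptions: the helper names read_fasta and dedupe_fasta are hypothetical, the ID is taken as the first whitespace-delimited token of the header, and each record is held in memory while it is processed. It keeps the first record per ID and reports whether later copies were identical or different.

```python
import hashlib


def read_fasta(path):
    """Yield (header_line, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def dedupe_fasta(in_path, out_path, width=60):
    """Write the first record per ID and report how repeated IDs compare.

    Returns {seq_id: "identical" | "different"} for every repeated ID.
    """
    seen = {}    # seq_id -> SHA-256 digest of the first copy's sequence
    report = {}
    with open(out_path, "w") as out:
        for header, seq in read_fasta(in_path):
            fields = header[1:].split()
            seq_id = fields[0] if fields else ""
            # Compare case-insensitively so soft-masking does not matter.
            digest = hashlib.sha256(seq.upper().encode()).hexdigest()
            if seq_id in seen:
                if digest != seen[seq_id]:
                    report[seq_id] = "different"
                else:
                    report.setdefault(seq_id, "identical")
                continue  # later copies are not written
            seen[seq_id] = digest
            out.write(header + "\n")
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")
    return report
```

If an ID is reported as "different", the later copy is not a plain duplicate, and dropping it would discard sequence; in that case the output file should not be used until the cause is understood.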