Question

BUSCO analysis failed with "Duplicate of sequence" error in the input genome FASTA. How can I solve this problem?

7 months ago

Sony ▴ 10

Hello everyone,

I am currently working on the rice genome. I plan to use chromosome-level genome assemblies of two indica varieties, 93-11 and IR64, as reference genomes. After downloading the genome assembly FASTA files from the ENA database, I tried to run BUSCO to check the completeness of these reference genomes and choose the better one. However, BUSCO failed with the following error:

(busco) sony@hpz6:/opt/data/sony/thesis/dataset/93-11$ busco -i 93-11.fasta -l embryophyta_odb10 -o busco_output --mode genome
2024-04-04 23:37:47 INFO:       ***** Start a BUSCO v5.7.1 analysis, current time: 04/04/2024 23:37:47 *****
2024-04-04 23:37:47 INFO:       Configuring BUSCO with local environment
2024-04-04 23:37:47 INFO:       Running genome mode
2024-04-04 23:37:47 INFO:       Downloading information on latest versions of BUSCO data...
2024-04-04 23:37:56 INFO:       Input file is /opt/data/sony/thesis/dataset/93-11/93-11.fasta
2024-04-04 23:37:59 ERROR:      Duplicate of sequence >ENA|CM012053|CM012053.1 in input file
2024-04-04 23:37:59 ERROR:      BUSCO analysis failed!
2024-04-04 23:37:59 ERROR:      Check the logs, read the user guide (https://busco.ezlab.org/busco_userguide.html), and check the BUSCO issue board on https://gitlab.com/ezlab/busco/issues

Both the 93-11 and IR64 reference genomes are assembled at chromosome level. What should I do in this case? My goal is to construct a pangenome for indica rice. Thank you all for your suggestions.

duplicate BUSCO. • 629 views

ADD COMMENT • link updated 7 months ago by GenoMax 147k • written 7 months ago by Sony ▴ 10

Perhaps some character in the header (likely the |) is upsetting BUSCO.

ADD REPLY • link 7 months ago by andres.firrincieli 3.8k

Thank you for your suggestion. However, I had earlier run BUSCO successfully on the Nipponbare reference genome (japonica rice), even though the Nipponbare FASTA file has the same header structure:

>ENA|CP132235|CP132235.1 Oryza sativa Japonica Group cultivar Nipponbare isolate AGIS-1.0 chromosome 1.
CTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA

busco -i Nipponbare.fasta -l embryophyta_odb10 -o busco_output --mode genome
    -------------------------------------------------------------------------------------------
    |Results from dataset embryophyta_odb10                                                    |
    -------------------------------------------------------------------------------------------
    |C:99.5%[S:97.1%,D:2.4%],F:0.4%,M:0.1%,n:1614,E:18.1%                                      |
    |1607    Complete BUSCOs (C)    (of which 291 contain internal stop codons)                |
    |1568    Complete and single-copy BUSCOs (S)                                               |
    |39    Complete and duplicated BUSCOs (D)                                                  |
    |6    Fragmented BUSCOs (F)                                                                |
    |1    Missing BUSCOs (M)                                                                   |
    |1614    Total BUSCO groups searched                                                       |
    -------------------------------------------------------------------------------------------

ADD REPLY • link 7 months ago by Sony ▴ 10

Does that mean you actually have more than one instance of the sequence >ENA|CM012053|CM012053.1 in your file?
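
A quick way to check this for the whole file, not just one header, is to list every sequence ID that appears more than once. This is only a sketch: `list_duplicate_ids` is a made-up helper name, and it assumes the ID is the first whitespace-delimited token of each header line, which may not be exactly how BUSCO compares headers.

```shell
# Hypothetical helper: print each sequence ID that occurs more than once.
# Assumption: the ID is the first whitespace-delimited token after ">".
list_duplicate_ids() {
    awk '/^>/ { print substr($1, 2) }' "$1" | sort | uniq -d
}

# Example with the file name from this thread:
# list_duplicate_ids 93-11.fasta
```

No output would mean that no ID is repeated under that assumption.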

ADD REPLY • link 7 months ago by GenoMax 147k

I think there is a duplicate in the genome FASTA file that I downloaded from the ENA database.

 (base) sony@hpz6:/opt/data/sony/thesis/dataset/93-11$ grep ">ENA|CM012053|CM012053.1" 93-11.fasta
    >ENA|CM012053|CM012053.1 Oryza sativa cultivar 93-11 chromosome 1, whole genome shotgun sequence.
    >ENA|CM012053|CM012053.1 Oryza sativa cultivar 93-11 chromosome 1, whole genome shotgun sequence.

But I am not sure whether removing the duplicate sequence above would cause any problems for downstream analysis. I thought that, since this genome was assembled at chromosome level, it should not contain duplicate sequences.

ADD REPLY • link 7 months ago by Sony ▴ 10

Find out why there are duplicates. If you were to just rename the sequence header, BUSCO would overestimate the gene content (and call those genes duplicated).
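
As a first step in finding out, it may help to see whether the repeated records carry the same sequence. The sketch below prints, for each occurrence of a given ID, its occurrence number, sequence length, and a CRC checksum (via POSIX `cksum`). `record_checksums` is a hypothetical helper name. It assumes the ID is the first token of the header, and it uppercases the sequence and ignores line wrapping, so soft-masking differences are deliberately not counted as differences.

```shell
# Hypothetical helper: for every record whose ID equals $2, print
# "occurrence length checksum" of its sequence (newlines removed, uppercased).
record_checksums() {
    fasta=$1
    id=$2
    # Count how many records carry this ID.
    n=$(awk -v id="$id" '/^>/ && substr($1, 2) == id { c++ } END { print c + 0 }' "$fasta")
    i=1
    while [ "$i" -le "$n" ]; do
        # Stream only the i-th matching record's sequence into cksum.
        awk -v id="$id" -v want="$i" '
            /^>/ { keep = 0; if (substr($1, 2) == id) { c++; keep = (c == want) }; next }
            keep { printf "%s", toupper($0) }
        ' "$fasta" | cksum | awk -v i="$i" '{ print i, $2, $1 }'
        i=$((i + 1))
    done
}

# Example with the ID from this thread:
# record_checksums 93-11.fasta 'ENA|CM012053|CM012053.1'
```

Identical length and checksum on every line suggest the record was simply written twice. Different values would mean two distinct sequences share one header, which is a different problem and worth raising with the data provider.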

If the duplication of the sequence is unwarranted, then remove the duplicate sequence (and let the ENA help desk know).
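
If the repeated records do turn out to be exact copies, one possible way to drop them is to keep only the first record for each ID. This is a minimal sketch, not a vetted tool: `dedup_fasta_by_id` and the output file name in the example are made up for illustration, and the ID is again assumed to be the first whitespace-delimited token of the header.

```shell
# Hypothetical helper: write the FASTA to stdout, keeping only the first
# record seen for each sequence ID and skipping later records with that ID.
dedup_fasta_by_id() {
    awk '
        /^>/ { id = substr($1, 2); keep = !(id in seen); seen[id] = 1 }
        keep { print }
    ' "$1"
}

# Example (output name is an assumption):
# dedup_fasta_by_id 93-11.fasta > 93-11.dedup.fasta
```

Note that this drops later records by ID alone, without comparing their sequences, so it is only appropriate once the duplicates are known to be identical.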

ADD REPLY • link 7 months ago by GenoMax 147k