Is BUSCO really better than CEGMA for genome assembly quality evaluation?
6.9 years ago
shelkmike ★ 1.4k

BUSCO is the successor to CEGMA and is often described as superior. However, I doubt that this is so. CEGMA uses a set of ultra-conserved genes: the ones that are present in human, mouse, fruit fly, nematode, Arabidopsis and yeasts. BUSCO, in contrast, uses genes that are single-copy in at least 90% of species, so the BUSCO criterion for including a gene in a reference set is less strict.

So when I assemble a genome of some species and find 95% of the CEGMA genes, I can be fairly sure that approximately 95% of all genes of the species are assembled: a gene that is present in human, mouse, fruit fly, nematode, Arabidopsis and yeasts should be present in almost all eukaryotes, apart from some very exotic ones. When I find 95% of the BUSCO genes in my assembly, on the other hand, this doesn't really tell me how good the assembly is, because there is an ambiguity: either the genome of my species contains only 95% of the BUSCO genes and the assembly is perfect, or the genome contains 100% of the BUSCO genes and the assembly is not perfect.
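To make the ambiguity concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that a reference gene is detected only if it exists in the species' genome and was also captured by the assembly, and that the two are independent. The function name and the 95%/100% figures are hypothetical, not output from BUSCO or CEGMA.

```python
def observed_fraction(present_in_genome: float, assembled: float) -> float:
    """Fraction of reference genes found in an assembly.

    Illustrative assumption: a gene is detected only if it exists in the
    genome AND the assembly captured it, with the two treated as independent.
    """
    return present_in_genome * assembled


# Scenario A: the species lacks 5% of the reference genes, assembly is complete.
scenario_a = observed_fraction(present_in_genome=0.95, assembled=1.00)

# Scenario B: the species has every reference gene, assembly misses 5% of them.
scenario_b = observed_fraction(present_in_genome=1.00, assembled=0.95)

# Both scenarios report the same completeness score.
print(f"A: {scenario_a:.2f}  B: {scenario_b:.2f}")
```

Both scenarios give the same score, so the score alone cannot tell them apart.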

The question is: am I right that BUSCO is worse than CEGMA for estimation of assembly completeness?

Genome assembly BUSCO CEGMA • 5.5k views
6.9 years ago

tough question ;-)

I can only point you to this publication, which sheds some more light on this issue.

http://www.plantcell.org/content/28/8/1759

Long story short: neither of them is optimal ;-) and there is probably no optimal one (yet).


Thanks for this reference. Makes for a good read!

6.7 years ago
h.mon 35k

since there is an ambiguity: the genome of my species may contain 95% of the BUSCO genes and thus the assembly is perfect, or, alternatively, the genome may contain 100% of the BUSCO genes and then the assembly is not perfect.

Now this is a mind-bender; I really can't understand this conclusion.

I think BUSCO's main improvements over CEGMA are 1) the use of clade-specific gene sets, which allows a greater number of genes and thus greater precision in quality estimation; and 2) the use of an up-to-date database. Indeed, BUSCO implements ideas that the authors of CEGMA intended to implement but didn't because of a lack of funding:

One planned aspect of 'CEGMA v3' was to replace the reliance on the aging KOGs database. Another aspect of the new version of CEGMA would be to develop clade-specific sets of core genes.

And:

BUSCO seems to do everything that we wanted to include in CEGMA v3 and it is based on OrthoDB, a resource that has generated a new set of orthologs (developed by the same authors).


Thank you for your response. I'll try to rephrase it in simpler words:

1) CEGMA's protein set has the shortcoming of containing too few proteins (248, to be precise).

2) The shortcoming of BUSCO's sets is that they contain proteins that are single-copy in 90% of species, not 100%.

Why is it commonly assumed that the second shortcoming matters less than the first?
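To put a rough number on the first shortcoming, here is a small Python sketch. It treats each reference gene as an independent found/not-found trial (a simplifying assumption) and uses the binomial standard error. The 248 comes from CEGMA; the set size of 1000 and the 95% score are hypothetical and are not the size of any actual BUSCO lineage set.

```python
import math


def standard_error(p: float, n_genes: int) -> float:
    """Binomial standard error of an observed completeness fraction p
    estimated from n_genes reference genes (independence assumed)."""
    return math.sqrt(p * (1.0 - p) / n_genes)


CEGMA_GENES = 248        # size of the CEGMA set, as stated above
LARGER_SET_GENES = 1000  # hypothetical size of a larger reference set

observed = 0.95  # hypothetical completeness score

se_cegma = standard_error(observed, CEGMA_GENES)
se_larger = standard_error(observed, LARGER_SET_GENES)

# A larger reference set gives a tighter estimate of the same score.
print(f"SE with {CEGMA_GENES} genes: {se_cegma:.4f}")
print(f"SE with {LARGER_SET_GENES} genes: {se_larger:.4f}")
```

This only captures sampling noise. It says nothing about the bias that the second shortcoming could introduce.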


Have a look/read of the paper I posted above ;)

1) In the meantime we have learned that CEGMA's approach is far too restrictive.

2) Requiring genes to be single-copy in 100% of cases does not make much (biological) sense, as being single-copy is just a snapshot in time (I'm mainly talking from a plant perspective here), so requiring single copy in 100% of species would drop lots of informative 'genes'. Nonetheless, the BUSCO set already covers a much bigger range of protein sequences, which is likely why people prefer BUSCO over CEGMA.


Thank you, I have already read the article but didn't find a clear answer there. I hoped that some BioStars members might have a less ambiguous answer.
