Question

BUSCO results vs. Orthofinder

2.6 years ago

alslonik ▴ 320

So, this is a conceptual question about comparing the results of two programs, BUSCO and OrthoFinder.

I recently compared the BUSCO output for one of the plant genomes I am working on with an OrthoFinder run that classified 22 plant proteomes, including the one I analyzed with BUSCO. (In BUSCO 3 notation the genome scores C:96.6%[S:95.3%,D:1.3%],F:1.5%,M:1.9%,n:1375; BUSCO 3 was the version I used when I checked the completeness of this genome.)

Now I am trying to explain what confuses me:

First of all, BUSCO has 1375 gene models for Arabidopsis that are supposed to be single-copy genes present in all plant genomes. How can this be if my OrthoFinder result gives me only 30 orthogroups that are all "1111" (each species has exactly one copy of the gene in the group) in only 20 plant proteomes?
If I take the amino acid sequences from the BUSCO output and compare them (BLAST or DIAMOND) to the 20 proteomes, I get 1670 significantly similar genes from all 20 proteomes, with 60% or more similarity. These genes fall into 1009 orthogroups in the OrthoFinder output. Still, only ~560 (of 1670) appear in orthogroups in which all 20 species are present, and only 850 (of 1670) are singletons in their orthogroup. How come?
Was it wrong to compare the BUSCO output protein sequences to the other proteins and expect that most of them would fall in orthogroups with all species present, and that most would be singletons? Why? Or does this show that my OrthoFinder result is not correct? Am I misinterpreting any of the results? What am I missing?

Thanks

Busco Orthofinder Orthologs • 1.4k views

updated 2.6 years ago by lieven.sterck 15k • written 2.6 years ago by alslonik ▴ 320

score 3 · Accepted Answer · 2022-05-12

You should read up on the methods that BUSCO uses.

One point, for instance: they say single copy in X species, but in practice the criterion is much more relaxed. If I remember well, it is something like single copy in 80% of species (the same goes for presence: the gene has to be present in most, but not all, species). As for single copy, if you are very strict there are indeed only a few hundred genes that are actually single copy in all species considered.
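To see how much the strict and relaxed criteria differ on your own data, you can count both from the OrthoFinder gene count table. Below is a minimal sketch, assuming the table is tab-separated with an `Orthogroup` column, one column per species and a `Total` column (as in `Orthogroups.GeneCount.tsv`). The function names, the toy table and the 0.8 cutoff are only illustrative; 0.8 is the value recalled above, not a confirmed BUSCO parameter.

```python
import csv
import io


def read_gene_counts(handle):
    """Read a gene count table; return (rows, species column names)."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Assumption: every column except 'Orthogroup' and 'Total' is a species.
    species = [c for c in reader.fieldnames if c not in ("Orthogroup", "Total")]
    return list(reader), species


def count_single_copy(rows, species, min_fraction=1.0):
    """Count orthogroups with exactly one gene in >= min_fraction of species."""
    n_groups = 0
    for row in rows:
        n_single = sum(1 for sp in species if int(row[sp]) == 1)
        if n_single / len(species) >= min_fraction:
            n_groups += 1
    return n_groups


# Hypothetical toy table: sp5 plays the role of a recently duplicated genome.
TOY = (
    "Orthogroup\tsp1\tsp2\tsp3\tsp4\tsp5\tTotal\n"
    "OG1\t1\t1\t1\t1\t1\t5\n"  # single copy everywhere
    "OG2\t1\t1\t1\t1\t2\t6\n"  # duplicated in one species
    "OG3\t1\t1\t1\t1\t0\t4\n"  # missing in one species
    "OG4\t2\t2\t1\t1\t1\t7\n"  # duplicated in two species
)

rows, species = read_gene_counts(io.StringIO(TOY))
strict = count_single_copy(rows, species, min_fraction=1.0)
relaxed = count_single_copy(rows, species, min_fraction=0.8)
print(f"strict: {strict}, relaxed: {relaxed}")
```

On a real run you would pass `open("Orthogroups.GeneCount.tsv")` instead of the toy string. With 20 or more species, a single duplicated or missing gene in any one of them is enough to drop an orthogroup from the strict count, which is why that count ends up so much smaller.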

While confusing, there is something to be said for this: many plant species have undergone very recent genome duplications, which would drastically reduce the number of single-copy genes in the BUSCO analysis (poplar, for instance, has nearly all its genes duplicated, and there are even worse examples).

If you are looking for an analysis that kind of combines BUSCO with gene family analysis, you can have a look at this publication: https://academic.oup.com/plcell/article/28/8/1759/6100897