Question

Getting different results with BUSCO for different numbers of threads

3

Entering edit mode

8.2 years ago

Chris Cole ▴ 800

I've been recommended to use BUSCO in favour of CEGMA for the assessment of some de novo transcriptome assemblies, but I'm seeing inconsistent results with repeated runs of BUSCO and different numbers of threads (1-6). Despite rerunning with the same parameters I'm getting results which vary by upto 2 BUSCOs. I can't see anywhere suggesting that BUSCO is non-deterministic so am wondering if this is a bug.

I'm using the latest version of BUSCO v1.22, hmmer v3.1b2, python v3.4.3 and blast 2.2.29+ against the eukaryotic dataset in transcriptome mode: python3 BUSCO_v1.22.py -o test -in transcripts.fasta -l /db/busco/eukaryota -m trans -c 2

Grepping for the complete BUSCOs I see this:

./run_test_pe1/short_summary_test_pe1: 302 Complete BUSCOs

./run_test_pe2/short_summary_test_pe2: 300 Complete BUSCOs

./run_test_pe3/short_summary_test_pe3: 302 Complete BUSCOs

./run_test_pe4/short_summary_test_pe4: 301 Complete BUSCOs

./run_test_pe5/short_summary_test_pe5: 301 Complete BUSCOs

./run_test_pe6/short_summary_test_pe6: 302 Complete BUSCOs

Anyone else seen this? Any comments appreciated.

busco inconsistency bug • 3.3k views

ADD COMMENT • link 8.2 years ago by Chris Cole ▴ 800

1

Entering edit mode

Don't have an answer but a wild speculation. The BUSCO python script seems to call hmmsearch which as a --seed option which is not set by the python script, which might cause some randomness in results. Second might be some sort of random sampling for calculations (evalue may be). The results should not be drastically different though. The seed option could be set by editing the python script to check if results become consistent.

ADD REPLY • link 8.2 years ago by microfuge ★ 1.9k

0

Entering edit mode

Ah, thanks. Will have a poke at that.

ADD REPLY • link 8.2 years ago by Chris Cole ▴ 800

0

Entering edit mode

I added a --seed 12345 option to all the hmmsearch commands I found in the script but still got inconsistent results. Not really a surprise as the hmmsearch default is to set a fixed seed of '42'.

I did try '--seed 0' which does generate a random seed everytime, but that didn't change anything either.

Thanks for the suggestion anyway.

ADD REPLY • link 8.2 years ago by Chris Cole ▴ 800

score 6 · Accepted Answer · 2016-09-20

Fixed it!

There's a bug in the code where BUSCO reads in the filenames in 'translated_proteins/' with os.listdir(), which returns them in an arbitrary order. Unfortunately, this order matters for BUSCO to define whether a HMMer hit is complete, duplicated or fragmented - I've not fully pinned down why.

Sorting the list of filenames fixes the inconsistency, and shouldn't have any downstream side-effects. Although, I can't be sure.

It's a trivial change and here's the patch for anyone who is interested. I've also sent it to the author.

--- BUSCO_v1.22.py      2016-07-01 16:10:50.000000000 +0100
+++ BUSCO_v1.22c.py     2016-09-20 15:43:53.850117074 +0100
@@ -833,6 +833,7 @@
 if mode == 'trans' or mode == 'transcriptome':
   print('*** Running HMMER to confirm transcript orthology ***')
   files = os.listdir('%stranslated_proteins/' % mainout)
+  files.sort()
   if os.path.exists('%shmmer_output' % mainout) == False:
     subprocess.call(['mkdir', '%shmmer_output' % mainout])
   group = ''; grouplist = []