I've been recommended to use BUSCO in favour of CEGMA for the assessment of some de novo transcriptome assemblies, but I'm seeing inconsistent results with repeated runs of BUSCO and different numbers of threads (1-6). Despite rerunning with the same parameters I'm getting results which vary by upto 2 BUSCOs. I can't see anywhere suggesting that BUSCO is non-deterministic so am wondering if this is a bug.
I'm using the latest version of BUSCO v1.22, hmmer v3.1b2, python v3.4.3 and blast 2.2.29+ against the eukaryotic dataset in transcriptome mode: python3 BUSCO_v1.22.py -o test -in transcripts.fasta -l /db/busco/eukaryota -m trans -c 2
Grepping for the complete BUSCOs I see this:
./run_test_pe1/short_summary_test_pe1: 302 Complete BUSCOs
./run_test_pe2/short_summary_test_pe2: 300 Complete BUSCOs
./run_test_pe3/short_summary_test_pe3: 302 Complete BUSCOs
./run_test_pe4/short_summary_test_pe4: 301 Complete BUSCOs
./run_test_pe5/short_summary_test_pe5: 301 Complete BUSCOs
./run_test_pe6/short_summary_test_pe6: 302 Complete BUSCOs
Anyone else seen this? Any comments appreciated.
Don't have an answer but a wild speculation. The BUSCO python script seems to call hmmsearch which as a --seed option which is not set by the python script, which might cause some randomness in results. Second might be some sort of random sampling for calculations (evalue may be). The results should not be drastically different though. The seed option could be set by editing the python script to check if results become consistent.
Ah, thanks. Will have a poke at that.
I added a --seed 12345 option to all the hmmsearch commands I found in the script but still got inconsistent results. Not really a surprise as the hmmsearch default is to set a fixed seed of '42'.
I did try '--seed 0' which does generate a random seed everytime, but that didn't change anything either.
Thanks for the suggestion anyway.