I've successfully managed to run BUSCO on transcriptome data with some problems (see here), but it's working now.
In genome mode, however, I'm struggling to get decent results and I can't figure out why. For the exact same species in transcriptome mode I get 342 (79%) 'Complete BUSCOs', yet in genome mode I only get 151 (35%).
Is the transcriptome extracted from the genome using gene annotations? If not, it's possible that your transcriptome is just more complete/correct than the genome.
I have heard (second-hand, admittedly) of cases where transcripts map back to the genome >95%, but the genome completeness is low; see my below reply re: nematode. However, if one uses the transcripts to derive gene models (e.g. using MAKER), then uses BUSCO on the gene model sequence, the % completeness goes up.
Based on the BUSCO manual:
BUSCO genome assembly assessment first identifies candidate regions
from the genome to be assessed with tBLASTn searches using BUSCO
consensus sequences. Gene structures are then predicted using Augustus
with BUSCO block profiles. Finally, these predicted genes, or all
genes from an annotated gene set or transcriptome, are assessed using
HMMER and lineage-specific BUSCO profiles to classify matches as
complete, duplicated, or fragmented, or when there are no matches, as
missing.
So, maybe Augustus has a hard time deriving accurate gene models de novo leading to poor BUSCO scores, but when assisted using transcriptome data BUSCO works more effectively?
I have seen this for some nematode genomes, even when doing the extended run; the % varies quite a bit but is always low (in some cases, less than 20%). Interestingly, CEGMA gave more consistent results.
We have wondered whether this has something to do w/ Augustus making poor calls, though I'm not sure how BUSCO is using it internally.
I have tried to decipher the code, but am struggling to work out where it's going wrong. I agree with you, it's likely to be something to do with Augustus (seeing as that's the key difference between 'trans' and 'genome' mode), but I'm not sure whether it's the software itself or BUSCO's implementation of it.
In your nematodes have you compared the genome result with an ORF or transcriptome run?
We haven't done this directly. But we have had another group report much lower scores when using whole genome vs. just gene models (which were derived via assembled RNA-Seq + Braker I believe). It's something I'd like to confirm but I wouldn't be terribly surprised if that does hold true.
So one problem was that Augustus was crashing consistently for some genes, but BUSCO pipes all errors to /dev/null, so it was never reported until I removed the /dev/null redirect.
Managed to fix some of the (local) causes of failure, but am still getting core dumps for a small number of genes.
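To illustrate the point about swallowed errors, here is a minimal Python sketch (not BUSCO's actual code) of running an external tool while keeping its stderr instead of sending it to /dev/null. The function name run_and_report and the stand-in failing command are hypothetical; a real Augustus call would go in its place.

```python
import subprocess
import sys

def run_and_report(cmd):
    """Run cmd, keep its stderr, and return (returncode, stderr_text)."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,  # normal output discarded here for brevity
        stderr=subprocess.PIPE,     # errors are captured, not redirected to /dev/null
        text=True,
    )
    if result.returncode != 0:
        # Surface the failure so a crashing step cannot go unnoticed.
        print(f"FAILED ({result.returncode}): {' '.join(cmd)}", file=sys.stderr)
        print(result.stderr, file=sys.stderr)
    return result.returncode, result.stderr

# Stand-in for a crashing tool (hypothetical): writes to stderr and exits non-zero.
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom\\n'); sys.exit(3)"]
code, err = run_and_report(failing)
```

On POSIX systems, a process killed by a signal (as happens with a core dump) shows up in Python as a negative return code, so the same check would catch it.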
I don't have a small test; these are all very large plant genomes, so it probably wouldn't help much. Have you tried running Augustus on its own, or from within MAKER, to do gene prediction in Dictyostelium? Perhaps Dicty genes are not well predicted.
Dear Chris,
Do you intend to assess your transcriptome assembly (you mentioned "transcriptome data"), or a genome assembly?
If the former is your aim, you should use:
python BUSCO_v1.1b.py -o NAME -in TRANSCRIPTOME -l LINEAGE -m trans
It is also usually recommended to run the closest lineage set available for the species being analysed. If your species is a fish, it is better to choose vertebrate instead of eukaryota.
In addition, depending on how much duplication there is in the genome (shown as "D" in the BUSCO results), the number of Complete BUSCOs may have decreased and the number of Duplicated BUSCOs increased.
I've already done this with CEGMA and the results are similar. A reviewer recommended we use BUSCO instead, but now am getting these very different results. Am tempted to ditch BUSCO and just stick with CEGMA.
I have given up on BUSCO v1.22 as it appears the Augustus step is not running correctly on our set-up for some reason.
Reverting to v1.1b1 gives more sensible results (70-80% complete) in genome mode, but the results in transcriptome mode are now different when compared to v1.22. Nothing in changelog suggests this should be the case.
There's something odd going on with BUSCO and different versions, but I don't currently have time to dig any more into this.
I know this thread is old, but I'm facing a similar situation.
For transcriptome completeness assessment, the results from BUSCO v2.0 and CEGMA can be very different: BUSCO reports C:55.1%[S:35.6%,D:19.5%],F:6.3%,M:38.6%,n:978, whereas CEGMA produces higher scores (74.19 completeness; 83.06 partial).
So I wonder how you dealt with this in the end. Did you just stick with CEGMA?
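For anyone comparing runs, here is a small Python sketch that pulls the numbers out of a one-line BUSCO summary like the one quoted above. The regular expression is an assumption based only on that single quoted line, and the function name parse_busco_summary is made up for this example; other BUSCO versions may format the line differently.

```python
import re

# Pattern assumed from the one summary line quoted in this thread.
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)

def parse_busco_summary(line):
    """Return the percentages as floats and n as an int."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"not a recognised BUSCO summary: {line!r}")
    fields = {key: float(value) for key, value in match.groupdict().items() if key != "n"}
    fields["n"] = int(match.group("n"))
    return fields

s = parse_busco_summary("C:55.1%[S:35.6%,D:19.5%],F:6.3%,M:38.6%,n:978")

# In this line, complete = single-copy + duplicated,
# and complete + fragmented + missing adds up to 100.
complete_from_parts = round(s["S"] + s["D"], 1)
total = round(s["C"] + s["F"] + s["M"], 1)
print(complete_from_parts, total)
```

In this particular result the duplicated fraction (19.5%) is a large share of the complete fraction, so it is worth looking at S and D separately rather than only at C when comparing against CEGMA.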
This is the command:
python3 BUSCO_v1.22.py -o output -in genome.fasta -l /db/busco/eukaryota -m genome
With these versions of the tools:
Yeah, could well be.
I'm not convinced that BUSCO is doing the right thing here either, but working through the code is like walking through treacle...
Both are data derived from a curated species, Dictyostelium discoideum, so should be pretty reliable and similar.