I am doing genome assemblies with canu followed by two rounds of racon and two rounds of pilon.
The first time I performed an assembly on my dataset using this protocol, I ran BUSCO and got a score of 97%. This was on a long-read dataset of about 26 GB for a dipteran genome.
I did another sequencing run and added ~10 GB of data to the assembly.
I followed the same protocol and ran BUSCO. The score decreased to 94%, due to fragmented BUSCOs. How is this possible? This isn't really a coding question, but I don't see how adding more data could produce a worse assembly.
Can you specify what the 97% includes? Is that both complete and fragmented BUSCOs? Between the runs, did the number of complete BUSCOs decrease relative to fragmented BUSCOs?
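If it helps, here is a minimal Python sketch for pulling that breakdown out of the short_summary files from your two runs. It assumes the one-line C/S/D/F/M score string that BUSCO v3-v5 writes (adjust the regex if your version formats it differently); the file names in the usage comment are placeholders:

```python
import re
import sys

def parse_busco_summary(path):
    """Parse the one-line score string from a BUSCO short_summary file.

    Assumes a line of the form:
        C:97.0%[S:95.5%,D:1.5%],F:1.5%,M:1.5%,n:2799
    """
    pattern = re.compile(
        r"C:([\d.]+)%\[S:([\d.]+)%,D:([\d.]+)%\],F:([\d.]+)%,M:([\d.]+)%,n:(\d+)"
    )
    with open(path) as fh:
        for line in fh:
            m = pattern.search(line)
            if m:
                keys = ("complete", "single", "duplicated", "fragmented", "missing")
                scores = dict(zip(keys, map(float, m.groups()[:5])))
                scores["n_markers"] = int(m.group(6))
                return scores
    raise ValueError(f"No BUSCO score line found in {path}")

if __name__ == "__main__":
    # e.g. python busco_breakdown.py short_summary_run1.txt short_summary_run2.txt
    for path in sys.argv[1:]:
        print(path, parse_busco_summary(path))
```

Comparing the complete vs. fragmented percentages side by side will tell you whether complete BUSCOs actually shifted into the fragmented category between runs.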
Differences in assembly size would also be helpful in thinking about this problem. You may have added 10 Gb of sequencing data, but it is possible for all of it to collapse down to an assembly of roughly the same size, just more contiguous. A distribution of scaffold lengths could also be insightful (the N50 metric can be deceiving).
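A quick way to get total size, N50, and a rough length distribution for both assemblies is a short script like the sketch below. It is self-contained Python (no Biopython), assumes plain FASTA input, and the file names in the usage comment are placeholders:

```python
import sys

def contig_lengths(fasta_path):
    """Yield sequence lengths from a (possibly multi-line) FASTA file."""
    length = 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                if length:
                    yield length
                length = 0
            else:
                length += len(line.strip())
    if length:
        yield length

def n50(lengths):
    """N50: the length L such that scaffolds >= L cover half the assembly."""
    lengths = sorted(lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length

if __name__ == "__main__":
    # e.g. python asm_stats.py assembly_run1.fasta assembly_run2.fasta
    for path in sys.argv[1:]:
        lens = sorted(contig_lengths(path))
        total = sum(lens)
        print(f"{path}: {len(lens)} scaffolds, {total / 1e6:.1f} Mb total, "
              f"N50 = {n50(lens):,} bp")
        # crude length distribution: quartiles of the sorted scaffold lengths
        for q in (0, 25, 50, 75, 100):
            idx = q * (len(lens) - 1) // 100
            print(f"  p{q}: {lens[idx]:,} bp")
```

If the two assemblies have similar total sizes but different scaffold length distributions, that points at the extra data changing how repeats or haplotypes collapsed rather than adding new sequence.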
Find the differential BUSCOs between the two assemblies, then align a close ortholog from Drosophila (or any closer species) to both assemblies with a minimum identity lower than BUSCO's threshold, which is somewhere around 90%. If you hit the same region in both assemblies, you can extract those regions from each genome and look for nucleotide changes between the two. Seeing that might convince you that the extra data changed the consensus sequence in some regions and caused the decrease in the BUSCO completeness score.
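For the first step, finding the differential BUSCOs, here is a minimal Python sketch that compares the tab-separated full_table files BUSCO writes for each run. It assumes the first two columns are the BUSCO id and its status (Complete/Duplicated/Fragmented/Missing), which holds across the BUSCO versions I know, though the later columns vary; the paths in the usage comment are placeholders:

```python
import csv
import sys

def busco_statuses(full_table_path):
    """Map BUSCO id -> status from a BUSCO full_table file.

    Assumes tab-separated rows, comment lines starting with '#',
    and BUSCO id / status in the first two columns.
    """
    statuses = {}
    with open(full_table_path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue
            # duplicated BUSCOs appear on multiple lines; keep the first
            statuses.setdefault(row[0], row[1])
    return statuses

if __name__ == "__main__":
    # e.g. python diff_buscos.py run1/full_table.tsv run2/full_table.tsv
    old = busco_statuses(sys.argv[1])
    new = busco_statuses(sys.argv[2])
    for busco_id in sorted(old):
        before, after = old[busco_id], new.get(busco_id, "Missing")
        if before != after:
            print(f"{busco_id}\t{before} -> {after}")
```

The ids that went from Complete to Fragmented are the ones whose regions you would then extract from both assemblies and align against an ortholog to check for consensus changes.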