Hi, all. Can anyone tell me why I might have very low duplicate BUSCOs after assembly but quite high after annotation?
After assembly:
C:98.1%[S:97.0%,D:1.1%],F:0.6%,M:1.3%,n:2124
After annotation:
C:97.0%[S:59.7%,D:37.3%],F:1.0%,M:2.0%,n:2124
By the way, this is an insect sequenced via 10x Genomics and assembled "long reads" with Supernova 2.0.1. I annotated with a close relative as a reference, with Maker set for Eukaryota, and I used BUSCO 5.2.2 with Endopterygota odb10.
According to the recent BUSCO paper, duplicated BUSCOs could be formed from poor assembly of haplotypes, but the level of phasing I got with Supernova should be decent.
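For anyone comparing the two summary lines above, it can help to turn the percentages into approximate BUSCO counts. The following is only a small sketch: the function name `parse_busco_summary` is made up for illustration and is not part of BUSCO, and the counts are estimates because the percentages in the summary line are already rounded.

```python
import re

# Matches the one-line BUSCO summary notation quoted in the question, e.g.
# C:98.1%[S:97.0%,D:1.1%],F:0.6%,M:1.3%,n:2124
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return (percentages, approximate counts, n) for one summary line.

    Hypothetical helper, not a BUSCO function. Counts are rounded
    estimates derived from the already-rounded percentages.
    """
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    n = int(match.group("n"))
    pct = {key: float(match.group(key)) for key in "CSDFM"}
    counts = {key: round(value * n / 100) for key, value in pct.items()}
    return pct, counts, n


assembly = "C:98.1%[S:97.0%,D:1.1%],F:0.6%,M:1.3%,n:2124"
annotation = "C:97.0%[S:59.7%,D:37.3%],F:1.0%,M:2.0%,n:2124"

_, asm_counts, _ = parse_busco_summary(assembly)
_, ann_counts, _ = parse_busco_summary(annotation)

# Change in approximate counts per category (annotation minus assembly).
delta = {key: ann_counts[key] - asm_counts[key] for key in "CSDFM"}
print("assembly  :", asm_counts)
print("annotation:", ann_counts)
print("change    :", delta)
```

With the numbers from the question, this gives roughly 23 duplicated BUSCOs for the assembly against roughly 792 for the annotation, while the complete total barely moves. So the shift is almost entirely from single-copy to duplicated.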
<strike>Sounds like a fragmented annotation.</strike>
You might check what happens to the BUSCO genes between your assembly and your annotation using agat_sp_compare_two_BUSCOs.pl from AGAT.
Thanks for the quick response, Juke. I'll look into AGAT, although I think I've had problems with both Singularity and Docker.
What exactly do you mean by fragmented annotation? I masked my custom repeat library and did 3 rounds of Maker with a close relative as protein evidence, as well as the Swiss-Prot omnibus. Each round used Augustus and SNAP trained on the prior round. So I'm not sure where the problem would occur, or what would be done differently. (I've also done downstream GO work, but may need to redo this depending on what the issue is.)
My first thought was a fragmented annotation (i.e. a gene that is complete in the assembly but annotated in several pieces). But that was wrong: those genes would end up in the fragmented category of the BUSCO score, which is not your case.
Using agat_sp_compare_two_BUSCOs.pl you should be able to work out what is happening. You need to load the tracks into a genome browser and look at the duplicated BUSCOs found in the annotation. Were they already found by BUSCO in the assembly, or are they new genes?
Your annotation BUSCO score is really good. I think your annotation went well and you annotated "new" genes. Whether they are real duplicates or artifacts of assembly/phasing issues is something you should investigate.
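As a rough complement to the AGAT script and the genome browser, the question "were they already found in the assembly?" can also be checked from the two BUSCO `full_table.tsv` files. This is a minimal sketch, not a replacement for agat_sp_compare_two_BUSCOs.pl. It assumes the usual tab-separated layout with the BUSCO id in the first column and the status (Complete, Duplicated, Fragmented, Missing) in the second; the function names and file paths are hypothetical.

```python
from collections import Counter


def read_busco_status(lines):
    """Map BUSCO id -> status from the lines of a full_table.tsv.

    Assumes column 1 is the BUSCO id and column 2 the status.
    Duplicated BUSCOs span several rows that share the same status,
    so keeping the last row seen per id is sufficient here.
    """
    status = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.rstrip("\n").split("\t")
        status[fields[0]] = fields[1]
    return status


def status_transitions(assembly, annotation):
    """Count (assembly status, annotation status) pairs per BUSCO id."""
    pairs = Counter()
    for busco_id, before in assembly.items():
        after = annotation.get(busco_id, "Missing")
        pairs[(before, after)] += 1
    return pairs


def newly_duplicated(assembly, annotation):
    """BUSCO ids duplicated in the annotation but not in the assembly."""
    return sorted(
        busco_id
        for busco_id, state in annotation.items()
        if state == "Duplicated" and assembly.get(busco_id) != "Duplicated"
    )


# Usage sketch (paths are placeholders for your own BUSCO run folders):
# with open("busco_assembly/full_table.tsv") as fh:
#     asm = read_busco_status(fh)
# with open("busco_annotation/full_table.tsv") as fh:
#     ann = read_busco_status(fh)
# print(status_transitions(asm, ann).most_common())
# print(newly_duplicated(asm, ann)[:20])
```

A large ("Complete", "Duplicated") count would mean that BUSCOs which were single-copy in the genome became duplicated only at the annotation level. Those are the ones worth inspecting in the browser first.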
Thanks, Juke. I think you're right. I've been using JBrowse to visualize, so I'll take another look. Yes, I'm concerned that the duplicates are legitimate, so I want to be certain before filtering. I will look into AGAT again to see whether it highlights any issues.
Does "after annotation" mean you ran BUSCO on the predicted genes or proteins in transcript or proteome mode? Otherwise, it doesn't make much sense, because the annotation does not alter the assembly. The duplication could be caused by predicted isoforms of the same gene. You need to reduce all genes to their longest isoform to get realistic numbers for single copy on the annotation. Because the scores for the assembly are excellent, I think your assembly is not the problem. Your annotation is likely fine as well, it is just this technical detail.
Edit: As pointed out by Juke34, the single isoform annotation should be used only for obtaining BUSCO scores, and possibly single-copy orthologue finding.
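To illustrate the longest-isoform reduction described above, here is a minimal Python sketch that works on a transcript FASTA. It is not the method used later in the thread (that was AGAT, working on the annotation itself), and the way the gene id is derived is an assumption: the default `gene_id_from_header` simply strips a trailing `-mRNA-<number>` from the first word of the header, which may or may not match your files. Swap in your own function if your headers differ.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id_from_header(header):
    """ASSUMPTION: transcript ids look like '<gene>-mRNA-<n>'.

    Hypothetical naming convention for illustration only; adapt this
    to whatever links transcripts to genes in your own headers.
    """
    transcript_id = header.split()[0]
    return re.sub(r"-mRNA-\d+$", "", transcript_id)


def longest_isoforms(records, gene_id=gene_id_from_header):
    """Keep one record per gene: the one with the longest sequence.

    Ties keep the isoform seen first. Genes are returned in order of
    first appearance.
    """
    best = {}
    for header, seq in records:
        gid = gene_id(header)
        if gid not in best or len(seq) > len(best[gid][1]):
            best[gid] = (header, seq)
    return list(best.values())


# Usage sketch (file names are placeholders):
# with open("transcripts.fasta") as fh:
#     kept = longest_isoforms(read_fasta(fh))
# with open("transcripts.longest.fasta", "w") as out:
#     for header, seq in kept:
#         out.write(f">{header}\n{seq}\n")
```

The reduced file is meant only as BUSCO input. The full annotation with all isoforms stays untouched.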
Thanks for the feedback, Michael. I used BUSCO on the exons from Maker's final annotation, set on transcriptome mode and selecting the lineage Endopterygota. I think you're right about the isoforms -- I had been wary of filtering out isoforms because I am mostly interested in detecting paralogs, but I'll see what the best method is to do this.
Any suggestions? I'm hoping this can be done post-annotation, but whatever works.
Thanks, Juke34. I used AGAT to reduce my duplicated BUSCOs to 7.7%. However, I wonder if filtering the isoforms on the transcriptome makes more sense than post-annotation. Any thoughts?
If the annotation is made on a genome assembly, filtering the isoforms post-annotation for BUSCO is the way to go.
If the annotation is made on a transcriptome assembly, then you may also filter the transcriptome as explained here.
Basically, the problem can arise in transcriptome assemblies, where several isoforms from close genes might all be grouped into a single gene, or a set of isoforms from a single gene might be seen as coming from different genes. When the transcripts are mapped to a genome for genome annotation, most of these problems should vanish.
Thanks so much for the advice, Juke. I assembled a transcriptome using Trinity, but I didn't annotate it. Instead I used it as transcript evidence in Maker when annotating a genome. So I suppose filtering the isoforms after genome annotation is probably the way to go.
Just curious: when you say "for BUSCO" do you mean that I would filter isoforms only to correct the BUSCO score, but publish a genome that contains all isoforms? I am concerned about discarding isoforms since they represent alternative splicing.
By the way, I started a new post about this issue to see what others recommend, since it is sort of a tangential topic to this one. :^)
Yes, filtering isoforms here is useful only to obtain a proper BUSCO score. For the annotation, keep everything. Use agat_sp_statistics.pl to get statistics with and without isoforms.
Optimally, there would be a feature in BUSCO to treat isoforms differently from gene copies. In some borderline cases taking the longest isoform might not even be the best choice.
That could, for example, work using a naming convention in the FASTA header, similar to the ENSEMBL FASTA headers containing gene and transcript ids.
I have simple Perl scripts for doing the single-isoform reduction for both Ensembl- and GenBank-style FASTA files, should there be any need.
This is not a direct answer, but maybe you can verify your BUSCO results with MOSGA by uploading both sequences. Generally, the annotation should not change your sequence and therefore should not affect your BUSCO results. But maybe you will find some more differences.
Just disable the phylogenetic analysis and enable BUSCO and EukCC (as a second genome completeness tool). EukCC only requires a free GeneMark-ES/ET/EP license.