A few programs allow users to remove all but the longest isoforms from their genome annotations. But the nature of isoforms suggests that this is better done with the transcriptome, unless I misunderstand. On this thread, the questioner was directed to resources (such as the Trinity wiki) for eliminating isoforms from the transcriptome.
So my question is: When annotating a genome, does removing isoforms make more sense pre-annotation with the transcriptome, or could someone annotate a genome using an unfiltered transcriptome and later filter the final annotation?
Background: I recently asked the community for help because I found a high proportion of duplicated BUSCOs in my final genome annotation (assessing the exons in transcriptome mode with BUSCO 5.2.2), whereas the BUSCO scores for my genome assembly were great, with only a small percentage of duplicates. Helpful folks suggested tools to remove short isoforms from my annotations, but looking into it alerted me that the BUSCO results for my transcriptomes also had a high proportion of duplicates.
So I wonder whether I need to go back and fix the transcriptome and then redo the annotations, or if removing isoforms post-annotation suffices. Any ideas are much appreciated.
If you believe that the isoforms are reliable, I don't think they should be removed from the transcriptome or the annotation. This is valuable biological information - why throw it away? For the purpose of BUSCO analysis and for convenience of future users, you can create reduced annotation and transcriptome versions containing only one mRNA per gene.
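As a rough illustration of the "one mRNA per gene" reduced transcriptome, here is a minimal Python sketch. It assumes you already have a transcript-to-gene mapping (for Trinity output, for example, something like its gene-to-transcript map file); the function names `read_fasta` and `longest_per_gene` and the example IDs are made up for this sketch, and "representative" is taken to mean simply the longest sequence per gene.

```python
def read_fasta(lines):
    """Yield (id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longest_per_gene(records, tx_to_gene):
    """Return {gene: (transcript_id, sequence)} keeping the longest transcript.

    tx_to_gene is an assumed, user-supplied dict; transcripts missing from it
    are treated as their own gene. Ties keep the first transcript seen.
    """
    best = {}
    for tx_id, seq in records:
        gene = tx_to_gene.get(tx_id, tx_id)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (tx_id, seq)
    return best


if __name__ == "__main__":
    # Hypothetical toy input, only to show the call pattern.
    fasta = [">g1.t1", "ACGT", ">g1.t2", "ACGTACGT", ">g2.t1", "AC"]
    mapping = {"g1.t1": "g1", "g1.t2": "g1", "g2.t1": "g2"}
    for gene, (tx_id, seq) in longest_per_gene(read_fasta(fasta), mapping).items():
        print(f">{tx_id} gene={gene}\n{seq}")
```

The full transcriptome stays untouched on disk; the reduced set is written out separately and used only for things like BUSCO.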
Thanks for the insight, liorglic. I find a lot of the advice about discarding isoforms confusing, because I would be interested in alternative splicing. So I like your suggestion.
But when creating an annotation with filtered isoforms, should the isoforms be filtered from the gff post-annotation, or at an earlier stage: from the transcriptome (using something like CD-HIT) that will later be used to annotate the genome?
If you filter the transcriptome using CD-HIT, then you will lose the isoform information. That's why I'd go with the "filter at the end" (gff) approach. This assumes that your annotation software (which one are you using?) can handle isoform data (e.g. PASA). If you're not sure, you can always try both options and see what happens.
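To make the "filter at the end" idea concrete, here is a minimal Python sketch that keeps one mRNA per gene from GFF3 lines. It is not a polished tool: the function names are made up, "longest" is assumed to mean the largest summed exon length, and it assumes a simple gene > mRNA > exon/CDS/UTR hierarchy linked by `ID` and `Parent` attributes. Comment lines, any embedded FASTA section, and features that do not descend from a kept mRNA (evidence alignments, for instance) are dropped.

```python
from collections import defaultdict


def parse_attrs(field):
    """Parse GFF3 column 9 ('ID=x;Parent=y') into a dict."""
    return dict(kv.split("=", 1) for kv in field.strip().split(";") if "=" in kv)


def keep_longest_isoform(gff_lines):
    """Return GFF3 feature lines with only the longest mRNA per gene."""
    rows = [line.rstrip("\n").split("\t") for line in gff_lines
            if line.strip() and not line.startswith("#")]
    rows = [r for r in rows if len(r) == 9]  # skips FASTA and malformed lines

    mrna_gene = {}               # mRNA ID -> parent gene ID
    exon_len = defaultdict(int)  # mRNA ID -> summed exon length
    for r in rows:
        attrs = parse_attrs(r[8])
        if r[2] == "mRNA":
            mrna_gene[attrs["ID"]] = attrs["Parent"]
        elif r[2] == "exon":
            # An exon may be shared by several isoforms (Parent=a,b).
            for parent in attrs.get("Parent", "").split(","):
                exon_len[parent] += int(r[4]) - int(r[3]) + 1

    best = {}  # gene ID -> ID of its longest mRNA (ties keep the first seen)
    for mrna, gene in mrna_gene.items():
        if gene not in best or exon_len[mrna] > exon_len[best[gene]]:
            best[gene] = mrna
    kept = set(best.values())

    out = []
    for r in rows:
        attrs = parse_attrs(r[8])
        if r[2] == "gene":
            out.append("\t".join(r))
        elif r[2] == "mRNA":
            if attrs["ID"] in kept:
                out.append("\t".join(r))
        else:
            parents = [p for p in attrs.get("Parent", "").split(",") if p in kept]
            if parents:
                # Drop references to removed isoforms from the Parent list.
                attrs["Parent"] = ",".join(parents)
                r = r[:8] + [";".join(f"{k}={v}" for k, v in attrs.items())]
                out.append("\t".join(r))
    return out
```

A typical use would be `keep_longest_isoform(open("annotation.gff"))` (the file name is a placeholder), writing the returned lines to a new file while keeping the full annotation as the primary record.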
Thanks again for the helpful suggestions, liorglic. I'm using Maker for structural annotation, with the transcriptome as transcript evidence.