You ask: "Which [way gives a] more biologically meaningful answer?"
Don't leave out the other aspect: computationally meaningful.
Since you measured both (all isoforms vs longest), did you find a score difference? If the BUSCO scores for Missing+Fragmented conserved genes differ, it is most likely because "all" has fewer misses, and then the computational details matter.
Here are computational points:
point 1. Use busco -m protein, as the busco.py transcript translation mode is flaky and not to be trusted. See the p1 details below.
point 2. Measure all isoforms, as the closest homology is often found with a shorter isoform. You can then rescore the BUSCO summary using only the best-homology isoform per locus, if you want to remove those isoform-Duplicate counts.
corollary of 1. Don't measure the longest transcript, but the longest protein, if you must do only one per locus. The longest transcripts are often those with artifacts: joins/chimeras and insertions in the coding sequence that break their protein homology while making them longer.
point 3. BUSCO's Single/Duplicate measure is not very useful, as most gene sets under-report paralogs. Paralogs are harder to reconstruct and are often left out of gene sets, making the 'single-copy' estimate of OrthoDB a computational rather than biological criterion. Also, distinguishing alternate isoforms of a locus from paralogs is tricky even with a good chromosome assembly to map loci; alternate isoforms can look like paralogs, and vice versa. My recommendation is to just ignore the BUSCO single/duplicate distinction. Missing and fragmented conserved genes are the ones to be concerned with.
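The rescoring in point 2 can be sketched in Python as below. This is only an illustration under stated assumptions, not part of BUSCO: it assumes you have already parsed the per-BUSCO hit table into (busco_id, status, sequence_id, score) tuples, and that isoform IDs follow a hypothetical "<locus>_t<N>" naming scheme. Both `locus_of` and `rescore` are made-up helper names; adjust `locus_of` to your own ID convention.

```python
from collections import defaultdict


def locus_of(seq_id):
    # Hypothetical ID convention: "<locus>_t<N>", e.g. "geneA_t2" -> "geneA".
    return seq_id.rsplit("_t", 1)[0]


def rescore(rows, locus_of=locus_of):
    """Collapse isoform-level Duplicated hits to one best hit per locus.

    rows: iterable of (busco_id, status, seq_id, score) tuples, with status
    one of "Complete", "Duplicated", "Fragmented", "Missing" (assumed layout).
    Returns (counts per status, best hit per BUSCO id and locus).
    """
    best_hit = defaultdict(dict)  # busco_id -> {locus: (score, seq_id)}
    status = {}
    for busco_id, st, seq_id, score in rows:
        if st in ("Missing", "Fragmented"):
            status[busco_id] = st
            continue
        locus = locus_of(seq_id)
        prev = best_hit[busco_id].get(locus)
        # Keep only the highest-scoring isoform for each locus.
        if prev is None or score > prev[0]:
            best_hit[busco_id][locus] = (score, seq_id)
    # A BUSCO hit by isoforms of one locus only is counted as single Complete.
    for busco_id, loci in best_hit.items():
        status[busco_id] = "Complete" if len(loci) == 1 else "Duplicated"
    counts = defaultdict(int)
    for st in status.values():
        counts[st] += 1
    return dict(counts), dict(best_hit)
```

Missing and Fragmented counts are unchanged by this step; only Duplicated calls that come from isoforms of the same locus are folded back into Complete.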
p1 details:
BUSCO.py -m tran (transcript mode) uses a very poor (quick and dirty) method of translating transcripts in all frames, in pieces, into proteins. You should instead use an accurate transcript-to-protein translator, and run the BUSCO software in protein mode to get accurate answers, ones that match what other homology assessments, or public uses, of your transcripts will give.
Do the test yourself: you will get different BUSCO results from -m tran versus -m protein. The reason is that transcripts can have many kinds of artifacts that scramble their coding sequences, and can have parts of coding sequences mashed together in different ways.
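To compare two runs on the Missing+Fragmented measure, a small Python sketch like the following may help. It assumes the one-line BUSCO summary has the form "C:..%[S:..%,D:..%],F:..%,M:..%,n:..", which may differ between BUSCO versions; `parse_summary` and `missing_plus_fragmented` are hypothetical helper names, and the numbers in the usage lines are made up for illustration.

```python
import re

# Assumed one-line summary format; check it against your BUSCO version.
SUMMARY_RE = re.compile(
    r"C:([\d.]+)%\[S:([\d.]+)%,D:([\d.]+)%\],F:([\d.]+)%,M:([\d.]+)%,n:(\d+)"
)


def parse_summary(line):
    """Return the percentages and BUSCO count from a one-line summary."""
    m = SUMMARY_RE.search(line)
    if m is None:
        raise ValueError("not a recognized BUSCO summary line: %r" % line)
    c, s, d, f, miss = (float(x) for x in m.groups()[:5])
    return {"C": c, "S": s, "D": d, "F": f, "M": miss, "n": int(m.group(6))}


def missing_plus_fragmented(line):
    """Percent of conserved genes that are Missing or Fragmented."""
    p = parse_summary(line)
    return p["F"] + p["M"]


# Made-up example summaries for two runs on the same gene set.
run_tran = "C:85.0%[S:80.0%,D:5.0%],F:9.0%,M:6.0%,n:3023"
run_prot = "C:89.0%[S:85.8%,D:3.2%],F:6.9%,M:4.1%,n:3023"
diff = missing_plus_fragmented(run_tran) - missing_plus_fragmented(run_prot)
print("Missing+Fragmented difference (tran - protein): %.1f%%" % diff)
```

The same comparison works for "all isoforms" versus "one per locus" runs, since only the summary lines are needed.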
-- Don Gilbert
Disclosure: I develop/provide accurate gene reconstruction software called EvidentialGene, and I pay attention to such details of gene data informatics.
Hi Don, Thanks for the insight! I am planning on doing the same thing as you suggested (i.e. translate the transcript reads to proteins and filter for the longest protein). However I found it tricky to do so: I tried using TransDecoder which returns the translation of the full exons rather than the ORF/CDS. Thus the lengths of translated sequences do not represent protein length.
Am I going about this the wrong way? Is there an easier way/tool to do the translation and filtering?
Thank you for your help! Peiwen
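For the filtering step, once a protein FASTA is available from whichever translator is used, picking the longest protein per locus can be sketched in Python as follows. This is a minimal illustration, not a feature of any tool named in this thread: it assumes isoform IDs look like "<locus>_i<N>" with an optional ".p<N>" suffix (a hypothetical convention, adjust the regular expression to your IDs), and it measures length in amino acids, ignoring a trailing "*" stop symbol. The function names are made up for this example.

```python
import re


def read_fasta(lines):
    """Yield (seq_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The ID is the first whitespace-delimited word of the header.
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def locus_from_isoform_id(seq_id):
    # Hypothetical ID convention: "geneA_i2.p1" -> "geneA".
    return re.sub(r"_i\d+(\.p\d+)?$", "", seq_id)


def longest_protein_per_locus(records, locus_of=locus_from_isoform_id):
    """Map each locus to (seq_id, aa_length) of its longest protein."""
    best = {}
    for seq_id, seq in records:
        aa_len = len(seq.rstrip("*"))  # do not count a trailing stop symbol
        locus = locus_of(seq_id)
        if locus not in best or aa_len > best[locus][1]:
            best[locus] = (seq_id, aa_len)
    return best
```

With a file on disk, something like `longest_protein_per_locus(read_fasta(open("proteins.fa")))` would give the per-locus picks, where "proteins.fa" is a placeholder file name.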