running BUSCO on all isoforms or longest isoform per gene?
6.7 years ago
Farbod ★ 3.4k

Hi Biostars,

In order to assess the completeness of de novo transcriptome assembly, I have used BUSCO_v3.

As my assembly belongs to a fish I have used actinopterygii_odb9 as my lineage dataset.

I ran BUSCO twice: once on my whole assembly (Trinity usually reports several isoforms for most genes, and I kept them all in this run) and once on only the longest isoform of each gene.

As expected, the duplication rate decreased in the second approach.

Q: Which approach is correct and gives the more biologically meaningful answer?

Thanks

NOTE: most genes are already duplicated in fishes. ;-)

MY BUSCO Script:

python scripts/run_BUSCO.py -i Trinity.fasta  -o OUTPUT_All_isoforms -l actinopterygii_odb9 -m tran --cpu 8
assembly RNA-Seq • 3.4k views
6.7 years ago

You ask: "Which [way gives a] more biologically meaningful answer?" Don't leave out the other aspect: computationally meaningful.

Since you measured both (all isoforms vs longest), did you find a score difference? If you find a different BUSCO score for Missing+Fragmented conserved genes, then "all" has fewer misses, and the computational details matter.

Here are the computational points. Point 1: use busco -m protein, as the busco.py transcript translation mode is flaky and not to be trusted. See the details below.

Point 2: measure all isoforms, as the closest homology is often found with a shorter isoform. You can then rescore the BUSCO summary using only the best-homology isoform per locus, if you want to remove those isoform-Duplicate counts.
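As a rough illustration of that rescoring step, here is a minimal Python sketch. It is not part of BUSCO. It assumes Trinity-style sequence IDs (e.g. TRINITY_DN1000_c115_g5_i1, optionally with a TransDecoder-style .p1 suffix), and that the BUSCO full table is tab-separated with the BUSCO id, status and sequence id in the first three columns; check your own full_table file before relying on that layout. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: the locus name is the ID with the trailing _i<N> (and an
# optional .p<N> protein suffix) removed.
ISOFORM_SUFFIX = re.compile(r"_i\d+(\.p\d+)?$")


def gene_id(seq_id):
    """Collapse an isoform (or isoform-protein) ID to its gene/locus ID."""
    return ISOFORM_SUFFIX.sub("", seq_id)


def read_full_table(path):
    """Yield (busco_id, status, seq_id) from a BUSCO full table.

    Assumed layout: '#' comment lines, then tab-separated rows whose first
    three columns are BUSCO id, status, sequence id. Missing BUSCOs have
    no sequence id, so seq_id is None for them.
    """
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            cols = line.rstrip("\n").split("\t")
            yield cols[0], cols[1], (cols[2] if len(cols) > 2 else None)


def rescore(rows):
    """Relabel 'Duplicated' BUSCOs whose hits all come from one locus.

    Returns {busco_id: status}. A BUSCO stays 'Duplicated' only if its
    hits map to two or more distinct gene IDs.
    """
    genes = defaultdict(set)
    status_of = {}
    for busco_id, status, seq_id in rows:
        status_of[busco_id] = status
        if seq_id:
            genes[busco_id].add(gene_id(seq_id))
    rescored = {}
    for busco_id, status in status_of.items():
        if status == "Duplicated" and len(genes[busco_id]) == 1:
            status = "Complete"
        rescored[busco_id] = status
    return rescored
```

Calling rescore(read_full_table("full_table.tsv")) (the file name is a placeholder) gives one status per BUSCO, from which the summary percentages can be recounted. Duplicates that span different Trinity genes are left alone, since those may be real paralogs.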

Corollary of 1: don't measure the longest transcript, but the longest protein, if you must use only one per locus. The longest transcripts are often those with artifacts (joins/chimeras and insertions in the coding sequence) that break their protein homology while making them longer.
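Picking the longest protein per locus can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a protein FASTA produced by whatever translator you trust, the IDs are Trinity-style (gene name followed by _i<N>, optionally with a .p<N> suffix), and a trailing * stop symbol is not counted toward length. The function names are hypothetical.

```python
import re

# Assumption: stripping _i<N> (plus an optional .p<N>) yields the locus ID.
ISOFORM_SUFFIX = re.compile(r"_i\d+(\.p\d+)?$")


def read_fasta(lines):
    """Yield (seq_id, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first word of the header as the sequence ID.
            header, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def protein_length(seq):
    """Length in residues, ignoring a trailing stop symbol."""
    return len(seq.rstrip("*"))


def longest_protein_per_gene(records):
    """Return {gene_id: (seq_id, sequence)} keeping the longest protein."""
    best = {}
    for seq_id, seq in records:
        gene = ISOFORM_SUFFIX.sub("", seq_id)
        if gene not in best or protein_length(seq) > protein_length(best[gene][1]):
            best[gene] = (seq_id, seq)
    return best
```

For a file on disk, something like longest_protein_per_gene(read_fasta(open("proteins.fasta"))) would do (the file name is a placeholder). The selection is by protein length, not transcript length, which is the point of the corollary above.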

Point 3: BUSCO's Single/Duplicate measure is not very useful, as most gene sets under-report paralogs. Paralogs are harder to reconstruct and are often left out of gene sets, making the 'single-copy' estimate of OrthoDB a computational rather than biological criterion. Also, distinguishing locus alternates from paralogs is tricky even with a good chromosome assembly to map loci; alternate isoforms can look like paralogs, and vice versa. My recommendation is to just ignore the BUSCO single/duplicate distinction. Missing and fragmented conserved genes are the ones to be concerned with.

Point 1 details: the BUSCO.py -m tran (transcript mode) has a very poor (quick and dirty) method of translating transcripts in all frames, in pieces, into proteins. You should instead use an accurate transcript-to-protein translator and run BUSCO in protein mode to get accurate answers, ones that match what other homology assessments, or public uses, of your transcripts will show.

Do the test yourself: you get different BUSCO results from -m tran versus -m protein. The reason is that transcripts can have many kinds of artifacts that scramble their coding sequences, and can have parts of coding sequences mashed together in different ways.

-- Don Gilbert

Disclosure: I develop/provide accurate gene reconstruction software called EvidentialGene, and I pay attention to such details of gene data informatics.


Hi Don, thanks for the insight! I am planning to do what you suggested (i.e. translate the transcripts to proteins and filter for the longest protein). However, I found it tricky: I tried TransDecoder, which returns the translation of the full exons rather than the ORF/CDS, so the lengths of the translated sequences do not represent protein length.

Am I going about this the wrong way? Is there an easier way or tool to do the translation and filtering?

Thank you for your help! Peiwen

6.7 years ago

Running it once using the longest isoform is the most appropriate way to go. This will give you the result you're looking for == "to assess the completeness of your assembly result".

One thing you might consider is to first do ORF prediction on the transcripts and run the resulting proteins through BUSCO (as the built-in gene prediction tools in BUSCO are less sensitive on transcripts).


Hi and thanks,

As genome duplication can sometimes result in similar genes with different functions, do you think it is harmless to remove all the other isoforms (duplicated genes, alternative splicing?) here?

I mean, the species does in fact have a high percentage of duplication; is it OK to decrease it intentionally?


No, then you're taking it a step too far!

Duplicated genes should be left as they are. Moreover, I would certainly not catalog duplicate genes as isoforms!! They are distinct gene loci; isoforms are distinct transcripts from the same gene locus (== alternative splicing), which is a totally different thing.
