Question

BUSCO duplicated

0

10 months ago

sansan96 ▴ 130

Hello everyone,

I just assembled a data set from a non-model organism with Trinity, but I am getting a very large number of contigs. I ran CD-HIT to remove redundancy, but many contigs remain. I am also concerned about the high duplication rate reported by BUSCO. What do you recommend I do?

Before CD-HIT:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  277062
Total trinity transcripts:      416235
Percent GC: 42.48

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 3438
        Contig N20: 2526
        Contig N30: 1984
        Contig N40: 1583
        Contig N50: 1231

        Median contig length: 451
        Average contig: 774.04
        Total assembled bases: 322183936


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 3086
        Contig N20: 2125
        Contig N30: 1538
        Contig N40: 1097
        Contig N50: 794

        Median contig length: 370
        Average contig: 603.55
        Total assembled bases: 167220426

After CD-HIT (cd-hit-est -o cdhit -c 0.98 -i Trinity.fasta -p 1 -d 0 -b 3 -T 10):

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  276194
Total trinity transcripts:      396337
Percent GC: 42.40

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 3325
        Contig N20: 2428
        Contig N30: 1903
        Contig N40: 1504
        Contig N50: 1158

        Median contig length: 437
        Average contig: 744.38
        Total assembled bases: 295026540


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 3086
        Contig N20: 2125
        Contig N30: 1538
        Contig N40: 1097
        Contig N50: 794

        Median contig length: 371
        Average contig: 604.02
        Total assembled bases: 166826505
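To put the two reports side by side, here is a small arithmetic sketch using only the counts printed above. The dictionary keys and the helper name `percent_removed` are illustrative choices, not part of Trinity or CD-HIT.

```python
# Counts copied from the two TrinityStats reports above (ALL transcript contigs).
before = {"genes": 277062, "transcripts": 416235, "bases": 322183936}
after = {"genes": 276194, "transcripts": 396337, "bases": 295026540}


def percent_removed(old: int, new: int) -> float:
    """Percentage of `old` that is no longer present in `new`."""
    return 100.0 * (old - new) / old


# Relative reduction achieved by cd-hit-est at 98% identity.
reduction = {key: percent_removed(before[key], after[key]) for key in before}
for key, pct in reduction.items():
    print(f"{key}: {before[key] - after[key]} removed ({pct:.2f}%)")

# Average number of isoforms per Trinity 'gene' in the original assembly.
isoforms_per_gene = before["transcripts"] / before["genes"]
print(f"isoforms per gene before CD-HIT: {isoforms_per_gene:.2f}")
```

With these numbers, clustering removed fewer than 5% of the transcripts and well under 1% of the Trinity 'genes', while the assembly still averages about 1.5 isoforms per gene.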

enter image description here

BUSCO Trinity • 1.7k views

ADD COMMENT • link updated 10 months ago by GenoMax 147k • written 10 months ago by sansan96 ▴ 130

0

I just assembled a data set from a non-model organism with Trinity, however, I am getting many contigs

Since you used Trinity, this must be RNA-seq data. In that case, getting many contigs is not unexpected, nor is some "redundancy". Did you run BUSCO in transcriptome mode?

ADD REPLY • link 10 months ago by GenoMax 147k

0

Hello Geno,

Yes, it is RNA-seq data, and I ran BUSCO in Galaxy in transcriptome mode:

# BUSCO version is: 5.5.0 
# The lineage dataset is: eukaryota_odb10 (Creation date: 2020-09-10, number of genomes: 70, number of BUSCOs: 255)
# BUSCO was run in mode: euk_tran

    ***** Results: *****

    C:99.2%[S:20.4%,D:78.8%],F:0.4%,M:0.4%,n:255       
    253 Complete BUSCOs (C)            
    52  Complete and single-copy BUSCOs (S)    
    201 Complete and duplicated BUSCOs (D)     
    1   Fragmented BUSCOs (F)              
    1   Missing BUSCOs (M)             
    255 Total BUSCO groups searched        

Dependencies and versions:
    hmmsearch: 3.1
    makeblastdb: 2.14.1+
    tblastn: 2.14.1+
    busco: 5.5.0
    metaeuk: 6.a5d39d9

ADD REPLY • link updated 10 months ago by GenoMax 147k • written 10 months ago by sansan96 ▴ 130

0

A version of the genome already exists; however, the authors have not yet authorized its use for large-scale studies:

enter image description here

ADD REPLY • link 10 months ago by sansan96 ▴ 130

Dave Carlson · Accepted Answer · 2024-01-09

3

10 months ago

Dave Carlson ★ 1.9k

From the BUSCO documentation:

Transcriptomes and protein sets that are not filtered for isoforms will lead to a high proportion of duplicates. Therefore you should filter them before a BUSCO analysis.

You will want to run BUSCO on a version of the assembly that contains only a single isoform per Trinity "gene". Otherwise, the results will not be interpretable. Trinity comes with utility scripts for this, and they may be more effective than cd-hit-est.

To retain only the longest isoform:

https://github.com/trinityrnaseq/trinityrnaseq/blob/master/util/misc/get_longest_isoform_seq_per_trinity_gene.pl
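For readers who want to see the idea behind that script, here is a minimal Python sketch of longest-isoform selection. It is not the Trinity script itself. It assumes Trinity-style transcript IDs such as `TRINITY_DN1_c0_g1_i2`, where the gene ID is everything before the trailing `_i<number>`; the function names and the toy sequences are hypothetical.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: Trinity-style IDs end in "_i<number>"; stripping that suffix
# gives the Trinity 'gene' identifier.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoform_per_gene(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each gene ID to the (header, sequence) of its longest isoform."""
    best: Dict[str, Tuple[str, str]] = {}
    for header, seq in read_fasta(lines):
        transcript_id = header.split()[0]
        gene_id = ISOFORM_SUFFIX.sub("", transcript_id)
        # Keep the first isoform seen on ties; replace only if strictly longer.
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best


# Toy input (hypothetical sequences, not from the assembly in this thread).
example = """>TRINITY_DN1_c0_g1_i1 len=8
ACGTACGT
>TRINITY_DN1_c0_g1_i2 len=12
ACGTACGTACGT
>TRINITY_DN2_c0_g1_i1 len=4
ACGT
""".splitlines()

kept = longest_isoform_per_gene(example)
for gene_id, (header, seq) in kept.items():
    print(f">{header}\n{seq}")
```

For a real assembly, pass an open file handle (for example `open("Trinity.fasta")`) instead of the toy lines. The bundled Trinity script remains the safer choice, since it is maintained alongside Trinity's naming scheme.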

And to estimate expression and retain only the most highly expressed isoform, see the following:

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Transcript-Quantification#filtering-transcripts
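The expression-based alternative follows the same per-gene grouping, but ranks isoforms by abundance rather than length. The sketch below only illustrates that selection rule; it is not Trinity's filtering script. It assumes you already have a transcript-to-TPM mapping from a quantification step, and the IDs, TPM values, and function name are hypothetical.

```python
import re
from typing import Dict

# Assumption: Trinity-style IDs end in "_i<number>"; stripping that suffix
# gives the Trinity 'gene' identifier.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def top_isoform_per_gene(tpm: Dict[str, float]) -> Dict[str, str]:
    """Map each gene ID to the transcript ID with the highest TPM."""
    best: Dict[str, str] = {}
    for transcript_id, value in tpm.items():
        gene_id = ISOFORM_SUFFIX.sub("", transcript_id)
        # Keep the first isoform seen on ties; replace only if strictly higher.
        if gene_id not in best or value > tpm[best[gene_id]]:
            best[gene_id] = transcript_id
    return best


# Hypothetical TPM values for illustration only.
example_tpm = {
    "TRINITY_DN1_c0_g1_i1": 3.2,
    "TRINITY_DN1_c0_g1_i2": 41.7,
    "TRINITY_DN2_c0_g1_i1": 0.9,
}

selected = top_isoform_per_gene(example_tpm)
for gene_id, transcript_id in selected.items():
    print(gene_id, transcript_id, example_tpm[transcript_id])
```

Note that the most highly expressed isoform is not always the longest one, so the two filters can keep different transcripts for the same gene.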

Obviously, you don't necessarily want to use these filtered transcriptome assemblies for all your other downstream analyses, but they will likely be more useful for running BUSCO.

ADD COMMENT • link 10 months ago by Dave Carlson ★ 1.9k

0

Hello Dave, thanks for your valuable comment. I ran the script to retain the longest isoform and obtained the results below. What do you think of them?

I appreciate your comment again.

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  277062
Total trinity transcripts:      277062
Percent GC: 41.59

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 3086
        Contig N20: 2125
        Contig N30: 1538
        Contig N40: 1097
        Contig N50: 794

        Median contig length: 370
        Average contig: 603.55
        Total assembled bases: 167220426


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 3086
        Contig N20: 2125
        Contig N30: 1538
        Contig N40: 1097
        Contig N50: 794

        Median contig length: 370
        Average contig: 603.55
        Total assembled bases: 167220426

enter image description here

ADD REPLY • link updated 10 months ago by GenoMax 147k • written 10 months ago by sansan96 ▴ 130

1

Just based on the raw numbers of BUSCOs in your new plot, it looks to me like you used a different lineage than in your first BUSCO plot, so it's hard to compare the two sets of results. But for a de novo transcriptome assembly, I think it looks relatively standard.

ADD REPLY • link 10 months ago by Dave Carlson ★ 1.9k

0

Hi Dave,

This is the plot with the same lineage. I appreciate your response. My data are from a succulent plant (Agave), and I used the eukaryota_odb10 lineage dataset.

enter image description here

ADD REPLY • link 10 months ago by sansan96 ▴ 130

1

Thanks for that information. I'm not particularly surprised to see that a plant transcriptome has quite a few duplicate BUSCOs when compared against the eukaryota lineage, given how common whole genome duplication has been in plant evolution.

The fact that the proportion of duplicates is now lower suggests that keeping only one isoform per Trinity gene was helpful.

ADD REPLY • link 10 months ago by Dave Carlson ★ 1.9k

2

Perhaps using the "viridiplantae" lineage would have been more appropriate here?

ADD REPLY • link 10 months ago by GenoMax 147k

1

Yes, I would agree.

ADD REPLY • link 10 months ago by Dave Carlson ★ 1.9k

0

Thanks, I have performed the new analysis.

ADD REPLY • link 10 months ago by sansan96 ▴ 130

1

Hello,

Thank you for your recommendations. I have rerun the analysis with viridiplantae_odb10.

Before extracting the longest isoform:

enter image description here

After:

enter image description here

# BUSCO version is: 5.5.0 
# The lineage dataset is: viridiplantae_odb10 (Creation date: 2024-01-08, number of genomes: 57, number of BUSCOs: 425)
# BUSCO was run in mode: euk_tran

    ***** Results: *****

    C:91.6%[S:87.1%,D:4.5%],F:5.2%,M:3.2%,n:425    
    389 Complete BUSCOs (C)            
    370 Complete and single-copy BUSCOs (S)    
    19  Complete and duplicated BUSCOs (D)     
    22  Fragmented BUSCOs (F)              
    14  Missing BUSCOs (M)             
    425 Total BUSCO groups searched        

Dependencies and versions:
    hmmsearch: 3.1
    metaeuk: 6.a5d39d9
    busco: 5.5.0
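For anyone comparing runs, here is a short Python sketch that parses the one-line BUSCO summaries posted in this thread and reports how much of the complete set is duplicated. The regular expression is written against the two lines shown here and is an assumption about the format, not an official BUSCO parser; the function names are hypothetical.

```python
import re

# Pattern for the one-line summary as it appears in this thread.
SUMMARY = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line: str) -> dict:
    """Return the C/S/D/F/M percentages and n from a BUSCO summary line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    fields = match.groupdict()
    result = {key: float(fields[key]) for key in ("C", "S", "D", "F", "M")}
    result["n"] = int(fields["n"])
    return result


def duplicated_share_of_complete(summary: dict) -> float:
    """Percentage of complete BUSCOs that are duplicated (D relative to C)."""
    return 100.0 * summary["D"] / summary["C"]


# Summary lines copied from the two text reports in this thread.
runs = {
    "eukaryota_odb10, all isoforms":
        "C:99.2%[S:20.4%,D:78.8%],F:0.4%,M:0.4%,n:255",
    "viridiplantae_odb10, longest isoform":
        "C:91.6%[S:87.1%,D:4.5%],F:5.2%,M:3.2%,n:425",
}

parsed = {label: parse_busco_summary(line) for label, line in runs.items()}
for label, summary in parsed.items():
    share = duplicated_share_of_complete(summary)
    print(f"{label}: D={summary['D']}% of n={summary['n']}, "
          f"{share:.1f}% of complete BUSCOs duplicated")
```

These two runs differ in both lineage dataset and isoform filtering, so the drop in duplicates cannot be attributed to either change alone from these numbers.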

ADD REPLY • link updated 10 months ago by GenoMax 147k • written 10 months ago by sansan96 ▴ 130