BUSCO duplicated
1
0
12 months ago
san96 ▴ 160

Hello everyone,

I just assembled a data set from a non-model organism with Trinity, but I am getting many contigs. I ran cd-hit to remove redundancy, but many contigs remain. I am also concerned about the high duplication rate reported by BUSCO. What do you recommend I do?

Before CD-HIT:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  277062
Total trinity transcripts:      416235
Percent GC: 42.48

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 3438
        Contig N20: 2526
        Contig N30: 1984
        Contig N40: 1583
        Contig N50: 1231

        Median contig length: 451
        Average contig: 774.04
        Total assembled bases: 322183936


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 3086
        Contig N20: 2125
        Contig N30: 1538
        Contig N40: 1097
        Contig N50: 794

        Median contig length: 370
        Average contig: 603.55
        Total assembled bases: 167220426

After CD-HIT (cd-hit-est -o cdhit -c 0.98 -i Trinity.fasta -p 1 -d 0 -b 3 -T 10):

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  276194
Total trinity transcripts:      396337
Percent GC: 42.40

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 3325
        Contig N20: 2428
        Contig N30: 1903
        Contig N40: 1504
        Contig N50: 1158

        Median contig length: 437
        Average contig: 744.38
        Total assembled bases: 295026540


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 3086
        Contig N20: 2125
        Contig N30: 1538
        Contig N40: 1097
        Contig N50: 794

        Median contig length: 371
        Average contig: 604.02
        Total assembled bases: 166826505
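
For readers unfamiliar with the Nx values in these reports, here is a minimal Python sketch of the usual definition: sort contigs from longest to shortest and report the length at which the running total first reaches the chosen fraction of all assembled bases. The function name `nx` and the toy lengths are assumptions made for illustration; the Trinity stats script may differ in small details.

```python
def nx(lengths, fraction=0.5):
    """Return the contig length at which the cumulative sum of lengths,
    taken from longest to shortest, first reaches `fraction` of the total."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length


# Toy contig lengths (hypothetical, not taken from the assembly above).
toy_lengths = [100, 200, 300, 400, 500]
n50 = nx(toy_lengths, 0.5)
n10 = nx(toy_lengths, 0.1)
print("N50:", n50)
print("N10:", n10)
```

Because many short contigs add little to the total base count, removing them changes the median far more than it changes N50.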

[Image: BUSCO results plot]

BUSCO Trinity • 1.8k views
ADD COMMENT
0

I just assembled a data set from a non-model organism with Trinity, however, I am getting many contigs

Since you used Trinity, this must be RNA-seq data. In that case, getting many contigs is not unexpected, nor is some "redundancy". Did you run BUSCO in transcriptome mode?

ADD REPLY
0

Hello Geno,

Yes, it is RNA-seq data, and I ran BUSCO in Galaxy in transcriptome mode:

# BUSCO version is: 5.5.0 
# The lineage dataset is: eukaryota_odb10 (Creation date: 2020-09-10, number of genomes: 70, number of BUSCOs: 255)
# BUSCO was run in mode: euk_tran

    ***** Results: *****

    C:99.2%[S:20.4%,D:78.8%],F:0.4%,M:0.4%,n:255       
    253 Complete BUSCOs (C)            
    52  Complete and single-copy BUSCOs (S)    
    201 Complete and duplicated BUSCOs (D)     
    1   Fragmented BUSCOs (F)              
    1   Missing BUSCOs (M)             
    255 Total BUSCO groups searched        

Dependencies and versions:
    hmmsearch: 3.1
    makeblastdb: 2.14.1+
    tblastn: 2.14.1+
    busco: 5.5.0
    metaeuk: 6.a5d39d9
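
The percentages in the one-line summary are the listed counts divided by n. The following Python sketch parses that line and rechecks the arithmetic against the counts reported above. The regular expression and the function names are assumptions written for this example, not part of BUSCO.

```python
import re

# Pattern for the one-line BUSCO summary, e.g.
# C:99.2%[S:20.4%,D:78.8%],F:0.4%,M:0.4%,n:255
SUMMARY = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return the summary percentages as floats and n as an int."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError("not a BUSCO summary line")
    result = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    result["n"] = int(match.group("n"))
    return result


def percent(count, total):
    """Percentage of `total` represented by `count`, to one decimal place."""
    return round(100.0 * count / total, 1)


summary = parse_busco_summary("C:99.2%[S:20.4%,D:78.8%],F:0.4%,M:0.4%,n:255")
# Counts copied from the report above.
counts = {"C": 253, "S": 52, "D": 201, "F": 1, "M": 1}
for key, count in counts.items():
    print(key, count, percent(count, summary["n"]), summary[key])
```

Here 201 of 255 BUSCOs are duplicated, which is where D:78.8% comes from.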
ADD REPLY
0

A version of the genome already exists; however, the authors have not yet authorized its use for large-scale studies:

[Image]

ADD REPLY
3
12 months ago
Dave Carlson ★ 2.1k

From the BUSCO documentation:

Transcriptomes and protein sets that are not filtered for isoforms will lead to a high proportion of duplicates. Therefore you should filter them before a BUSCO analysis.

You will want to run BUSCO on a version of the assembly that contains only a single isoform per Trinity "gene". Otherwise, the results will not be interpretable. Trinity comes with utility scripts for this, which may be more effective than using cd-hit-est.

To retain only the longest isoform:

https://github.com/trinityrnaseq/trinityrnaseq/blob/master/util/misc/get_longest_isoform_seq_per_trinity_gene.pl

And to estimate expression and retain only the most highly expressed isoform, see the following:

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Transcript-Quantification#filtering-transcripts

Obviously, you don't necessarily want to use these filtered transcriptome assemblies for all your other downstream analyses, but they will likely be more useful for running BUSCO.
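
As a rough illustration of what "one isoform per Trinity gene" means, here is a minimal Python sketch that keeps the longest sequence per gene. It is not the Trinity utility script linked above. It assumes the usual Trinity header layout, in which the transcript ID ends in `_i<N>` and everything before that suffix is the gene ID. The function names and the toy sequences are made up for illustration.

```python
import re

# Assumed Trinity ID layout: TRINITY_DN1_c0_g1_i2 -> gene TRINITY_DN1_c0_g1
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoform_per_gene(records):
    """Return {gene_id: (header, sequence)}, keeping the longest isoform.
    On ties, the isoform seen first is kept."""
    best = {}
    for header, seq in records:
        transcript_id = header.split()[0]
        gene_id = ISOFORM_SUFFIX.sub("", transcript_id)
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best


# Toy input (hypothetical sequences, not from the assembly in this thread).
example = """\
>TRINITY_DN1_c0_g1_i1 len=8
ACGTACGT
>TRINITY_DN1_c0_g1_i2 len=12
ACGTACGTACGT
>TRINITY_DN2_c0_g1_i1 len=4
ACGT
""".splitlines()

kept = longest_isoform_per_gene(read_fasta(example))
for gene_id, (header, seq) in kept.items():
    print(gene_id, header.split()[0], len(seq))
```

For a real assembly, the linked Trinity script is the safer choice, since it is maintained alongside Trinity's naming scheme.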

ADD COMMENT
0

Hello Dave, thanks for your valuable comment. I ran the script to retain the longest isoform and obtained the following. What do you think of these results?

Thanks again for your comment.

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  277062
Total trinity transcripts:      277062
Percent GC: 41.59

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 3086
        Contig N20: 2125
        Contig N30: 1538
        Contig N40: 1097
        Contig N50: 794

        Median contig length: 370
        Average contig: 603.55
        Total assembled bases: 167220426


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 3086
        Contig N20: 2125
        Contig N30: 1538
        Contig N40: 1097
        Contig N50: 794

        Median contig length: 370
        Average contig: 603.55
        Total assembled bases: 167220426

[Image: BUSCO results plot]

ADD REPLY
1

Just based on the raw numbers of BUSCOs in your new plot, it looks to me like you used a different lineage than in your first BUSCO plot, so it's hard to compare the two sets of results. But for a de novo transcriptome assembly, I think it looks relatively standard.

ADD REPLY
0

Hi Dave,

This is the one with the same lineage. I appreciate your response. My data is from a succulent plant (Agave), and I used the eukaryota_odb10 lineage dataset.

[Image: BUSCO results plot]

ADD REPLY
1

Thanks for that information. I'm not particularly surprised to see that a plant transcriptome has quite a few duplicate BUSCOs when compared against the eukaryota lineage, given how common whole genome duplication has been in plant evolution.

The fact that the proportion of duplicates is now lower suggests that using only one isoform per Trinity gene was helpful.

ADD REPLY
2

Perhaps using the "viridiplantae" lineage would have been more appropriate here?

ADD REPLY
1

Yes, I would agree.

ADD REPLY
0

Thanks, I have performed the new analysis.

ADD REPLY
1

Hello,

Thank you for your recommendations. I have run the analysis again with viridiplantae_odb10.

Before extracting the longest isoform:

[Image: BUSCO results plot]

After:

[Image: BUSCO results plot]

# BUSCO version is: 5.5.0 
# The lineage dataset is: viridiplantae_odb10 (Creation date: 2024-01-08, number of genomes: 57, number of BUSCOs: 425)
# BUSCO was run in mode: euk_tran

    ***** Results: *****

    C:91.6%[S:87.1%,D:4.5%],F:5.2%,M:3.2%,n:425    
    389 Complete BUSCOs (C)            
    370 Complete and single-copy BUSCOs (S)    
    19  Complete and duplicated BUSCOs (D)     
    22  Fragmented BUSCOs (F)              
    14  Missing BUSCOs (M)             
    425 Total BUSCO groups searched        

Dependencies and versions:
    hmmsearch: 3.1
    metaeuk: 6.a5d39d9
    busco: 5.5.0
ADD REPLY
