Hello everyone,
I just assembled a data set from a non-model organism with Trinity; however, I am getting many contigs. I ran CD-HIT to remove the redundancy, but I still have many contigs. I am also concerned about the high duplication rate reported by BUSCO. What do you recommend I do?
Before CD-HIT:
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 277062
Total trinity transcripts: 416235
Percent GC: 42.48
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 3438
Contig N20: 2526
Contig N30: 1984
Contig N40: 1583
Contig N50: 1231
Median contig length: 451
Average contig: 774.04
Total assembled bases: 322183936
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 3086
Contig N20: 2125
Contig N30: 1538
Contig N40: 1097
Contig N50: 794
Median contig length: 370
Average contig: 603.55
Total assembled bases: 167220426
After CD-HIT (cd-hit-est -o cdhit -c 0.98 -i Trinity.fasta -p 1 -d 0 -b 3 -T 10):
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 276194
Total trinity transcripts: 396337
Percent GC: 42.40
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 3325
Contig N20: 2428
Contig N30: 1903
Contig N40: 1504
Contig N50: 1158
Median contig length: 437
Average contig: 744.38
Total assembled bases: 295026540
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 3086
Contig N20: 2125
Contig N30: 1538
Contig N40: 1097
Contig N50: 794
Median contig length: 371
Average contig: 604.02
Total assembled bases: 166826505
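To put the two reports side by side, here is a minimal Python sketch that computes how much cd-hit-est actually removed. The counts are copied from the stats above; the dictionary keys and the `percent_removed` helper are just illustrative names, not part of Trinity or CD-HIT.

```python
# Counts copied from the two TrinityStats reports above (ALL transcript contigs).
before = {"genes": 277062, "transcripts": 416235, "bases": 322183936}
after = {"genes": 276194, "transcripts": 396337, "bases": 295026540}


def percent_removed(n_before: int, n_after: int) -> float:
    """Percentage of the original count that was removed by clustering."""
    return 100.0 * (n_before - n_after) / n_before


# Percentage removed for each metric (hypothetical helper, for illustration only).
reduction = {key: percent_removed(before[key], after[key]) for key in before}

for key, pct in reduction.items():
    print(f"{key}: {before[key] - after[key]} removed ({pct:.2f}%)")
```

With these numbers, clustering at 98% identity removed roughly 4.8% of the transcripts but only about 0.3% of the Trinity 'genes', which is consistent with the longest-isoform stats being almost unchanged.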
Since you used Trinity, this must be RNA-seq data. In that case, getting many contigs is not unexpected, nor is some "redundancy". Did you run BUSCO in transcript mode?
Hello Geno,
Yes, it is RNA-seq data, and I ran BUSCO in Galaxy in transcriptome mode:
A version of the genome already exists; however, the authors have not yet authorized its use for large-scale studies: