Hi All,
After one week, Trinity finally completed an assembly starting from 800 million reads (an entire NextSeq 500 run). There are tons of sequences, but the statistics look odd to me, so I would like your opinion:
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 858807
Total trinity transcripts: 924905
Percent GC: 40.20
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 1739
Contig N20: 769
Contig N30: 490
Contig N40: 382
Contig N50: 324
Median contig length: 270
Average contig: 363.98
Total assembled bases: 336649200
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 1123
Contig N20: 575
Contig N30: 421
Contig N40: 349
Contig N50: 304
Median contig length: 268
Average contig: 341.45
Total assembled bases: 293239534
I used Trinity with default parameters plus --trimmomatic and --min_kmer_cov 2. I was really expecting the N50 to be bigger. What could be the reason for that?
Note: Before starting the assembly, I quality-filtered the sequences and merged the results into two big paired-end FASTA files.
Any advice would be much appreciated!
Thanks!
Giorgio
That's a lot of transcripts. Try using the read normalization parameter. What species are you assembling?
Thanks for your answer. I know it is a lot, and I did not run the digital normalization; maybe I should have, considering that I have close to a billion raw reads. Do you think that might be the reason for such a low N50? Anyway, the species is a Hawaiian squid (no genome annotation at all, since it has tons of STRs), with a predicted genome size of about 3.8 Gb.
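For anyone reading the numbers above: N50 is the contig length at which contigs of that length or longer hold at least half of all assembled bases. The sketch below is a minimal illustration of that definition, not Trinity's own stats script. The function name `nx` and the toy length lists are made up for the example. It shows how a large number of very short contigs can pull the N50 down even when the longer contigs are unchanged.

```python
def nx(lengths, x=50):
    """Return the Nx value for a list of contig lengths.

    Contigs are sorted from longest to shortest and their lengths are
    accumulated; Nx is the length of the contig at which the running
    total first reaches x percent of all assembled bases.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating point rounding issues.
        if running * 100 >= total * x:
            return length


# Hypothetical toy assembly: five contigs.
contigs = [10, 8, 6, 4, 2]
print("N50:", nx(contigs, 50))

# Same contigs plus 100 very short fragments (illustrative only).
fragmented = contigs + [2] * 100
print("N50 with many short fragments:", nx(fragmented, 50))
```

With real data, the lengths would come from the sequence lengths in the assembled transcripts file. Comparing the N50 before and after dropping very short or very lowly expressed contigs gives a rough idea of how much the fragments are affecting the statistic.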