I have 8 replicates of Illumina paired-end reads from a fungal RNA-seq experiment, which I assembled de novo with Trinity. The Trinity documentation recommends the --jaccard_clip option when high gene density is expected, which may be the case for a small fungal genome.
I assembled the transcriptome twice, once with --jaccard_clip and once without, and performed a couple of the recommended quality-assessment steps on each.
Read representation is good for both assemblies: 99.12% of reads map to the --jaccard_clip transcriptome, and 98.97% to the transcriptome assembled without --jaccard_clip.
Below are the stats for each transcriptome, generated using TrinityStats.pl.
With --jaccard_clip
Counts of transcripts, etc.
Total trinity 'genes': 18674
Total trinity transcripts: 30205
Percent GC: 62.50
Stats based on ALL transcript contigs:
Contig N50: 2481
Median contig length: 674
Average contig: 1296.03
Total assembled bases: 39146732
Stats based on ONLY LONGEST ISOFORM per 'GENE':
Contig N50: 2336
Median contig length: 363
Average contig: 1034.59
Total assembled bases: 19319873
Without --jaccard_clip
Counts of transcripts, etc.
Total trinity 'genes': 6106
Total trinity transcripts: 18773
Percent GC: 62.38
Stats based on ALL transcript contigs:
Contig N50: 4196
Median contig length: 2074
Average contig: 2720.12
Total assembled bases: 51064844
Stats based on ONLY LONGEST ISOFORM per 'GENE':
Contig N50: 3986
Median contig length: 1973.5
Average contig: 2514.36
Total assembled bases: 15352683
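As a rough sanity check on the two reports, the sketch below derives transcripts per 'gene' and the mean longest-isoform length from the numbers above. The dictionary keys and the `summarize` helper are illustrative names, not part of Trinity; the values are copied from the TrinityStats.pl output in this post.

```python
# Values copied from the TrinityStats.pl output quoted in this post.
ASSEMBLIES = {
    "with --jaccard_clip": {
        "genes": 18674,
        "transcripts": 30205,
        "longest_isoform_bases": 19319873,
    },
    "without --jaccard_clip": {
        "genes": 6106,
        "transcripts": 18773,
        "longest_isoform_bases": 15352683,
    },
}


def summarize(stats):
    """Derive two simple ratios from one assembly's reported counts."""
    return {
        # Average number of isoforms Trinity reports per 'gene'.
        "transcripts_per_gene": stats["transcripts"] / stats["genes"],
        # Should match the 'Average contig' value of the longest-isoform block.
        "mean_longest_isoform_len": stats["longest_isoform_bases"] / stats["genes"],
    }


for name, stats in ASSEMBLIES.items():
    derived = summarize(stats)
    print(
        f"{name}: "
        f"{derived['transcripts_per_gene']:.2f} transcripts per gene, "
        f"mean longest isoform {derived['mean_longest_isoform_len']:.2f} bp"
    )
```

The mean longest-isoform length should reproduce the 'Average contig' line of each longest-isoform block, which is a quick way to confirm the numbers were copied correctly.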
Trinity also recommends counting full-length transcripts by BLASTing against SwissProt. Below, transcripts were aligned to their best protein hit. The tables show the number of protein hits in each bin of percent length coverage by the best-matching transcript. For more info on these tables, see https://github.com/trinityrnaseq/trinityrnaseq/wiki/Counting-Full-Length-Trinity-Transcripts
With --jaccard_clip
hit_pct_cov_bin count_in_bin >bin_below
100 1046 1046
90 571 1617
80 404 2021
70 331 2352
60 327 2679
50 330 3009
40 305 3314
30 231 3545
20 281 3826
10 120 3946
Without --jaccard_clip
hit_pct_cov_bin count_in_bin >bin_below
100 2313 2313
90 764 3077
80 552 3629
70 440 4069
60 381 4450
50 333 4783
40 314 5097
30 269 5366
20 220 5586
10 102 5688
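To put the two coverage tables side by side, here is a minimal sketch that parses the rows above, checks that the cumulative column agrees with the per-bin counts, and sums the top bins. It assumes the reading given on the linked wiki page, where bin 90 holds hits covered by more than 80% and up to 90% of their length, so bins 90 and 100 together are the "nearly full-length" hits. The function names are hypothetical helpers, not part of Trinity.

```python
# Rows copied from the two tables above: bin, count in bin, cumulative count.
WITH_CLIP = """
100 1046 1046
90 571 1617
80 404 2021
70 331 2352
60 327 2679
50 330 3009
40 305 3314
30 231 3545
20 281 3826
10 120 3946
"""

WITHOUT_CLIP = """
100 2313 2313
90 764 3077
80 552 3629
70 440 4069
60 381 4450
50 333 4783
40 314 5097
30 269 5366
20 220 5586
10 102 5688
"""


def parse_bins(text):
    """Turn whitespace-separated rows into (bin, count, cumulative) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        pct_bin, count, cumulative = (int(field) for field in line.split())
        rows.append((pct_bin, count, cumulative))
    return rows


def cumulative_is_consistent(rows):
    """Check that the third column is the running total of the second."""
    running = 0
    for _, count, cumulative in rows:
        running += count
        if running != cumulative:
            return False
    return True


def count_in_bins_at_or_above(rows, min_bin):
    """Sum the per-bin counts for every bin labelled min_bin or higher."""
    return sum(count for pct_bin, count, _ in rows if pct_bin >= min_bin)


for name, table in (("with --jaccard_clip", WITH_CLIP),
                    ("without --jaccard_clip", WITHOUT_CLIP)):
    rows = parse_bins(table)
    total_hits = rows[-1][2]
    nearly_full = count_in_bins_at_or_above(rows, 90)
    print(f"{name}: consistent={cumulative_is_consistent(rows)}, "
          f"nearly full-length={nearly_full} of {total_hits} hits")
```

Comparing the nearly full-length count as a fraction of total hits, rather than the raw count alone, keeps the comparison fair when the two assemblies hit different numbers of proteins.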
I would like to choose the better of these transcriptomes for my analysis, but I'm still not sure which is the more representative. Does anyone have advice on how to make the final selection?
Hey Chris, thanks!
I tried BUSCO on the longest isoform per gene for both assemblies but didn't get great numbers for either one...
With --jaccard_clip
Without --jaccard_clip
It's a transcriptome, so you may not get the complete set of BUSCOs for your taxonomic group (this only represents what is expressed, unlike a genome).
The key is using this to compare various assembly versions (or assemblies made with different tools). The two are fairly comparable, but the --jaccard_clip assembly is slightly higher. It might be better to run BUSCO on all the transcripts (not just the longest isoform per gene) using the 'transcriptome' mode, if you aren't already doing that; the longest representative sequence may not always be the best.
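If you end up comparing several BUSCO runs, something like the sketch below can pull the percentages out of each one-line summary so they can be ranked. It assumes the summary string has the usual `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` shape; the two example strings are made-up placeholders, not the results from this thread, and `parse_busco_summary` is a hypothetical helper rather than anything shipped with BUSCO.

```python
import re

# Assumed shape of a BUSCO one-line summary, e.g.
#   C:10.0%[S:8.0%,D:2.0%],F:5.0%,M:85.0%,n:100
SUMMARY_RE = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,"
    r"M:(?P<missing>[\d.]+)%,"
    r"n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Extract the percentages and BUSCO count from one summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"not a recognised BUSCO summary line: {line!r}")
    fields = {key: float(value) for key, value in match.groupdict().items()}
    fields["n"] = int(fields["n"])
    return fields


# PLACEHOLDER summaries for illustration only; substitute your own runs.
runs = {
    "assembly_A": "C:10.0%[S:8.0%,D:2.0%],F:5.0%,M:85.0%,n:100",
    "assembly_B": "C:12.0%[S:9.0%,D:3.0%],F:4.0%,M:84.0%,n:100",
}

parsed = {name: parse_busco_summary(line) for name, line in runs.items()}

# Rank by complete BUSCOs, highest first.
for name in sorted(parsed, key=lambda key: parsed[key]["complete"], reverse=True):
    stats = parsed[name]
    print(f"{name}: complete={stats['complete']}%, "
          f"duplicated={stats['duplicated']}%, missing={stats['missing']}%")
```

When running on all isoforms, expect the duplicated fraction to go up, since several isoforms of one gene can each match the same BUSCO; the complete fraction is the more useful number for ranking assemblies.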