There are numerous de novo assemblers such as ABySS, SPAdes, AMOS, etc., but the only transcript assembler I know of is Inchworm, one of the three tools that make up Trinity.
I would like to hear your thoughts on why those de novo assemblers can or cannot be used as transcript assemblers. I am asking in order to understand the background of this matter.
Genomic and transcriptomic data are quite different in some fundamental aspects. Here is an incomplete list:
Coverage: typical short-read assemblers (including those you mentioned) use k-mer-based graph structures to reconstruct the underlying sequence. For genomes, the k-mer coverage profile is quite distinct: k-mers from non-repetitive regions cluster around the sequencing depth of your sample, erroneous k-mers occur at low frequencies, and k-mers from repeats occur at high frequencies. Assemblers use this spectrum during error correction, graph optimization/evaluation, etc. The underlying assumptions, however, do not hold for RNA-seq data. Here, the abundance of each transcript determines the frequency of the corresponding k-mers, so you get very different spectra.
Structural variants: In a (haploid) genome data set, you don't expect many structural variants, and if you do find some, you often want to merge them into a single haplotype assembly. In transcriptomes, the opposite is the case: alternative splicing produces a plethora of structural variants for the same region. De novo genome assemblers cannot capture this characteristic and will most likely produce individual fragments corresponding to single exons.
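To make the splicing point a bit more concrete, here is a minimal Python sketch (not how any particular assembler works internally). It builds a tiny de Bruijn graph from one made-up transcript, then from two isoforms of the same made-up gene where the second skips an exon. The exon sequences, the value of k and the helper names (`de_bruijn`, `branch_nodes`) are all assumptions chosen for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(seqs, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for seq in seqs:
        for kmer in kmers(seq, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branch_nodes(graph):
    """Nodes with more than one successor, i.e. where a path forks."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Hypothetical exons (made-up sequences, not from any real gene).
exon1 = "ATGGCGTACC"
exon2 = "TTGACCGGAT"
exon3 = "CAGTTCGAAG"

isoform_a = exon1 + exon2 + exon3  # all three exons
isoform_b = exon1 + exon3          # exon 2 skipped

k = 5
single_path = de_bruijn([isoform_a], k)
with_splicing = de_bruijn([isoform_a, isoform_b], k)

print("forks, one isoform: ", branch_nodes(single_path))
print("forks, two isoforms:", branch_nodes(with_splicing))
```

With a single sequence the graph is one unbranched path. Adding the exon-skipping isoform introduces a fork at the end of exon 1. A genome assembler tends to treat such forks as errors or repeats and cut the graph there, whereas a transcriptome assembler has to keep both paths and report them as separate isoforms.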
(There are other transcriptome assemblers, e.g. Oases.)
updated 23 months ago by Ram • written 9.5 years ago by thackl
You should clarify how you are differentiating between a "de novo assembler" and a "transcript assembler". I feel like there might be some confusion about the usage of these terms. Do you mean genome assembler vs. transcriptome assembler?
The biggest difference between genome and transcriptome assembly is coverage. Barring repetitive or highly conserved regions, a genome would ideally have even coverage. A transcriptome, on the other hand, has different coverage for each transcript, depending on its expression. Think of each transcript as a whole "genome" and of a transcriptome assembly as trying to assemble many mini-genomes from a pool of mixed genomic reads (like a metagenomic assembly).
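A toy simulation may help illustrate the coverage point. This is only a sketch under made-up assumptions: three random 200 bp "transcripts", error-free 50 bp reads, k = 15, and read counts of 5, 50 and 500 standing in for expression levels. None of these numbers come from real data, and all function names are hypothetical.

```python
import random
from collections import Counter


def random_sequence(length, rng):
    """Random DNA string; stands in for a transcript."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def sample_reads(seq, n_reads, read_len, rng):
    """Draw error-free reads from uniformly random start positions."""
    starts = [rng.randrange(len(seq) - read_len + 1) for _ in range(n_reads)]
    return [seq[s:s + read_len] for s in starts]


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def mean_kmer_depth(seq, counts, k):
    """Average read-derived count over all k-mers of seq."""
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return sum(counts[km] for km in kms) / len(kms)


rng = random.Random(0)  # fixed seed so the run is repeatable
K, READ_LEN = 15, 50

# Made-up expression levels: number of reads drawn per transcript.
abundance = {"low": 5, "mid": 50, "high": 500}
transcripts = {name: random_sequence(200, rng) for name in abundance}

reads = []
for name, seq in transcripts.items():
    reads += sample_reads(seq, abundance[name], READ_LEN, rng)

counts = count_kmers(reads, K)
depths = {name: mean_kmer_depth(seq, counts, K)
          for name, seq in transcripts.items()}

for name, depth in depths.items():
    print(f"{name}: mean k-mer depth = {depth:.1f}")
```

The three mean k-mer depths end up about two orders of magnitude apart even though all reads come from the same library. A genome assembler that expects one coverage peak could mistake the k-mers of the lowly expressed transcript for sequencing errors and those of the highly expressed one for repeats.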
@Damian Kao: You are right that a transcript(ome) assembler is technically a de novo assembler. I used the term "de novo assembler" in its common sense. That said, I think I got the logic right: not all de novo assemblers are transcript assemblers. Doesn't "genome assembler" also include both "reference assembler" and "de novo assembler"? I am not trying to argue, I just want to understand the point.
Thanks for the hint about coverage and the analogy of transcripts as mini-genomes.
I think that a distinction between genome and transcriptome assemblers is more informative. All genome assemblers are "de novo assemblers", whereas transcriptome assemblers can be classified into "de novo assemblers" and "reference-based assemblers".
There are actually several de novo transcriptome assemblers (SOAPdenovo-Trans, Trans-ABySS, Trinity, Velvet-Oases), and the only reference-based transcriptome assemblers I can think of are Cufflinks and the more recent StringTie.
I am aware of another transcriptome assembler: IDBA-Tran.