Question

Does a high-quality de novo transcriptome assembly rely on merging multiple assemblies?

0

10.2 years ago

seta ★ 1.9k

Dear all,

Please share your experience with combining multiple assemblies (from different k-mer sizes or different programs) to produce the best de novo transcriptome assembly and, subsequently, a high-quality annotation. I have assembled about 400 million Illumina reads (100 bp paired-end) with CLC using several k-mer sizes (10 assemblies in total). I am going to try Trinity as well, and then combine these assemblies into a single, highly informative one for a non-model organism that has little information in public databases. It would also be great if you could mention the tool you consider best for combining assemblies. Any suggestions and comments would be highly appreciated.

Assembly next-gen RNA-Seq • 5.7k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by seta ★ 1.9k

0

Hi seta, thank you for your question on transcriptome assembly tools and pipelines. May I suggest that you add some more information about the actual data and experimental settings in order to add some 'flesh' to your question?

In bioinformatics, or science in general, there is often no "best", "optimal" or "perfect" tool for a (weakly defined) task. Instead, the optimal tool depends on many factors, including the data, the experimental question, computational complexity and constraints, and the like. Asking for the best tool without context is therefore pointless and unscientific in my understanding; it might lead to flame wars and subjective discussions, and quickly go out of focus.

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by Michael 55k

0

Thanks, Michael, for correcting me! I added some information. That's right, there is no single best or perfect tool, but someone may have found that one program or tool is much more efficient than the others.

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by seta ★ 1.9k

Ram · Answer 1 · 2015-02-11

1

10.2 years ago

rtliu ★ 2.2k

See similar post: merged transcripts from RNA de novo assembly to create a reference transcriptome

Corset uses both sequence similarity and expression data to cluster contigs, which is why it does a better job than CD-HIT-EST.

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by rtliu ★ 2.2k

Ram · Answer 2 · 2015-02-11

Hi Seta, CD-HIT-EST clustering is the best in class for merging multiple transcriptome assemblies. It simply keeps the longer sequences and removes the partial/subset/shorter sequences. There are a lot of parameters to play with; I would recommend using only -s (the shorter sequences need to be at least XX% of the length of the representative of the cluster). Hope this helps.
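To make the meaning of the -s length cutoff concrete, here is a minimal Python sketch. It only illustrates the length rule described above; it is not CD-HIT-EST code, and the function name, contig names and lengths are made up for the example.

```python
def meets_length_cutoff(seq_len: int, representative_len: int, s: float) -> bool:
    """True if a shorter sequence is long enough to join the representative's cluster."""
    return seq_len >= s * representative_len


# Hypothetical lengths (bp); the representative is the longest sequence of the cluster.
representative_len = 2000
candidates = {"contig_a": 1900, "contig_b": 1500, "contig_c": 400}

s = 0.9  # shorter sequences must be at least 90% of the representative's length
eligible = [
    name
    for name, length in candidates.items()
    if meets_length_cutoff(length, representative_len, s)
]
print(eligible)  # ['contig_a']
```

With a cutoff of 0.9, only contig_a could be absorbed into this cluster. As far as I understand it, the shorter contigs are not merged into it and so remain in the output, which is worth keeping in mind when choosing the value.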

Ram · Answer 3 · 2015-02-11

0

10.2 years ago

Brian Bushnell 20k

I wrote a tool, Dedupe, for merging multiple assemblies and removing redundant contigs. It's designed specifically for this purpose, with various options for controlling which sequences are considered duplicates based on contig overlap length, number of substitutions, edits, and so forth. The basic usage is like this:

dedupe.sh in=assembly1.fa,assembly2.fa,assembly3.fa out=merged.fa

...which will just eliminate all exact duplicate or fully-contained subsequences. You can get complete usage information by running dedupe.sh with no arguments.
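For readers who want to see what "exact duplicate or fully-contained subsequence" means in practice, here is a minimal Python sketch of the idea. It is a naive illustration, not how Dedupe is implemented; the function names and toy contigs are hypothetical, and treating reverse complements as matches is an assumption made for the example.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    complement = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(complement)[::-1]


def remove_contained(contigs: dict[str, str]) -> dict[str, str]:
    """Drop contigs that duplicate, or are fully contained in, a contig already kept.

    Naive O(n^2) substring search, for illustration only.
    """
    # Visit the longest contigs first so each contig is compared against longer ones.
    ordered = sorted(contigs.items(), key=lambda item: len(item[1]), reverse=True)
    kept: dict[str, str] = {}
    for name, seq in ordered:
        seq = seq.upper()
        rc = reverse_complement(seq)
        # Skip the contig if it, or its reverse complement, occurs inside a kept contig.
        if any(seq in longer or rc in longer for longer in kept.values()):
            continue
        kept[name] = seq
    return kept


merged = remove_contained({
    "asm1_c1": "ATGGCGTACGTTAGC",
    "asm2_c1": "ATGGCGTACGTTAGC",  # exact duplicate of asm1_c1
    "asm2_c2": "GCGTACG",          # fully contained in asm1_c1
    "asm3_c1": "CGTACGC",          # reverse complement of GCGTACG, so also contained
    "asm3_c2": "TTTTGGGGCCCC",     # unique, kept
})
print(sorted(merged))  # ['asm1_c1', 'asm3_c2']
```

Note that nothing is assembled or altered here: every surviving contig is one of the inputs, unchanged apart from upper-casing.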

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by Brian Bushnell 20k

0

Thanks for all the comments. Dear Brian, could you please share a paper or document that explains the Dedupe tool in detail? It looks great for my purpose, but it would be better to know how it was tested. Honestly, I found EvidentialGene's tr2aacds for merging transcriptome assemblies, along with some valid papers that used it. However, one user mentioned that some metrics, such as N50, the CEGMA results, and the percentage of reads mapped back to the transcripts, were significantly worse for the tr2aacds output than for the individual assemblies. So I'm in some doubt about merging assemblies. I would highly appreciate hearing from any users about whether merging several assemblies turned out to be a good idea or not; please let us know your findings in detail. Thanks a lot for your consideration.
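For anyone who wants to compare N50 before and after merging, a minimal Python sketch follows. The contig lengths are invented for illustration and the function is a plain textbook N50, not taken from any of the tools mentioned here.

```python
def n50(lengths: list[int]) -> int:
    """Return the length L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    # Accumulate contig lengths from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig lengths must be positive")


# Hypothetical contig lengths (bp) for one assembly and for a merged set
# that gained several shorter, partly redundant contigs.
single_assembly = [3000, 2000, 1000, 500, 500]
merged_assembly = [3000, 2000, 1000, 900, 900, 800, 800, 500, 500, 500, 500]

print(n50(single_assembly))  # 2000
print(n50(merged_assembly))  # 1000
```

In this toy case the N50 drops after merging simply because many shorter contigs were added, so a lower N50 on its own does not show that the merged set is worse.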

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by seta ★ 1.9k

0

Dedupe has been used in production on all metagenomic assemblies at JGI for about a year. If you want to merge multiple assemblies of the same data, I highly recommend it; it will only remove redundant contigs, not do any assembly and not try to combine different contigs. So, if a read mapped to the un-merged assembly, it will still map to the merged assembly just as well, since no unique sequences are removed or altered (with the default settings).

That said, whether or not it is a good idea to generate and merge multiple assemblies is a different question. The approach will generally lead to some redundancy that Dedupe won't remove (because neither sequence fully contains the other, or they don't match perfectly), so you'll end up with a larger-than-expected assembly.

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by Brian Bushnell 20k

0

Thanks, Brian, for the clarification. I'm working on a plant transcriptome assembly, so I don't think Dedupe is suitable for me, as I need a tool that generates a final assembly after removing identical contigs. I'm still waiting for suggestions and experience from other users.

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by seta ★ 1.9k

0

You can run Dedupe in a mode that will only remove identical contigs, just by adding the flag ac=f which turns off looking for contained subsequences.

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by Brian Bushnell 20k