Question

How to assemble contigs?

0

8.2 years ago

Paul ▴ 80

Hi, I have a new bacterial strain whose genome was sequenced using Illumina.

The reads are paired-end. I did a de novo assembly and generated contigs with a minimum length of 200 using CLC Genomics Workbench and online servers.

Now my aim is to assemble these contigs into a whole-genome sequence. Is there any software (for Windows 7) or online server to assemble the contigs into a single genome?

denovo Assembly sequencing contigs • 16k views

ADD COMMENT • link updated 8.2 years ago by vmicrobio ▴ 290 • written 8.2 years ago by Paul ▴ 80

0

This post was cited in https://magiduck.github.io/DAGGER/ "Interactive graph-based visualization of genome architecture comparisons"

ADD REPLY • link 5.5 years ago by Pierre Lindenbaum 166k

0

Thanks for sharing this information Pierre Lindenbaum

ADD REPLY • link 5.5 years ago by lakhujanivijay 5.9k

score 2 · Answer 1 · 2017-05-17

2

8.2 years ago

lakhujanivijay 5.9k

Refrain working with CLC gw until and unless you are not familiar with linux at all. You could have used SOAPdenovo for bacterial genome assembly.

Anyway, are you sure that you have contigs from CLC? Look at this image:

enter image description here

What you get from CLC is a FASTA file containing scaffolds. You can check this by exporting the FASTA file, opening it in a text editor (since you are working in Windows), and looking for 'n/N' characters in the sequence, which represent gaps.
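For a file too large to inspect by eye, the same check can be scripted. The following is only a minimal sketch in Python, assuming the export is plain FASTA. The function names (`read_fasta`, `find_gaps`, `gaps_per_record`) and the example records are hypothetical and not part of CLC.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_gaps(sequence: str, min_run: int = 1) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) spans of runs of N/n."""
    pattern = re.compile(r"[Nn]{%d,}" % min_run)
    return [match.span() for match in pattern.finditer(sequence)]


def gaps_per_record(lines: Iterable[str], min_run: int = 1) -> Dict[str, List[Tuple[int, int]]]:
    """Map each record name to the list of gap spans found in it."""
    return {name: find_gaps(seq, min_run) for name, seq in read_fasta(lines)}


# Made-up records for illustration: the first contains a gap, the second does not.
example = [">seq_1 example", "ACGTNNNN", "NNACGT", ">seq_2", "ACGT"]
for record, gaps in gaps_per_record(example).items():
    label = "scaffold-like (contains gaps)" if gaps else "no gaps found"
    print(record, label, gaps)
```

To run it on a real export, pass an open file handle instead of the list, for example `gaps_per_record(open("assembly.fasta"))`, where `assembly.fasta` is a placeholder file name. Records without any N runs are gap-free sequences, while records with N runs were joined across gaps.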

It is not possible to get a single sequence representing the entire genome (for obvious reasons of shotgun sequencing). However, it is possible to judge the quality of assembly. Check out these posts here and here.
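One widely used summary of assembly contiguity is N50: the length L such that sequences of length L or longer together cover at least half of the total assembly length. Below is a minimal sketch in Python. The function name `n50` and the example lengths are made up for illustration. N50 measures contiguity only and says nothing about whether the assembly is correct.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths.

    Returns 0 for an empty collection.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first length where the cumulative sum reaches half the total.
        if 2 * running >= total:
            return length
    return 0


# Made-up contig lengths for illustration.
contig_lengths = [100, 200, 300, 400]
print("N50:", n50(contig_lengths))
```

The lengths can be taken from any FASTA parser. Comparing N50 together with the number of contigs and the total assembled length across several assemblies of the same reads gives a rough first impression of which one is more contiguous.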

PacBio data can produce a single contig representing the entire genome.

ADD COMMENT • link 8.2 years ago by lakhujanivijay 5.9k

2

Refrain working with CLC gw until and unless you are not familiar with linux at all.

Vijay Lakhujani: It is not appropriate to tell other users what they should or should not do since we don't know their circumstances. CLC gw is a perfectly valid option for users restricted to using Windows. CLC has been around for many years and is actively developed/supported.

You can certainly suggest other/better software options if you want to help.

It is not possible to get a single sequence representing the entire genome (for obvious reasons of shotgun sequencing)

That is also not correct. With bacterial genomes it is certainly possible to get a single contig representing the entire genome, provided one had the right kind of libraries/coverage.

ADD REPLY • link 8.2 years ago by GenoMax 152k

0

CLC gw is a perfectly valid option for users restricted to using Windows.

I might be opening another debate here (open source vs. commercial software). Commercial tools often hide fine algorithmic details for obvious trade/business reasons. The downside of adopting a commercial solution is, inevitably, some loss of flexibility and configurability. A significant danger is the temptation to simply apply a pre-configured workflow and treat it as a "black box" without fully considering or understanding whether each step is appropriate for a particular project's objectives and datasets. Additionally, commercial software tends to replace standard scientific terms with others (for example, "kmer" with "word"), which can confuse users even though they mean the same thing. Not everything hardcoded inside is disclosed, which forces users to put blind faith in the software.

On the other hand, open-source code is publicly available and can easily be modified as needed. Additionally, open-source tools have prescriptive published protocols. Everything can be clearly understood, and any error or deviation from expectations can be tracked.

With bacterial genomes it is certainly possible to get a single contig representing the entire genome, provided one had the right kind of libraries/coverage.

It's possible in rare circumstances where one is ready to pay for the "required coverage" to get an assembly in one contig.

ADD REPLY • link 8.2 years ago by lakhujanivijay 5.9k

0

In addition, if you have a very closely related species (at the genomic level) with a completely assembled genome, you can use it as a reference to assemble your reads (SPAdes or idba_hybrid, on the Linux command line). But with paired-end reads it would be very difficult (or even impossible).

ADD REPLY • link 8.2 years ago by Buffo ★ 2.4k

score 0 · Answer 2 · 2017-05-17

0

8.2 years ago

vmicrobio ▴ 290

If you have a reference, I would recommend using Scaffold_builder (or, better, the Scaffold_builder release on SourceForge) to map your contigs against a close reference.

ADD COMMENT • link 8.2 years ago by vmicrobio ▴ 290