Hi, I have a new bacterial strain whose genome was sequenced using Illumina.
The reads are paired-end. I have done de novo assembly and generated contigs with a minimum length of 200 using CLC Genomics Workbench and online servers.
Now my aim is to assemble these contigs into a whole genome sequence. Is there any software (for Windows 7) or online server that can assemble the contigs into a single genome?
Refrain from working with CLC gw unless you are not familiar with Linux at all. You could have used SOAPdenovo for bacterial genome assembly.
Anyway, are you sure that you have contigs from CLC? Look at this image:
What you get from CLC is a FASTA file containing scaffolds. You can check this by exporting the FASTA file, opening it in a text editor (since you are working in Windows) and looking for 'n/N' characters in the sequence, which represent gaps.
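If the exported file is too large to scan by eye, the same check can be scripted. The following is only a minimal sketch in plain Python: the helper names `read_fasta` and `find_gaps` and the tiny in-memory example are made up for illustration and are not part of CLC or any other tool.

```python
import re
from io import StringIO

def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def find_gaps(sequence):
    """Return (start, length) for every run of N/n (0-based start)."""
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"[Nn]+", sequence)]

# Hypothetical two-record example standing in for the exported file.
example = StringIO(">scaffold_1\nACGTNNNNACGT\nACnnGT\n>contig_2\nACGTACGT\n")
report = {name: find_gaps(seq) for name, seq in read_fasta(example)}

for name, gaps in report.items():
    label = "contains gaps (scaffold)" if gaps else "no gaps (contig)"
    print(name, label, gaps)
```

To run it on a real export, replace the `StringIO` example with `open("your_file.fasta")`; the file name here is a placeholder.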
It is not possible to get a single sequence representing the entire genome (for obvious reasons of shotgun sequencing). However, it is possible to judge the quality of the assembly. Check out these posts here and here.
PacBio data can produce a single contig representing the entire genome.
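One commonly reported measure of assembly contiguity is N50. It is not named in the posts above, so treat it here as an assumed example of "judging quality": the length of the contig at which the running total, counted from the longest contig down, first reaches half of the total assembly length. A minimal sketch, with made-up contig lengths:

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Hypothetical contig lengths, for illustration only.
contig_lengths = [80, 70, 50, 40, 30, 20]
summary = {
    "contigs": len(contig_lengths),
    "total_length": sum(contig_lengths),
    "longest": max(contig_lengths),
    "n50": n50(contig_lengths),
}
print(summary)
```

A higher N50 for a similar total length generally means a less fragmented assembly, but it says nothing about correctness, so it is best read alongside other checks.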
Refrain from working with CLC gw unless you are not familiar with
Linux at all.
Vijay Lakhujani: It is not appropriate to tell other users what they should or should not do, since we don't know their circumstances. CLC gw is a perfectly valid option for users restricted to using Windows. CLC has been around for many years and is actively developed/supported.
You can certainly suggest other/better software options if you want to help.
It is not possible to get a single sequence representing the entire
genome (for obvious reasons of shotgun sequencing)
That is also not correct. With bacterial genomes it is certainly possible to get a single contig representing the entire genome, provided one had the right kind of libraries/coverage.
CLC gw is a perfectly valid option for users restricted to using
Windows.
I might be opening another debate here (open source vs. commercial software). Commercial tools often hide fine algorithmic details for obvious trade/business reasons. The downside of adopting a commercial solution is, inevitably, some loss of flexibility and configurability. A significant danger is the temptation to simply apply a pre-configured workflow and treat it as a "black box" without fully considering or understanding whether each step is appropriate for a particular project's objectives and datasets. Additionally, commercial software tends to replace simple scientific keywords with other terms (for example, "kmer" with "word"), which can confuse users even though the terms mean the same thing. Not everything hardcoded inside is disclosed, which forces users to have blind faith in the software.
On the other hand, open source code is publicly available and can easily be modified as needed. Additionally, open source tools have prescriptive published protocols. Everything is clearly understood, and any error or deviation from expectations can be tracked.
With bacterial genomes it is certainly possible to get a single contig
representing the entire genome, provided one had the right kind of
libraries/coverage.
It's possible in rare circumstances where one is ready to pay for the "required coverage" to get an assembly in one contig.
In addition, if you have a very closely related species (at the genomic level) with a completely assembled genome, you can use it as a reference to assemble your reads (spades or idba_hybrid; in Linux, from the command line). But with paired-end reads it would be very difficult (or even impossible).
If you have a reference, I would recommend using scaffold builder (or better, scaffold builder from SourceForge) to map your contigs against a close reference.
This post was cited in https://magiduck.github.io/DAGGER/ "Interactive graph-based visualization of genome architecture comparisons "
Thanks for sharing this information, Pierre Lindenbaum.