What Improvements Would You Recommend For This Genome Scaffolding Software?
5
19
13.7 years ago
Michael Barton ★ 1.9k

I've written a software tool that allows genome scaffolds to be reliably reproduced by writing the set of instructions to build the scaffold as a domain-specific language. The software, "Scaffolder," parses this instruction file, fetches the corresponding contig sequences, and joins them together into a continuous super-sequence. Keeping the contig-joining instructions in a separate file decouples the sequence data from the steps required to build the scaffold.
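
To make the idea concrete, here is a minimal Python sketch of the parse, fetch and join steps. It is not Scaffolder's own code: the instructions are held as plain Python dictionaries rather than read from a file, and the key names (`sequence`, `source`, `start`, `stop`, `reverse`, `unresolved`, `length`) and the 1-based inclusive coordinates are assumptions made for illustration.

```python
# Illustrative only: the instruction layout and key names are assumptions,
# not Scaffolder's documented format.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_scaffold(instructions, contigs):
    """Join contigs and gaps in the order given by the instructions."""
    parts = []
    for entry in instructions:
        if "sequence" in entry:
            spec = entry["sequence"]
            seq = contigs[spec["source"]]
            # start/stop are treated as 1-based, inclusive coordinates
            start = spec.get("start", 1)
            stop = spec.get("stop", len(seq))
            seq = seq[start - 1:stop]
            if spec.get("reverse", False):
                seq = reverse_complement(seq)
            parts.append(seq)
        elif "unresolved" in entry:
            # A gap of known length is written as a run of Ns
            parts.append("N" * entry["unresolved"]["length"])
        else:
            raise ValueError(f"unknown instruction: {entry!r}")
    return "".join(parts)


contigs = {"contig1": "ATGCCC", "contig2": "GGGTTA"}
instructions = [
    {"sequence": {"source": "contig1"}},
    {"unresolved": {"length": 4}},
    {"sequence": {"source": "contig2", "start": 2, "stop": 5, "reverse": True}},
]
scaffold = build_scaffold(instructions, contigs)
```

Because the output depends only on the instructions and the contig sequences, running the same inputs again always gives the same scaffold.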

I'm writing on BioStar because I hope this software will be useful to the bioinformatics and genomics community. Any patches, comments or constructive criticism will improve the software and, ideally, make it a useful resource.

In addition, this software has been submitted to the journal Open Research Computation, so any comments made on this question feed directly into the peer-review process for the article. I believe this could be an interesting approach to peer review and will add to the suggestions made by the two reviewers.

Please separate suggestions into individual answers so they can be voted on individually. Multiple answers and votes are very welcome.

genome scaffolding next-gen-sequencing • 9.9k views
ADD COMMENT
2

You should probably include a discussion of the pros/cons of your YAML file format vis-a-vis the standard AGP file format in the manuscript.

ADD REPLY
1

The name Scaffolder has already been used for scaffolding software in the original Celera WGS assembler written by Gene Myers: http://www.sciencemag.org/content/287/5461/2196.abstract :(

ADD REPLY
0

Can I vote twice? :-)

ADD REPLY
0

Vote as many times as you like :) I feel I'm in unexplored territory.

ADD REPLY
0

A different name would be useful then to distinguish the software. I spent a while originally trying to think of different names but Scaffolder was the best I could come up with.

ADD REPLY
0

Thanks for the suggestion on AGP. I'll look into this format in more detail. Is there a tool that converts AGP into the corresponding scaffold sequence?

ADD REPLY
0

How about "contigs2scaffolds" or "Scaffixer", in honor of the first patented scaffolding technology: http://www.scaffoldersforum.com/scaffolders-forum/2089-history-scaffolding.html. Other potential scaffolding related terminology can be found here.

ADD REPLY
0

Thanks Casey. Scaffolding related terms are an excellent idea. :)

ADD REPLY
6
13.7 years ago
Nick Loman ▴ 110

I would like Scaffolder to be able to create a starter YAML file from an AGP file, a format produced by Newbler amongst others.

Description of the AGP file is here: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml

ADD COMMENT
0

Thanks Nick. It should be relatively straightforward to write a conversion script for AGP to YAML. Are there any other common formats in addition to AGP?
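
A rough sketch of what such a conversion could look like, in Python. It reads the tab-delimited AGP columns (object, object start, object end, part number, component type, then either component ID, start, end and orientation, or gap length and gap details) and groups the rows by object. The output dictionaries and their key names are assumptions for illustration, not Scaffolder's confirmed schema; writing them out as YAML is left to a YAML library.

```python
def agp_to_entries(agp_text):
    """Convert AGP rows into a dict of scaffold name -> list of entries.

    The entry layout (sequence/unresolved dictionaries) is hypothetical.
    """
    scaffolds = {}
    for line in agp_text.splitlines():
        # Skip blank lines and AGP comment/header lines
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        entries = scaffolds.setdefault(cols[0], [])
        component_type = cols[4]
        if component_type in ("N", "U"):
            # Gap row: column 6 holds the gap length
            entries.append({"unresolved": {"length": int(cols[5])}})
        else:
            # Component row: ID, start, end, orientation in columns 6-9
            spec = {
                "source": cols[5],
                "start": int(cols[6]),
                "stop": int(cols[7]),
            }
            if cols[8] == "-":
                spec["reverse"] = True
            entries.append({"sequence": spec})
    return scaffolds


agp = "\n".join([
    "# example AGP rows",
    "scaffold1\t1\t6\t1\tW\tcontig1\t1\t6\t+",
    "scaffold1\t7\t10\t2\tN\t4\tscaffold\tyes\tpaired-ends",
    "scaffold1\t11\t14\t3\tW\tcontig2\t2\t5\t-",
])
scaffolds = agp_to_entries(agp)
```

The sketch trusts the row order in the file and does not validate the object coordinates or part numbers; a real converter would want to check those.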

ADD REPLY
0

Not that I'm aware of. But one case to be aware of is short-read assemblers like Velvet, which produce "scaffolded contigs": contigs separated by runs of Ns of known length. It would be quite nice to turn those into a YAML description too.
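
One way this could work without any alignment is to split each scaffolded contig on its runs of Ns and record the gap lengths. The Python sketch below does that; the generated contig names, the `min_gap` parameter and the entry layout are all assumptions for illustration, and the original contig names from the assembler are not recovered.

```python
import re


def split_on_gaps(scaffold_name, sequence, min_gap=1):
    """Split a scaffolded contig into contigs and gap entries.

    Returns (contigs, entries): a dict of generated contig name -> sequence
    and an ordered list of hypothetical scaffold entries.
    """
    contigs = {}
    entries = []

    def add_contig(seq):
        # Contig names are invented here, numbered in scaffold order
        contig_name = f"{scaffold_name}_contig{len(contigs) + 1}"
        contigs[contig_name] = seq
        entries.append({"sequence": {"source": contig_name}})

    position = 0
    for gap in re.finditer(r"[Nn]{%d,}" % min_gap, sequence):
        if gap.start() > position:
            add_contig(sequence[position:gap.start()])
        entries.append({"unresolved": {"length": gap.end() - gap.start()}})
        position = gap.end()
    if position < len(sequence):
        add_contig(sequence[position:])
    return contigs, entries


contigs, entries = split_on_gaps("scaffold1", "ATGCNNNNNGGTANNCC")
```

Raising `min_gap` would keep short runs of ambiguous bases inside a contig rather than treating them as scaffold gaps.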

ADD REPLY
0

Do these assemblers produce AGP output along with the scaffolded contigs? Otherwise I think it would require using sequence alignment to determine which contigs are in which scaffold. Not impossible but more room for error.

ADD REPLY
5
13.7 years ago

It would be cool if there was a tool to convert contig-relative coordinates (like those found in GFF3 files) to scaffold-relative coordinates and back using just the Scaffolder file.
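
For the simplest case the arithmetic is just an offset per contig. The Python sketch below assumes every contig is used whole and in forward orientation, uses 1-based inclusive coordinates as GFF3 does, and takes the scaffold layout as a list of `(name, length)` pairs with `None` as the name for gaps. That layout and the function names are hypothetical; trimmed or reversed contigs would need extra handling.

```python
def contig_offsets(layout):
    """Map each contig name to (offset, length) within the scaffold.

    layout is a list of (name, length) in scaffold order; gaps use None
    as the name. offset is the number of bases preceding the contig.
    """
    offsets = {}
    position = 0
    for name, length in layout:
        if name is not None:
            offsets[name] = (position, length)
        position += length
    return offsets


def to_scaffold(offsets, contig, start, stop):
    """Convert contig-relative coordinates to scaffold-relative ones."""
    offset, _ = offsets[contig]
    return start + offset, stop + offset


def to_contig(offsets, start, stop):
    """Convert scaffold-relative coordinates back to a contig and position."""
    for name, (offset, length) in offsets.items():
        if offset < start and stop <= offset + length:
            return name, start - offset, stop - offset
    raise ValueError("feature does not lie within a single contig")


layout = [("contig1", 6), (None, 4), ("contig2", 6)]
offsets = contig_offsets(layout)
on_scaffold = to_scaffold(offsets, "contig2", 2, 5)
back_again = to_contig(offsets, *on_scaffold)
```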

ADD COMMENT
0

Thanks Jeremy. This is very important and something I have been thinking about. If you had a set of contigs that were already annotated, then joining them into a scaffold should produce the corresponding set of combined gene annotation locations. This would make it much simpler to rebuild and update an annotated genome. There is one hurdle to this though. The original annotated contig sizes may be edited in the draft scaffold, which would require changing all the gene coordinates downstream of that point. This is by no means impossible though and is something I've already tried hacking toge

ADD REPLY
0

Started working on this http://bit.ly/fNQiaf . Any suggestions are welcome http://bit.ly/i7LlmA.

ADD REPLY
4
13.7 years ago
Nick Loman ▴ 110

A common problem I have is that I will make an assembly - join contigs together and then leave it in draft form.

Later on, there may be an update to the assembler I use (usually Newbler) - or I may get some new data - and I will re-run the assembly. Often the assembly is materially the same - perhaps a bit improved - but the contig names will have changed.

It would be great if Scaffolder had a way to port my joins from the original assembly identifiers to the new assembly, handling unambiguous joins automatically but flagging any potential discrepancies.

ADD COMMENT
0

It should be possible to find identical contigs just by hashing each contig's sequence. Very similar contigs might be identified using an alignment algorithm. Based on this it should be possible to contrast the sequences between builds and highlight differences.
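
A minimal Python sketch of the hashing step, assuming both assemblies are available as dictionaries of contig name to sequence. It hashes a strand-independent form of each sequence (the lexicographically smaller of the sequence and its reverse complement) so that a contig reported on the opposite strand still matches, and it flags anything without exactly one identical partner. The function names and return shape are hypothetical, and near-identical contigs are not handled; those would need alignment.

```python
import hashlib

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def sequence_key(seq):
    """Hash a sequence so that both strands give the same key."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    canonical = min(seq, reverse)
    return hashlib.sha1(canonical.encode("ascii")).hexdigest()


def match_contigs(old, new):
    """Map old contig names onto identical contigs in a new assembly.

    Returns (renamed, flagged). renamed maps old name -> (new name, flipped),
    where flipped is True if the new contig is the reverse complement.
    flagged lists old names with no match or more than one match.
    """
    new_by_key = {}
    for name, seq in new.items():
        new_by_key.setdefault(sequence_key(seq), []).append((name, seq.upper()))
    renamed = {}
    flagged = []
    for name, seq in old.items():
        hits = new_by_key.get(sequence_key(seq), [])
        if len(hits) == 1:
            new_name, new_seq = hits[0]
            renamed[name] = (new_name, new_seq != seq.upper())
        else:
            flagged.append(name)
    return renamed, flagged


old = {"contig00001": "ATGCCC", "contig00002": "GGGTTA", "contig00003": "AAAA"}
new = {"c1": "TAACCC", "c2": "ATGCCC", "c3": "AAAT"}
renamed, flagged = match_contigs(old, new)
```

The flagged names are the ones that would need a manual look or an alignment-based comparison before their joins are ported.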

ADD REPLY
4
13.7 years ago

It would be nice to include provision for describing circular genomes in your YAML file.

ADD COMMENT
0

Thanks Casey. That's a good suggestion. So far I have been writing circular genomes in Scaffolder by splitting the first contig so that the origin of replication appears first in the file.

I could add an 'origin:coordinate' attribute which would define where the genome should start in the fasta file. Would that match your suggestion?

ADD REPLY
0

This would be good to add, but I was thinking more of a global attribute that somehow describes that the scaffold is circular and that the last contig connects to the first.
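
The two ideas could be combined: a global circular flag says the last contig joins the first, and an origin coordinate then only makes sense when that flag is set. A small Python sketch of the rotation follows, assuming a 1-based origin coordinate on the joined sequence. The function and parameter names are hypothetical and do not describe an existing Scaffolder feature.

```python
def finish_scaffold(sequence, circular=False, origin=1):
    """Rotate a circular scaffold so that it starts at the origin.

    origin is a 1-based coordinate on the joined sequence. A linear
    scaffold is returned unchanged and may not set an origin.
    """
    if not circular:
        if origin != 1:
            raise ValueError("an origin only applies to circular scaffolds")
        return sequence
    if not 1 <= origin <= len(sequence):
        raise ValueError("origin lies outside the scaffold")
    index = origin - 1
    # The bases before the origin wrap around to the end
    return sequence[index:] + sequence[:index]


rotated = finish_scaffold("ATGCCCGGG", circular=True, origin=4)
```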

ADD REPLY
4
13.7 years ago

In the manuscript, I cannot find a discussion of other, similar software. What do the sequencing centers use to get their scaffolds? In which way is scaffolder different from the existing software?

ADD COMMENT
0

I'd definitely say this is an important aspect of scene setting for the paper. (Full disclosure: I'm editor in chief of ORC.)

ADD REPLY
0

Thanks Max. AFAIK the only other option for generating a scaffold from manually written configuration files is BAMBUS - http://bit.ly/gplWvH. If you know of any other software that does this, suggestions are very welcome.

ADD REPLY
0

In short, scaffolder aims to take the manual process of producing a larger sequence from individual contigs and make it versionable and reproducible. Write the scaffold file, run scaffolder and you will always get the same output sequence.

ADD REPLY
0

I Googled for "scaffolding software bioinformatics" and found this one: SSPACE http://www.ncbi.nlm.nih.gov/pubmed/21149342. The paper also states that SOAP and ABySS have their own "built-in" scaffolders.

In which way is your Scaffolder different from Bambus?

Are these different from SOPRA (PMID 20576136)?

I remember that the stone-age tool consed must have had a textfile format to do the scaffolding. Is this different from SSPACE or Scaffolder?

Is there any good reason to deviate from the well-established AGP format?

ADD REPLY
0

I can't read the SSPACE article as it's behind a paywall. From the abstract it appears that SSPACE algorithmically joins separate contigs using paired read data. Similarly SOPRA also uses paired read data to join unassembled contigs into a larger sequence.

ADD REPLY
0

Scaffolder provides no algorithms for scaffolding as there are many tools for this already available. I've tried to review most of these in the introduction of the manuscript. The aim of scaffolder is instead to allow manual editing of genome scaffolds using the readable YAML syntax. The scaffold fasta sequence can then be reliably reproduced from this scaffold syntax file.

ADD REPLY
0

Compared with Bambus, Scaffolder focuses solely on allowing the manual editing and joining of contigs to produce a genome scaffold. I believe Scaffolder may also be easier to install since it only requires one command line call to the rubygems package management system.

ADD REPLY
0

Consed requires signing an academic user agreement and providing your IP address so that you can download the software. For commercial use the software has to be paid for. In comparison, Scaffolder is open-source and MIT licensed.

ADD REPLY
0

Comparison with the AGP format requires a longer description. Essentially, though, AGP describes how scaffolds are composed of their constituent contigs but, as far as I know, there are no tools that can take AGP as input and produce the described scaffold as output. Therefore you can't edit and build scaffolds using the AGP format as a base. Building a script to convert between Scaffolder and AGP files would, however, allow this. I would also argue that YAML-based formats are easier to read and edit than tab-delimited formats.

ADD REPLY
