How Easy Is It To Carry Out De Novo Sequence Assembly?
10.8 years ago
a1ultima ▴ 850

This is not so much a question of HOW, but more of a question of "is it worth the hassle?".

Today a colleague of mine asked the following question:

"Assuming I need to build a fish chromosome from scratch, with short reads but no reference whatsoever [de novo assembly]:

  • how much work is that?
  • Is there a generic software (like SAMtools) that will align the reads in a scaffold one can use?
  • Basically, given a reasonably clear pipeline in terms of software, is it still blood sweat and tears or is it just a matter of getting it on a cluster?"

Very grateful for any suggestions, sources of information (papers), software etc.

general genome sequencing • 3.5k views

In general, with assembly, you should try at least a few different assemblers. The one thing you need in excess is RAM: say at least 128 GB, though more would be better (obviously this depends on the size of the genome, the number and length of reads, etc.).

edit. You could start your reading from here..
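To get a feel for how much data is involved before picking a machine, it helps to work out the average depth of coverage (total sequenced bases divided by genome size). The sketch below is only that arithmetic; the 1 Gbp genome size, 100 bp read length, and read count are made-up illustration values, not figures from this thread, and coverage alone does not predict how much RAM a given assembler will use.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target average coverage."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    genome_size = 1_000_000_000   # assumed 1 Gbp genome
    read_length = 100             # assumed 100 bp reads
    num_reads = 500_000_000       # assumed 500 million reads
    print(expected_coverage(num_reads, read_length, genome_size))
    print(reads_needed(50, read_length, genome_size))
```

The same two functions can be used the other way round: given the reads you already have, check whether the coverage is plausible for assembly before spending days on a run.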


If you search for "denovo assembly" on biostar you will see that there are tons of tools for de novo assembly (but SAMtools is not one of those!). It's a fair amount of work. You need RAM, as 5heikki correctly pointed out. You need to evaluate your results, but even before that, you probably want to build a combined set of libraries (small insert, large insert, maybe overlapping reads).
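On evaluating results: one commonly reported summary statistic is N50, the contig length such that contigs of that length or longer contain at least half of the assembled bases. A minimal sketch, assuming you have already extracted the contig lengths from the assembler's output (the example lengths are invented):

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of all bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    if total <= 0:
        raise ValueError("need at least one contig with positive length")
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Invented contig lengths for illustration only.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

N50 on its own says nothing about correctness (a misassembled contig can be very long), so it is best read alongside other checks when comparing runs from different assemblers.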

10.8 years ago

De novo assembly usually requires a lot of optimization. To a certain extent, tools like VelvetOptimiser can help (and Oases seems to do this sort of task automatically with Velvet), but a good result typically requires a coordinated effort from multiple programs.

VelvetOptimiser: http://bioinformatics.net.au/software.velvetoptimiser.shtml

As an example of the general principle, here is a pipeline for herpesvirus assembly that I found produced a much better result than any de novo assembly tool by itself. That said, I think it is optimized for herpesvirus assembly. It may not be ideal for a much larger fish chromosome.

http://genomics-pubs.princeton.edu/prv/


Hmm, I see. Well, I will send this on, thank you. May I ask, very roughly, how long these procedures take to carry out? A very rough range of time would be better than nothing. Cheers!


That can be somewhat tricky to provide: it will vary greatly depending on how many reads you have (and the expected contig size).

When I was able to run the herpesvirus pipeline without crashing, it took a few days to run each sample. However, I had to sub-sample my reads to avoid crashing (I think I had to reduce my input to less than 20 million reads, if I remember correctly), and the contig length will probably be smaller than yours (I'm assuming you are working on a chromosome that is >>100,000 bp). Some programs will just stall without stopping with an error message. My guess is that a successful pipeline should be able to complete within a week or so, but it is possible that I am underestimating the run time for your own data. However, I would consider finding a server with the largest possible amount of RAM to be the highest priority.
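Sub-sampling reads, as mentioned above, can be done without loading the whole file into memory. The sketch below uses reservoir sampling over FASTQ records; it is not the method used in the herpesvirus pipeline, just one common way to do it. It assumes plain 4-line FASTQ records (no wrapped sequence lines), and the function names are made up for this example.

```python
import random
from itertools import islice


def read_fastq_records(handle):
    """Yield FASTQ records as 4-tuples of lines (assumes 4 lines per record)."""
    handle = iter(handle)
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_records(records, k, seed=0):
    """Pick k records uniformly at random in one pass (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep record i with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


def subsample_fastq(in_path, out_path, k, seed=0):
    """Write k randomly chosen records from in_path to out_path."""
    with open(in_path) as src:
        sample = subsample_records(read_fastq_records(src), k, seed)
    with open(out_path, "w") as dst:
        for record in sample:
            dst.writelines(record)
```

The k sampled records are held in memory, so very large samples still need a fair amount of RAM. For paired-end data, the same read pairs have to be kept in both files; sampling each file independently with this sketch would break the pairing.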

10.8 years ago

De novo, with short reads only? With a vertebrate? Even with sufficient computing power, what you will get are a whole lot of contigs, probably hundreds of thousands, if not more. There is no magic pipeline that will do better than that with short reads alone.


Cheers for the insight, I will pass this message on with the others posted here.
