cardinal2818413 · 6.2 years ago
I would like to purchase powerful machines to assemble a genome (a few Gigabases) similar to the following one.
The opium poppy genome and morphinan production http://science.sciencemag.org/content/early/2018/08/29/science.aat4096.full
Does anybody have recommendations on the hardware configuration, e.g. the number of CPUs, memory size, one big machine vs. multiple smaller machines, solid-state drives vs. hard disk drives? Thanks.
On the hardware side, RAM is the biggest component: get as much as you realistically can. Assembling human-genome-sized data can take upwards of 1-3 TB of RAM, depending on the amount and type of sequence data you are trying to assemble.
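As a hedged back-of-envelope, the RAM range quoted above can be sketched as genome size times coverage times an assumed per-base overhead. The bytes-per-base factor below is purely an illustrative assumption; real usage varies widely with assembler, k-mer size, and error rate.

```python
# Rough peak-RAM guess for a short-read assembly.
# bytes_per_base=10 is an assumption chosen to land in the 1-3 TB
# range quoted for human-sized data, not a measured benchmark.

def estimate_ram_tb(genome_gb, coverage, bytes_per_base=10):
    """Peak RAM in TB: genome size x coverage x assumed bytes per base."""
    total_bases = genome_gb * 1e9 * coverage
    return total_bases * bytes_per_base / 1e12

# A 2.7 Gb genome (opium-poppy scale) at 60x short-read coverage:
print(round(estimate_ram_tb(2.7, 60), 1))  # -> 1.6 (TB)
```

Under these assumptions a few-gigabase genome lands comfortably inside the 1-3 TB range, which is why a high-memory node is usually the first purchase.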
The important thing you left out of the post is what type of data you are going to generate, and how much of it. Are you planning on both short and long reads?
I plan to use both short and long reads. How will using both short and long reads, instead of just short reads, change the hardware requirements?
Short reads are best assembled on a single powerful machine, e.g. 1-2 TB of RAM and 64 cores.
Long-read assemblies are far, far better. They need a cluster for best results, since long-read correction, overlapping, and assembly are all CPU-intensive; on a single machine you could be waiting a long time (weeks). You would also need at least one machine with 512 GB of RAM for this job.
The short reads can then be used to correct the long-read assembly.
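The single-machine-vs-cluster trade-off above is essentially an arithmetic one. The sketch below illustrates it; the 50,000 CPU-hour workload and the 0.7 scaling efficiency are purely illustrative assumptions, since real numbers depend heavily on the assembler, coverage, and read length.

```python
# Wall-clock comparison: one big machine vs. a cluster, for a
# hypothetical long-read correction + assembly workload.
# cpu_hours and efficiency are assumptions, not measurements.

def wall_clock_days(cpu_hours, cores, efficiency=0.7):
    """Days of wall-clock time, assuming imperfect parallel scaling."""
    return cpu_hours / (cores * efficiency) / 24

single_machine = wall_clock_days(50_000, cores=64)    # one 64-core node
cluster = wall_clock_days(50_000, cores=1024)         # 16 such nodes
print(round(single_machine), round(cluster))          # -> 47 3
```

Even with generous assumptions, a single 64-core machine turns a cluster-scale job into a wait of many weeks, which is the point made above.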
How about asking about existing resources at your local compute facility before buying anything yourself?
That would be my advice as well: spend the money on CPU time on a (huge) shared infrastructure instead. It will be much more flexible and efficient than investing in your own local machine (not even taking the maintenance, server room, etc. into account).
Would the transfer of a large amount of data to a shared infrastructure be a bottleneck when the connection is slow?
Depending on the amounts of data that need to be transferred, yes indeed that can be an issue.
On the other hand, the transfer cost might still be outweighed by the benefits of that solution. Moreover, if you're working with datasets that substantial, the extra "waiting" for data transfer will be marginal compared to the required runtime of the software itself.
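To put a rough number on the transfer concern, the time to ship raw reads to a remote cluster can be estimated from the data volume and link speed. The 5 TB dataset, 1 Gbit/s link, and 0.8 efficiency factor below are illustrative assumptions.

```python
# Estimated hours to move a dataset to a shared compute facility.
# data_tb, link_mbps, and efficiency are assumed example values.

def transfer_hours(data_tb, link_mbps, efficiency=0.8):
    """Hours to move data_tb terabytes over a link_mbps connection."""
    bits = data_tb * 1e12 * 8
    usable_bps = link_mbps * 1e6 * efficiency
    return bits / usable_bps / 3600

# 5 TB of reads over a 1 Gbit/s campus link:
print(round(transfer_hours(5, 1000), 1))  # -> 13.9 (hours)
```

Under these assumptions the transfer takes well under a day, which is indeed marginal next to an assembly that runs for days or weeks.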