cardinal2818413 · 6.2 years ago
I would like to purchase powerful machines to assemble a genome (a few Gigabases) similar to the following one.
The opium poppy genome and morphinan production http://science.sciencemag.org/content/early/2018/08/29/science.aat4096.full
Does anybody have recommendations on the hardware configuration, e.g. the number of CPUs, memory size, one big machine vs. multiple smaller machines, solid-state drives vs. hard disk drives? Thanks.
On the hardware side, RAM is the biggest component: get as much as you realistically can. Assembling human-genome-sized data can take upwards of 1-3 TB of RAM, depending on the amount and type of sequence data you are trying to assemble.
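As a hedged back-of-envelope, the RAM range quoted above can be sketched as genome size times coverage times an assumed per-base overhead. The bytes-per-base factor below is purely an illustrative assumption; real usage varies widely with assembler, k-mer size, and error rate.

```python
# Rough peak-RAM guess for a short-read assembly.
# bytes_per_base=10 is an assumption chosen to land in the 1-3 TB
# range quoted for human-sized data, not a measured benchmark.

def estimate_ram_tb(genome_gb, coverage, bytes_per_base=10):
    """Peak RAM in TB: genome size x coverage x assumed bytes per base."""
    total_bases = genome_gb * 1e9 * coverage
    return total_bases * bytes_per_base / 1e12

# A 2.7 Gb genome (opium-poppy scale) at 60x short-read coverage:
print(round(estimate_ram_tb(2.7, 60), 1))  # -> 1.6 (TB)
```

Under these assumptions a few-gigabase genome lands comfortably inside the 1-3 TB range, which is why a high-memory node is usually the first purchase.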
The important thing you left out of the post is what type of data you are going to generate, and how much of it. Are you planning on both short and long reads?
I plan to use both short and long reads. How will using both short and long reads, instead of just short reads, change the hardware requirements?
Short reads are best assembled on a single powerful machine, e.g. 1-2 TB of RAM and 64 cores.
Long-read assemblies are far, far better. They need a cluster for best results, since long-read correction, overlapping, and assembly are all CPU-intensive; on a single machine you could be waiting a long time (weeks). You would also need at least one machine with 512 GB of RAM for this job.
The short reads can then be used to correct the long-read assembly.
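The single-machine-vs-cluster trade-off above is essentially an arithmetic one. The sketch below illustrates it; the 50,000 CPU-hour workload and the 0.7 scaling efficiency are purely illustrative assumptions, since real numbers depend heavily on the assembler, coverage, and read length.

```python
# Wall-clock comparison: one big machine vs. a cluster, for a
# hypothetical long-read correction + assembly workload.
# cpu_hours and efficiency are assumptions, not measurements.

def wall_clock_days(cpu_hours, cores, efficiency=0.7):
    """Days of wall-clock time, assuming imperfect parallel scaling."""
    return cpu_hours / (cores * efficiency) / 24

single_machine = wall_clock_days(50_000, cores=64)    # one 64-core node
cluster = wall_clock_days(50_000, cores=1024)         # 16 such nodes
print(round(single_machine), round(cluster))          # -> 47 3
```

Even with generous assumptions, a single 64-core machine turns a cluster-scale job into a wait of many weeks, which is the point made above.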
How about asking about existing resources at your local compute facility before buying anything yourself?
That would be my advice as well: spend the money on CPU time on a (huge) shared infrastructure instead. It will be much more flexible and efficient than investing in your own local machine (not even taking the maintenance, server room, etc. into account).
Would the transfer of a large amount of data to a shared infrastructure be a bottleneck when the connection is slow?
Depending on the amounts of data that need to be transferred, yes indeed that can be an issue.
On the other hand, the transfer cost might still be outweighed by the benefits of that solution. Moreover, if you're working with datasets that substantial, the extra "waiting" for data transfer will be marginal compared to the required runtime of the software itself.
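To put a rough number on the transfer concern, the time to ship raw reads to a remote cluster can be estimated from the data volume and link speed. The 5 TB dataset, 1 Gbit/s link, and 0.8 efficiency factor below are illustrative assumptions.

```python
# Estimated hours to move a dataset to a shared compute facility.
# data_tb, link_mbps, and efficiency are assumed example values.

def transfer_hours(data_tb, link_mbps, efficiency=0.8):
    """Hours to move data_tb terabytes over a link_mbps connection."""
    bits = data_tb * 1e12 * 8
    usable_bps = link_mbps * 1e6 * efficiency
    return bits / usable_bps / 3600

# 5 TB of reads over a 1 Gbit/s campus link:
print(round(transfer_hours(5, 1000), 1))  # -> 13.9 (hours)
```

Under these assumptions the transfer takes well under a day, which is indeed marginal next to an assembly that runs for days or weeks.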