Question

Bowtie multi-threads and SSD vs spindle disk performance

9.8 years ago

agbiotec • 0

Hello,

I was wondering if anyone knows of any studies (conference or journal papers, blog posts, or anything else) on bowtie or bowtie2 performance on multi-core systems (Core i7, 6 cores), and on how performance is affected by using SSDs vs spindle disks. Also, how much memory per core is ideal when running multi-threaded alignment against human genome indexes?

I am looking to build a box for the lab and am trying to figure out two things: how much memory I need per core, and whether there is a significant advantage to SSD drives (given that our sequencing requires terabytes of storage and SSDs are expensive!).

Also, if SSDs do offer something like a 100x speedup for bowtie, does the group have any ideas on how to combine them with large spindle storage? I am basically looking for a tower box for the lab, and while I could get an external spindle disk array and keep the SSD in the box, I want to avoid staging data back and forth.

hpc next-gen computing alignment • 4.4k views

Updated 3.0 years ago by Ram • written 9.8 years ago by agbiotec

What workflow will you be supporting downstream of bowtie2? I've found that the alignment steps are typically not my bottleneck. The answers could differ a lot depending on whether you are ultimately just doing alignments or are mostly doing genotyping/variant discovery, transcriptomics, etc. downstream.

SSDs will only speed up the portions of your workflow that are I/O-intensive. In my experience mappers tend to do a lot in memory before writing out SAM/BAM files, so I'm not sure you would see much speedup at the bowtie2 stage.
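To put rough numbers on that point, here is a minimal sketch using Amdahl's law: if only a fraction of the runtime is I/O, even a very fast disk gives a modest overall gain. The I/O fractions below are hypothetical values for illustration, not measurements of bowtie2, and the 100x figure is simply the one floated in the question.

```python
def overall_speedup(io_fraction: float, io_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the I/O part gets faster.

    io_fraction: share of total runtime spent on disk I/O (0.0 to 1.0).
    io_speedup:  how many times faster the I/O part becomes on the new disk.
    """
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must be between 0 and 1")
    if io_speedup <= 0:
        raise ValueError("io_speedup must be positive")
    return 1.0 / ((1.0 - io_fraction) + io_fraction / io_speedup)


# Hypothetical I/O shares; measure your own pipeline before deciding.
for frac in (0.05, 0.10, 0.50):
    print(f"{frac:.0%} of runtime in I/O -> {overall_speedup(frac, 100.0):.2f}x overall")
```

With 10% of the runtime in I/O, a 100x faster disk works out to only about 1.11x overall, which is why it matters where the time actually goes before paying for SSDs.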

9.8 years ago by DG

Thank you for your reply. Downstream it will be TopHat / Cufflinks (a typical pipeline for differential expression using RNA-seq data).

9.8 years ago by agbiotec

Ram · Answer 1 · 2015-11-17

Here is a paper on arXiv profiling speed-ups from SSDs in a variety of bioinformatics workflows, including RNA-Seq: http://arxiv.org/abs/1502.02223

And a post from Brad Chapman: http://bcb.io/2014/12/19/awsbench/. Brad's post isn't specific to SSDs, as it is a large-scale benchmark on Amazon using Docker containers, but Amazon's use of SSD storage on the backend for their high-end file systems is one factor in the speed there, so it might be worth reading. It also gives you some idea of what it costs to rent large clusters and storage from Amazon, which is something to seriously consider.

If you are looking at building a single machine to use in the lab, the first thing to consider is throughput. What size sequencer are you supporting, and is that sequencer in the lab or just one that you have an affiliation with and will be using a lot? If you're supporting a HiSeq in production you need some serious hardware investment, and a single server, no matter how large, just won't do. Even with the smaller machines (a MiSeq, for instance), how many runs you expect to do per month or per year factors into deciding how much storage space you really need.

In general I would recommend as much RAM as you can afford and as many processing cores as you can afford at a reasonable speed. You probably want at least 10-12GB of RAM per processing core as the minimum. Definitely Intel chips, the newer V3 specs if possible. Some tools have been compiled with the Intel compilers and can take advantage of the newer AVX instruction set, which offers significant speed-ups.
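As a quick worked example of that rule of thumb, the sketch below multiplies the per-core figure by a core count. The 6 cores come from the Core i7 mentioned in the question; the function name and defaults are just illustrative, and this simply assumes the minimum scales linearly with cores.

```python
def ram_range_gb(cores: int, low_per_core: int = 10, high_per_core: int = 12) -> tuple[int, int]:
    """Return the (low, high) minimum RAM in GB for a given core count.

    Defaults encode the 10-12GB per core rule of thumb from this answer.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cores * low_per_core, cores * high_per_core


# Assumption: the 6-core box described in the question.
low, high = ram_range_gb(6)
print(f"6 cores: plan for at least {low}-{high}GB of RAM")
```

For a 6-core box that comes to a floor of 60-72GB, so check how much RAM the motherboard you are considering can actually take.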

You're right that you probably want to tier your data storage. Use SSDs for the active processing and for storing all of the resource files (reference genome, etc.) that you constantly use read-only. You may even want these separate from one another. Then fill the system up with as much storage on regular spinning disk as you can afford, and decide what RAID level or other method of redundancy you want to use.

If you want to look at getting boat-loads of storage at a low price point I'd recommend a company called 45Drives: http://www.45drives.com/. 45Drives is the company that was working with the online storage company Backblaze to build ultra-dense storage solutions. They are becoming quite widely known in the genomics community as well, and a number of sequencing centres are using their gear.

Basically you can fit 45 drives (commodity drives you buy yourself, or they also resell Western Digital datacentre drives) into a single chassis. With expensive 8TB drives that's 360TB of raw storage. With more reasonably priced 4TB drives you can get up to 180TB. They don't currently support SSDs though, except I think as the two OS drives you can have in the system in addition to the storage array.

Full disclosure: I'm also a recent customer of theirs, but I have no other incentive or anything like that for recommending them. They do custom gear and builds as well, so you could always design a pretty hard-core computing server with massive amounts of storage in one box, or you could buy one off the rack as a very high-density NAS server connected to your processing server. I'm sticking three of their units into a cluster with other compute nodes myself.
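For a rough sense of the capacity numbers, here is a small sketch of the arithmetic. The raw figures match the ones above; the usable figure assumes, purely for illustration, a single array that gives up two drives' worth of space to parity (as RAID 6 does), and it ignores filesystem overhead and hot spares. The function names are made up for this example.

```python
def raw_capacity_tb(drive_count: int, drive_size_tb: int) -> int:
    """Raw capacity of a chassis: number of drives times size per drive."""
    return drive_count * drive_size_tb


def usable_capacity_tb(drive_count: int, drive_size_tb: int, parity_drives: int) -> int:
    """Capacity left after setting aside whole drives' worth of parity.

    Assumption: one array spanning all drives; real layouts often split the
    chassis into several smaller arrays, which costs more parity.
    """
    if not 0 <= parity_drives < drive_count:
        raise ValueError("parity_drives must be between 0 and drive_count - 1")
    return (drive_count - parity_drives) * drive_size_tb


for size_tb in (8, 4):
    raw = raw_capacity_tb(45, size_tb)
    usable = usable_capacity_tb(45, size_tb, parity_drives=2)  # hypothetical RAID 6-style layout
    print(f"45 x {size_tb}TB: {raw}TB raw, {usable}TB after two parity drives")
```

Whatever redundancy scheme you pick, budget against the usable number rather than the raw one.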