Help Choosing a Workstation for a New Bioinformatics Service
24 days ago
AresSanchz

Hello everyone,

I am setting up a new bioinformatics service at a research institute with 79 research groups, each potentially requiring different types of analyses. I'm currently in the process of selecting a workstation that can handle multiple high-throughput bioinformatics and biostatistics tasks efficiently and simultaneously.

Here are some key points I’m considering:

  • CPU: I'm thinking about a high core-count processor (e.g., AMD Threadripper or Intel Xeon). Are 16-32 cores sufficient, or should I aim higher?
  • RAM: Considering 128 GB of RAM, with the option to upgrade to 256 GB if needed. How much RAM would you recommend for handling typical omics analyses from multiple groups at once?
  • Storage: I’m planning to use 1-2 TB NVMe SSD for fast access and temporary files during analysis, along with 8-12 TB HDD (possibly in RAID) for long-term data storage. Does this seem like the right balance between speed and capacity, or should I be aiming for more SSD?
  • GPU: I’m unsure if investing in a high-end GPU (e.g., NVIDIA RTX 4090) is necessary. Is it worth the cost if we're not heavily focused on machine learning or deep learning applications right now?
  • Simultaneous Analysis: We expect to run several analyses at once (RNA-seq, metagenomics, etc.). How should this affect my choice of hardware specs?

If anyone has experience with setting up bioinformatics workstations for a similar scale, I’d really appreciate your advice on specs and whether there's anything crucial I'm missing. Also, would you recommend consulting a hardware specialist to ensure the specs meet our needs?

Thank you so much in advance for your insights!

NGS hardware workstation

Do I read it right that you are going to set up a single workstation for 79 research groups? That is not going to be a good solution for so many.


Hi! Yes, you read that right. But they want to start really slowly, beginning with just a few analyses for the few PIs (principal investigators) who are in the same building as me. For now, they're only giving me a regular office computer, so... I know it's not going to be easy. I guess when they see that I can't manage, they'll give me more support.


You don't say where you're based, but if it's in the EU, you most likely have access to free academic HPC and/or cloud resources. In any case, one machine isn't going to be enough for 79 groups unless they only use it very sporadically or they are OK with waiting until it's available.


Yes, it's EU-based, that's the thing. I'm trying to get access to the university servers, and in the meantime I'm looking for a workstation to get started. They don't have a bioinformatics service yet, and they're unsure how much it will be used, so I guess until all the PIs get accustomed to it, I won't have the amount of work that a typical 79-group institute would have.


If you're getting one machine for 79 research groups at an institute, and it's a success, then within a couple of years machine #2 will be on the way.

That's why I'd recommend learning Ansible to configure the machine. You develop a template or set of templates that can be applied to any number of machines. It's very useful and a prerequisite for setting up a cluster. Alternatives are Chef and Puppet.


Thanks for the suggestion. I'm not familiar with Ansible (I'm still a rookie), but I'll take a look and see if it could be a good option for this situation. I'll keep you posted!


Thank you all so much for your responses. I know I haven't provided much information, but I really don't have much more to share. These 79 groups have no bioinformatics support, so until now, any analyses they needed have been contracted out to external companies. The idea is to gradually introduce this service, and for now I will start by talking to about 10-15 researchers to see how I can help them.

Right now, we're trying to get support from a supercomputing group at a university, with the intention of working with their servers. That's why my question was about a fairly simple workstation that would allow me to get started. However, I also need something that will last me at least a year, because I don't want to request an $8K workstation that only lasts two months. In other words, since it's an investment, I want the workstation to be useful for a while.

I'm not sure if this is still not enough information, but the truth is that I'm such a newbie that I don't even know how much I can share in an open forum. Sorry for the confusion, and thank you again!


I don’t want to request an $8K workstation that only lasts two months.

The workstation will last a long while, well past its useful life, provided there are no hardware failures. It can always be used to connect to other resources as you get access to them off-campus. Generally, if something is going to fail, it will do so within the first year. After that, things should keep working well as long as the hardware is well taken care of.

The main take-home (in case it was not apparent from the discussion here) is that you can't do analyses that require resources beyond those available. @i.sudbery has given you good pointers on what is minimally needed to run one sample through various commonly used programs. So find the most intensive analysis you are likely to do on one sample and make sure your hardware meets that minimum requirement (with a healthy overhead capacity) to allow you to run some other smaller jobs while the main analysis is running.

If bioinformatics research support is going to be your primary responsibility, then consider joining the "bioinfo-core" organization (LINK) or ABRF. You will find resources, meet like-minded people, and get up to speed rapidly.

24 days ago

We have recently purchased a system that serves us very well.

The first thing you should max out is the memory; you should aim for 1 TB or more.

Many processes are memory-intensive, and you can't use your CPUs without enough memory.

Then, what I would recommend is setting up the hard drives with an equal amount of fast-access "scratch" space and a redundant (and hence somewhat slower) but reliable partition. We run our processes on the fast scratch space and store long-term files on the RAID partition. In general, educate your users that long-term backup is their responsibility, not yours. Each group can buy portable multi-TB drives cheaply, which will be massively cheaper than any kind of centralized backup.

The problem with 1TB partitions is that, depending on usage, they fill up extremely fast, and then it is a never-ending battle of clearing up space and reconciling competing needs.

You also want to max out the hard drive space as much as possible.

The number of CPUs is not usually the bottleneck. We have 96 cores, but we rarely run them all. I think other bottlenecks, for example I/O limits, would kick in well before we could utilize all the cores.

I have a chapter on this topic in the Biostar Handbook titled "How much computational power do you need?" (you would need book access to read it):

https://www.biostarhandbook.com/books/workflows/guides/how-much-power/

but here is a small relevant excerpt:

The cost of our HPC was 27,000 USD, planned to last 5 years. Thus, technically speaking, running it costs about 60 cents per hour (15 USD per day).

Considering that we used about a third of its computing resources, the cost of running a 30x human genome variant call on our HPC was around 1 dollar, and it finished in 7 hours.
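
As a quick sanity check of those figures, here is a minimal back-of-the-envelope sketch; the purchase price, lifetime, run time, and utilization share are the ones quoted above, and everything else is rounded.

```python
# Back-of-the-envelope cost of an owned machine, using the figures quoted above.
purchase_price_usd = 27_000        # total system cost
lifetime_years = 5                 # planned service life
hours_per_year = 365 * 24

hourly_cost = purchase_price_usd / (lifetime_years * hours_per_year)
print(f"~${hourly_cost:.2f}/hour, ~${hourly_cost * 24:.0f}/day")   # ~$0.62/hour, ~$15/day

# A 30x human variant-calling run that takes 7 hours and uses roughly a third
# of the machine's resources then costs on the order of:
job_cost = hourly_cost * 7 * (1 / 3)
print(f"~${job_cost:.2f} per 30x variant-calling run")             # ~$1.4
```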


I agree about memory being more important than cores - memory determines what you can and can't do, cores only determine how fast you can do it!

Some numbers that might be relevant (a rough capacity sketch based on them follows the list):

  • For RNA-seq, we run Salmon with 8 cores and 16 GB RAM, and a sample takes around 20 minutes to 1 hour. If an average experiment has 24 samples, then 32 cores/64 GB of RAM would theoretically allow quantification in around half a working day.
  • For STAR we use 8 cores and 32 GB of RAM, and that takes around 3-4 hours per sample. Non-splicing aligners are significantly quicker and use less memory.
  • For somatic SNP calling on the human genome we run GATK on 16 cores and 128 GB RAM, and a sample takes about 24 hours. I suspect the RAM here is overkill, but this is what we allocate.
  • CellRanger (for single-cell RNA-seq) requires 16 cores and 64 GB RAM per sample.
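
To turn those per-sample numbers into a rough capacity estimate for a whole experiment, here is a minimal sketch (a hypothetical helper, using the figures above as planning numbers; it ignores I/O contention and scheduling overhead):

```python
# Roughly how many jobs of a given type fit on one box, and how long a batch takes.
def batch_estimate(total_cores, total_mem_gb,
                   cores_per_job, mem_gb_per_job,
                   hours_per_job, n_samples):
    # Concurrency is capped by whichever resource runs out first.
    slots = min(total_cores // cores_per_job, total_mem_gb // mem_gb_per_job)
    if slots == 0:
        return None                          # a single job doesn't fit at all
    waves = -(-n_samples // slots)           # ceiling division
    return slots, waves * hours_per_job      # (concurrent jobs, total hours)

# Salmon (~8 cores, 16 GB, up to ~1 h/sample) on a 32-core / 64 GB machine,
# for a 24-sample experiment:
print(batch_estimate(32, 64, 8, 16, 1, 24))  # (4, 6): 4 at a time, ~6 h worst case

# STAR (~8 cores, 32 GB, ~4 h/sample) on the same machine is memory-bound:
print(batch_estimate(32, 64, 8, 32, 4, 24))  # (2, 48): ~2 days for the batch
```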

Thus, technically speaking, running it costs about 60 cents per hour (15 USD per day).

Does that number include electricity, cooling, data backups, and the cost of the time of the person doing the systems administration?


I expand on this point in the book: there are other substantial costs, but those might not manifest as direct costs.

I also mention that the cost of hosting the same computational resources on the cloud is over 100x higher (if not more).

I assumed the poster is already at an institution, and it seems they do not have to factor in the other infrastructure costs.

PS on backups: everyone should back up their own data. Over the years, I have learned that institution-wide backups of generated data are a waste of resources. The costs are staggering, and they are practically never needed. The data can be stored at the SRA after publication. Before publication, most groups can back up a lot of data on a 5TB portable drive.


For a newbie who has no idea how to administer a server/workstation, it can be quite the challenge. While this can be interesting to learn for a subset of people, it takes time away from productive research. It may also run afoul of local security policies. Many universities (and few companies are likely to allow non-IT users to do this anyway) now require that servers be administered by people who are vetted/certified for this role.

most groups can back up a lot of data on 5TB portable drive.

It would depend on what they are doing. A Complete Genomics machine can easily generate over 5TB of sequence data from a single run with all 4 flowcells.

everyone should back up their own data. Over the years, I learned that institution-wide backups of generated data is a waste of resources.

Easier said than done, and again likely not allowed by institutional policies. The NIH is now holding the parent institution of a PI with NIH grants responsible for managing data from those grants. Failure to store data (for up to 7 years) or to make it available can be subject to penalties. A specific data management and sharing plan is now mandatory for NIH grants.


The NIH is now holding the parent institution of a PI with NIH grants responsible for managing data from those grants. Failure to store data (for up to 7 years) or to make it available can be subject to penalties. A specific data management and sharing plan is now mandatory for NIH grants.

Same with UKRI in the UK, except it's 10 years, not 7.

We have used about 60TB of data in total, but we are a bioinformatics group, not a lab group using some bioinformatics. That's not including backup: the price we pay for all of our datastores includes automatic backup, so I guess we are using more like 120TB in total (as in, there are two copies of everything).

We could probably bring that down by about 5TB if we deleted copies of raw data that are now archived on the SRA, but much of it is storage of final and intermediate analysis products. In theory, those should be regenerable from the raw data and the code. In practice, it's much easier to store them, especially if you think the probability of going back to them is quite high.


This link states that the retention needs to be 3 years, not 7:

https://sharing.nih.gov/data-management-and-sharing-policy/data-management#length-of-time-to-maintain-and-make-data-available

and once something is published, the data are in the SRA, so they are basically backed up by NCBI.

I still think the best backup solution by far is a portable hard drive, especially as the data grows. Just get a larger drive.

A 16TB portable hard drive costs 250 dollars. Backing up 16TB on Amazon AWS costs $365 per month!

Edit: as I looked into it, there were many options for storage on Amazon; here are the approximate monthly costs for storing 16TB:

  1. S3 ~ $300
  2. S3 infrequent access ~ $165
  3. Glacier Flexible ~ $50
  4. Glacier Deep ~ $25

but I don't know exactly what these other options entail. Setting up a Deep Glacier would be cost-competitive with a portable drive.

But then, just as with sysadmins for your Linux system, you now need someone to set up and maintain the Deep Glacier storage.
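
To make the comparison concrete, here is a minimal sketch using only the approximate monthly figures quoted above (real AWS prices vary by region, change over time, and exclude retrieval and egress fees, which matter a lot for the Glacier tiers):

```python
# How many months of cloud storage for 16 TB one ~$250 portable drive buys,
# using the approximate monthly costs quoted above (not current AWS list prices).
drive_cost_usd = 250                 # one-time cost of a ~16 TB portable HDD

monthly_cost_16tb = {                # approximate USD/month, from the list above
    "S3 Standard":          300,
    "S3 Infrequent Access": 165,
    "Glacier Flexible":     50,
    "Glacier Deep Archive": 25,
}

for tier, monthly in monthly_cost_16tb.items():
    months = drive_cost_usd / monthly
    print(f"{tier:22s} ~${monthly:>3}/month -> drive pays for itself in {months:4.1f} months")
```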


Setting up a Deep Glacier would be cost competitive with a portable drive.

Tiers like Glacier Deep are for data that needs to be accessed only in case of a disaster, i.e. the equivalent of old-school tape backups without the need to maintain local hardware, buy tapes, etc. Done properly, critical backups would need to be in different zones (in the cloud) and/or even span providers (a copy with AWS/Google/MSFT, etc.).

you now need someone to set this thing up and maintain the Deep Glacier storage.

If you are backing up for an enterprise, then yes. But otherwise it should be technically less daunting for an individual to set up.

This link here states that the retention needs to be for 3 years

There is a subtle requirement that is easy to miss.

According to the NIH Grants Policy Statement, grantee institutions are required to keep NIH data for three years after the close of a grant or contract agreement

So that ends up being ~6-7 years, assuming the grant is not renewed. Large R01s often go through multiple renewals. Data containing PHI (which many institutions consider sequence data to be) need to be kept for 6 years after the subject signs a consent for research.


Interesting. The UK and EU policies are both 10 years after the end of the grant, so you generally need to budget for 13-15 years in total (this includes everything: lab notebooks, biological samples, constructs, etc., not just computational data). There is also an expectation that this will be organised and managed by your institution, not by the researchers themselves.


Cloud would be much more expensive if you ran it continually, the way you would run an on-prem system. But if, for example, you spec a machine to handle ChIP-seq/RNA-seq, and then occasionally have to rent cloud capacity to do whole-genome work, it can be cost-effective.


Thank you for the detailed advice, it's really helpful! I'll definitely look into maximizing memory and hard drive space, and the fast-access setup you suggested. Thank you so much for the chapter as well; I really need that help, so I'll study every point of it.

18 days ago
Darked89 4.7k

A few points:

  1. any system without a work scheduler (e.g. SLURM) can be maxed out (CPU, RAM, I/O) fairly quickly.

  2. mixing storage and computational loads on the same machine can have negative effects: people may just want to upload data or transfer results, which on a heavily loaded machine can be very slow. If the machine is overwhelmed with load or locked up => no storage access...

  3. even with a single research group and a server with SLURM on it, there is a tendency to "just test things" on the command line outside of the scheduler, in the worst case crashing the server. So a submission node with minimal resources, and no login on worker nodes outside of srun, can help.

  4. as pointed out above, the first step should be to investigate existing academic clusters/supercomputing centers. Often it makes far more sense to get access to an existing computing cluster than to hire a cluster admin with the lovely prospect of providing end-user support for tens of research groups.


While these are all good points, the question as asked is specifically about procuring a workstation. The OP says they are setting up a bioinformatics service, so perhaps they are the only intended user of this hardware serving those 79 groups. In that case, they may have better control over the hardware. Even then, unless expected usage is "light", the hardware being described would be easily overwhelmed by work from more than a few groups at the same time.

There was no mention of a budget, which ultimately would be the bottleneck/constraining factor.
