Hey,
The processing load does not look like it will be high at just 8 samples per week; I could handle that level of processing no problem on my £550 HP laptop. The questions you'd need to consider are obviously long-term storage and whether you have plans to scale up the sequencing.
If the server is just going to be dedicated to NextSeq processing, then something with 8 CPU cores, 16GB RAM, and 8TB of RAID storage (4 x 2TB disks) would be way more than sufficient. You could get away with 2 CPU cores and 8GB RAM, but processing would be slow. The higher spec would at least allow you to process samples simultaneously (ideally no more than 2 at a time).
The bigger problem is the installation of an automated pipeline that can be faithfully re-run, that can tolerate problems (which always occur), and that allows for flexibility. If you're dealing with patient data whose privacy has to be protected under law, you may have to install everything locally and ensure that no data is ever transmitted outside your domain. In the UK and Europe, regulations are more stringent on this than in the USA, but you'll have to check the pertinent laws where you're based. In the UK, for example, processing patient data as part of a health service on something like Amazon is actually against the law, as defined by both the UK government and the European Union. Research data is not subject to such stringent regulation.
In terms of how the pipeline is built, it can be written either as a Bash script or in Python. Other languages are obviously possible (like Java), but are less common. Such pipelines will typically also be capable of shifting data around between different servers and keeping logs of each analysis.
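To make that concrete, here is a minimal sketch in Python of the kind of wrapper I mean; everything in it is a placeholder (the run directory, the destination server, and the echo commands standing in for real demultiplexing/alignment/QC tools are all assumptions, not a recommendation of specific software), but it shows the pattern: run each step via the shell, log everything, and copy the results to another server at the end.

```python
#!/usr/bin/env python
"""Illustrative pipeline wrapper: run steps, log them, ship results.
All paths, tool commands and hostnames below are placeholders/assumptions."""
import logging
import subprocess
import sys
from datetime import datetime

RUN_DIR = "/data/nextseq/runs/run_001"       # hypothetical run directory
RESULTS_HOST = "archive-server:/backups/"    # hypothetical destination server

logging.basicConfig(
    filename=f"pipeline_{datetime.now():%Y%m%d_%H%M%S}.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def run_step(name, cmd):
    """Run one pipeline step via the shell; stop the run if it fails."""
    logging.info("START %s: %s", name, " ".join(cmd))
    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as err:
        logging.error("FAILED %s (exit code %s)", name, err.returncode)
        sys.exit(1)
    logging.info("DONE %s", name)

# Placeholder steps -- swap the echo commands for your real tools.
run_step("demultiplex", ["echo", "bcl2fastq would run here for", RUN_DIR])
run_step("alignment",   ["echo", "the aligner would run here for", RUN_DIR])
run_step("qc",          ["echo", "QC tools would run here for", RUN_DIR])

# In real use this step would copy results off to another server, e.g.:
#   run_step("archive", ["rsync", "-av", f"{RUN_DIR}/results/", RESULTS_HOST])
run_step("archive", ["echo", "would rsync results to", RESULTS_HOST])
```

In practice you'd replace the echo commands with your real tools and loop over samples, but even a skeleton like this gives you re-runnable steps and a timestamped log to audit.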
In the past, I have installed automated pipelines of this kind in the National Health Service (NHS) in England and also in private settings. Each situation requires a different level of customisation based on the end users' needs.
Trust this helps.
Edit: although RAID on the server's disks will provide some level of redundancy for the data, you may want to consider other forms of backup too. In the NHS lab where I was based, we did weekly tape-drive backups of all servers, but we're talking about ~8 servers of various types there. Ultimately, you'll need to (on top of everything else) define a data storage and backup strategy, and maintain a set of SOPs with future review dates.
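As one purely illustrative example of a backup step that could sit alongside tape backups, the sketch below (Python again, with assumed paths that are not from any real setup) tars up a results directory and records a checksum so the archive can be verified before a restore.

```python
#!/usr/bin/env python
"""Illustrative weekly backup sketch: tar a results directory and record an
MD5 checksum for later verification. Paths are placeholders/assumptions."""
import hashlib
import tarfile
from datetime import date
from pathlib import Path

SOURCE = Path("/data/nextseq/results")   # hypothetical data to back up
DEST = Path("/mnt/backup")               # hypothetical backup mount; must exist

archive = DEST / f"results_{date.today():%Y%m%d}.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(str(SOURCE), arcname=SOURCE.name)

# Stream the archive through MD5 so large files don't need to fit in memory.
md5 = hashlib.md5()
with open(archive, "rb") as fh:
    for chunk in iter(lambda: fh.read(1 << 20), b""):
        md5.update(chunk)

(DEST / f"{archive.name}.md5").write_text(f"{md5.hexdigest()}  {archive.name}\n")
print(f"Wrote {archive} ({md5.hexdigest()})")
```

Something like that can be scheduled weekly from cron, but it complements, rather than replaces, off-site or tape backups.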
"The bigger problem is the installation of an automated pipeline that can be faithfully re-run and that can tolerate problems (which always occur) and that allows much flexibility." I have been tackling this exact problem with my
snsxt
pipeline framework here. Due to the issues you mentioned, and the usage of our pre-existing institution's HPC, I had to basically build everything from scratch tailored to our system.Looks great - will take a look at that. I have not yet put my code on GitHub.
Pegasus and the Common Workflow Language also both appear to be promising tools for this task.