Constructing a server for a NextSeq in a clinical setting
7.1 years ago
timijan ▴ 20

Hi everybody,

I am working in an oncology diagnostics lab, where panel sequencing is done on a MiSeq. However, we are now in the process of purchasing a NextSeq, and we would like to build a server for data processing. We are planning to run Illumina's TruSight Tumor 170 kit, 8 samples per run, 1 run/week. We are wondering what kind of compute power (hardware) we need in order to process all of the data (bcl2fastq, alignment, variant calling, CNV analysis). We don't need a temporary solution, but rather a permanent, future-proof one.

The BaseSpace solution is sadly not an option for us, due to patient information in the files.

Thank you for all of your answers!

RNA-Seq next-gen nextseq • 2.6k views
7.1 years ago
GenoMax 147k

If you don't have access to in-house information technology support, then you may want to consider looking into the "BaseSpace Onsite" solution from Illumina. This would allow you to keep the data processing in house, but with the push-button convenience of being able to run downstream analyses seamlessly. Since you are going to run one standard type of data analysis, it should be directly supported by Illumina on BaseSpace.

If you do have in-house IT support, then BaseSpace Onsite can still be considered. In addition, you have the option of purchasing hardware in consultation with your IT. If they already have high-performance compute systems available, then leveraging those would be the way to go. In any case, IT can manage normal systems administration, data backup (assuming they provide that support) and security (which should be a big consideration). Your IT will have preferred vendors, so they can help save you some money on the hardware. The compute requirements for running bcl2fastq are specified here. Depending on the additional analysis you expect, adjust the hardware specs (2-3x, with adequate amounts of RAM) to provide some future-proofing.
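As an illustration of what the demultiplexing step itself looks like (the run-folder paths and thread counts here are hypothetical; split the threads to suit whatever hardware you settle on):

    # Minimal bcl2fastq sketch; -r/-p/-w split threads between reading,
    # processing and writing the BCL data.
    bcl2fastq \
        --runfolder-dir /data/runs/180101_NB501234_0001_AHXXXXXX \
        --output-dir    /data/fastq/180101_NB501234_0001_AHXXXXXX \
        --sample-sheet  /data/runs/180101_NB501234_0001_AHXXXXXX/SampleSheet.csv \
        -r 4 -p 8 -w 4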

7.1 years ago

Hey,

The processing load does not look like it will be high, at just 8 samples per week. I could handle that level of processing without a problem on my £550 HP laptop. One question you'd need to consider, obviously, is long-term storage, and whether there are plans for scaling up the sequencing.

If the server is just going to be dedicated to NextSeq processing, then something with 8 CPU cores, 16GB RAM, and 8TB (4x2TB) of RAID hard disks would be more than sufficient. You could get away with 2 CPU cores and 8GB RAM, but processing would be slow. The first quoted numbers would at least allow you to process samples simultaneously (best just 2 at a time, as sketched below).
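For example, one minimal way to cap processing at two samples at a time on such a machine (the per-sample script and paths are hypothetical):

    # Run a hypothetical per-sample pipeline script on every R1 FASTQ,
    # at most 2 samples at a time (-P 2), to stay within 8 cores / 16GB RAM.
    ls /data/fastq/run01/*_R1_001.fastq.gz \
        | xargs -n 1 -P 2 -I {} bash process_sample.sh {}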

The bigger problem is the installation of an automated pipeline that can be faithfully re-run, that can tolerate problems (which always occur), and that allows a lot of flexibility. If you're dealing with patient data whose privacy has to be protected under law, you may have to install everything locally and ensure that no data is ever transmitted outside your domain. In the UK and Europe, they are more stringent on this than in the USA, but you'll have to check the pertinent laws where you're based. In the UK, for example, processing patient data as part of a health service on something like Amazon is actually against the law, as defined by both the UK government and the European Union. Research data is not under such stringent regulation.

In terms of how the pipeline is built, it can be either a BASH script or Python. Others are obviously possible (like Java), but are less common. These pipelines will typically also be capable of shifting data around different servers and keeping logs of each analysis; a minimal BASH skeleton is sketched below.
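As a minimal sketch of what such a BASH skeleton might look like (the paths and the per-sample wrapper are hypothetical; the point is the strict error handling and the per-run log, not the specific tools):

    #!/usr/bin/env bash
    # Minimal run-level pipeline skeleton with error handling and logging.
    set -euo pipefail

    RUN_DIR="$1"                               # e.g. /data/runs/<run_folder> (hypothetical)
    OUT_DIR="/data/analysis/$(basename "$RUN_DIR")"
    LOG="$OUT_DIR/pipeline.log"

    mkdir -p "$OUT_DIR"
    exec > >(tee -a "$LOG") 2>&1               # keep a log of each analysis

    echo "[$(date)] demultiplexing"
    bcl2fastq --runfolder-dir "$RUN_DIR" --output-dir "$OUT_DIR/fastq"

    echo "[$(date)] per-sample alignment / variant calling"
    for fq in "$OUT_DIR"/fastq/*_R1_001.fastq.gz; do
        bash process_sample.sh "$fq" "$OUT_DIR"    # hypothetical per-sample wrapper
    done

    echo "[$(date)] finished"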

In the past, I have installed automated pipelines of this kind in the National Health Service (NHS) in England and also in the private setting. Each situation requires different levels of customisation based on the end users' needs.

Trust this helps.

Edit: although RAID in the server disks will ensure some level of redundancy for the data, you may want to consider other forms of backup too. In the NHS lab where I was based, we did weekly tape-drive backups of all servers, but we're talking about ~8 servers of various types there. Ultimately, you'll need to (on top of everything else) define a data storage and backup strategy, and maintain a list of SOPs with future review dates.
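As one simple off-server pattern for that (the destination host and paths are hypothetical; tape or other media follow the same idea):

    # Weekly off-server copy of finished analyses, e.g. driven by a cron entry such as:
    #   0 2 * * 0  /usr/local/bin/backup_runs.sh
    rsync -av /data/analysis/ backupserver:/backups/nextseq/analysis/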


"The bigger problem is the installation of an automated pipeline that can be faithfully re-run and that can tolerate problems (which always occur) and that allows much flexibility." I have been tackling this exact problem with my snsxt pipeline framework here. Due to the issues you mentioned, and the usage of our pre-existing institution's HPC, I had to basically build everything from scratch tailored to our system.


Looks great - will take a look at that. I have not yet put my code on GitHub.


Pegasus and the Common Workflow Language both also appear to be promising tools for this task.

7.1 years ago
steve ★ 3.5k

Our lab does exactly this.

Our institution already had an HPC cluster (64 nodes with 32 cores and 256GB RAM each), with large network-attached storage, so we are using that for all the data storage and processing. Data from the NextSeq goes directly to the network storage. From there, we run all the demultiplexing and analysis ourselves on the compute cluster (details here).

As for the required compute power, this is a rough estimate of how we have it configured (a job-submission sketch follows the list):

  • bcl2fastq: one compute job with ~8 cores, takes ~4-6 hours to complete

  • alignment, unpaired variant calling, CNV analysis: one compute job per sample with 8-16 cores; it takes ~24 hours for them all to finish, running every sample in parallel. We typically have 24 samples per run, so in total that's 8-16 x 24 CPU threads.

  • paired tumor-normal variant calling: one compute job per sample with 1 core each (running single-threaded for this); it takes ~30+ hours to complete when running all samples in parallel (so 24 CPU threads total). We are currently exploring methods in the thread linked there to parallelize this, which would reduce the total time to ~6 hours but require 25 CPU threads per sample (600 threads total).
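A hedged sketch of how per-sample jobs like these might be submitted, assuming a SLURM scheduler (the sample list, wrapper script and resource numbers are hypothetical, mirroring the rough figures above):

    # Submit one alignment/variant-calling job per sample.
    for sample in $(cat samples.txt); do
        sbatch --cpus-per-task=16 --mem=32G --time=24:00:00 \
               --job-name="align_${sample}" \
               --wrap="bash analyze_sample.sh ${sample}"   # hypothetical per-sample wrapper
    done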

In regards to data storage, each NextSeq run is roughly ~80GB, and the analysis produces 350GB+, though this can be pared down afterwards by removing intermediary files (we still keep them). In total this comes out to ~500GB+ of data per run.
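At the original poster's 1 run/week, that per-run footprint adds up quickly; as a back-of-the-envelope illustration only:

    # Rough yearly storage estimate at 1 run/week and ~500GB per run.
    echo "$((52 * 500)) GB per year"   # ~26,000 GB, i.e. ~26TB/year before pruning intermediates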

We had originally debated getting a dedicated HPC or server just for this, but it was decided that the overhead of having to manage it would be too much, and that it was better to utilize the one we already had available through our institution.

