Question

What is the best way to transfer HiSeq data from the instrument to compute resources and/or storage?

5

9.8 years ago

Keith Callenberg ▴ 960

Since Illumina's systems are Windows-based, it seems the options for mounting the drives and automating the transfer of sequencing data are a bit more limited.

How do you move your data to your cluster or compute resources? Do you add a Samba mount to your cluster on the HiSeq? A cifs-based mount on your cluster? Use scp/sftp or an intermediate Linux machine? Do you see any I/O issues resulting in missing files with your process?

And do you save the data directly on your mount as it is generated by the sequencer, or do you transfer it after the run is complete?

I am looking for a reliable and automatable method that could be part of a clinical pipeline that processes specimens daily.

While this may seem like just an IT question, I think the NGS context is important because of the size and type of data so it would be lost on Stack Overflow. Thanks in advance!

next-gen sequencing HiSeq • 3.2k views

updated 2.6 years ago by Ram 44k • written 9.8 years ago by Keith Callenberg ▴ 960

0

We use scp. The data transfer takes a whole night. I am also looking for a better solution.

9.8 years ago by GouthamAtla 12k

Ram · Answer 1 · 2015-04-01

We had a heck of a time with this. We don't archive the Thumbnail_Images (too big and not needed for fastq generation), and most of the remaining data (more than 95%) is in bcl format. So I'd suggest rsyncing the bcl files as they are completed, but there is a caveat: some of the bcl and ancillary stat files are not yet complete when they are first created. Fortunately, Illumina creates sentinel files that can be used to detect when the bcl and stat files are complete. This fact does not seem to be well publicized, and I can no longer find the document (it's called HCS 1.4/RTA 1.2 Theory of Operation, and I have a hard copy). The sentinel files should be located under

Computer/${drive}/Illumina/HiSeqTemp

When a run starts, a ${flowcell_dir} is created in this directory. After the base call file is created in the cycle directory AND quality scores are later added to it (at which point the file is complete), an empty file called Computer/${drive}/Illumina/HiSeqTemp/${flowcell_dir}/Processed/L00?/CX.1/${filename_header}.qms is touched. Once this file has been touched, we know we can sync the following files in Computer/${drive}/BCLDATA_${side}/${flowcell_dir}/Data/Intensities/BaseCalls/L00?/CX.1:

${filename_header}.bcl
${filename_header}.stat
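
As a rough illustration of that rule (not our production code), here is a minimal Python sketch. The function name `completed_files` and its two arguments are made up for this example; it assumes the Processed tree and the BaseCalls tree use the same L00?/CX.1 layout described above, and that both are reachable as ordinary paths from wherever the script runs.

```python
from pathlib import Path


def completed_files(processed_dir, basecalls_dir):
    """Return the bcl/stat paths whose .qms sentinel has been touched.

    processed_dir: .../HiSeqTemp/<flowcell_dir>/Processed   (hypothetical argument)
    basecalls_dir: .../<flowcell_dir>/Data/Intensities/BaseCalls
    """
    processed_dir = Path(processed_dir)
    basecalls_dir = Path(basecalls_dir)
    ready = []
    # Sentinels are expected at Processed/L00?/CX.1/<filename_header>.qms
    for sentinel in sorted(processed_dir.glob("L00?/C*.1/*.qms")):
        # Same lane/cycle sub-path, but under the BaseCalls tree
        cycle_dir = sentinel.parent.relative_to(processed_dir)
        for suffix in (".bcl", ".stat"):
            candidate = basecalls_dir / cycle_dir / (sentinel.stem + suffix)
            if candidate.is_file():
                ready.append(candidate)
    return ready
```

One way to use this would be to write the returned paths, relative to the run directory, into a text file and hand that to rsync via its `--files-from` option, rerunning periodically until the run finishes.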

So you can use this as a hook to know, in real time, when the bcl data is complete, which covers the majority of the heavy lifting. Then you just have to ensure it copies over properly... no small task. Right now, we use Globus.
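
If you are not using a transfer tool that verifies files for you, one generic way to check that the copy went over properly is to compare checksums on both sides. This is only a sketch of that idea, not how our Globus setup works; the function names `sha256sum` and `find_bad_copies` are made up for this example, and it assumes both the source and destination trees are mounted where the script runs.

```python
import hashlib
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Hash a file in chunks so large bcl files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_copies(source_root, dest_root, relative_paths):
    """Return the relative paths that are missing or differ at the destination."""
    source_root, dest_root = Path(source_root), Path(dest_root)
    bad = []
    for rel in relative_paths:
        src, dst = source_root / rel, dest_root / rel
        if not dst.is_file() or sha256sum(src) != sha256sum(dst):
            bad.append(rel)
    return bad
```

Anything returned by `find_bad_copies` can be queued for another transfer attempt. Only run it on files that are already complete on the instrument side, otherwise a file that is still being written will show up as a mismatch.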