Question

Forum: NGS Data Storage Best Practices (Clinical)

2

5.3 years ago

Robert Sicko ▴ 640

I'm in a lab that does clinical and research NGS and we have daily clinical MiSeq runs. I want to start purging raw MiSeq results to save space.

We know we want to retain raw data for a while. I'm unsure whether I should write a script that copies the fastqs to an archive directory, or a script that removes the BCLs, since they are the largest files. I also don't know if there is value in retaining each project's subdirectory structure and the other, smaller files (.xml, .log, .jpg, .locs, .txt), which together add up to ~25% of the size of the BCL files.

I'm also curious what NGS files/results everyone stores and for what period of time.

I've found some guidance here and here

But I'm curious to know what actually occurs in practice.

Thanks,
Bob

NGS clinical storage • 3.5k views

ADD COMMENT • link updated 21 months ago by Ram 45k • written 5.3 years ago by Robert Sicko ▴ 640

score 1 · Answer 1 · 2019-12-13

You should at least store RunInfo.xml and runParameters.xml. The InterOp folder is also small and might be important, at least until you are sure that the run is OK. Even beyond that, the historical perspective could be interesting.

It would also be nice if Illumina could get their act together so that it would be possible to delete everything in BaseSpace except the InterOp folder, i.e. retain only the charting for every old run. Currently it still keeps tons of log files, even if you delete the run data but keep the record. I guess they want to earn some money from storage rental.

score 1 · Answer 2 · 2019-12-13

1

5.3 years ago

GenoMax 150k

You should check with your clinical lab's director about what needs to be done with clinical sequence data. If it has become part of a patient's EMR, then it may need to be treated the same way as other electronic diagnostic data. That may include the whole data folder, even if you don't like it.

In the grand scheme of things, original MiSeq data folders are not that big when they are tar-compressed (~30G for the longest runs), so keeping a backup of that tar file would be your best option. If you ever need to reprocess or reproduce the data, this would be your insurance policy.
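As a rough illustration of the tar-compressed backup described above, here is a minimal Python sketch. The function names and the `<run name>.tar.gz` naming are assumptions made for the example, not anything prescribed in this thread; the checksum is there only so a copy of the archive can be verified later.

```python
import hashlib
import tarfile
from pathlib import Path


def archive_run(run_dir: Path, archive_dir: Path) -> Path:
    """Write <run name>.tar.gz into archive_dir and return its path.

    archive_dir must live outside run_dir, otherwise the archive
    would try to include itself.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    archive_path = archive_dir / f"{run_dir.name}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the run folder name as the top-level entry.
        tar.add(run_dir, arcname=run_dir.name)
    return archive_path


def archive_file_count(archive_path: Path) -> int:
    """Count regular files stored in the archive (a cheap sanity check)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sum(1 for member in tar.getmembers() if member.isfile())


def sha256sum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `archive_file_count` against the number of files in the run folder, and recording the checksum next to the archive, would be one way to gain confidence in the backup before any raw data is removed.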

ADD COMMENT • link 5.3 years ago by GenoMax 150k

0

Our director has advised storing fastqs in case we need to rerun downstream analysis etc., but she was OK with removing raw data after 2-3 months. We are a screening lab, so we're in a weird space: we're not technically a diagnostic lab, but we use molecular tests and are validating more NGS that is screening but definitely knocking on diagnostics' door. All that to say, we retain a record of the Sanger-validated variants that we report to physicians, but the raw sequencing results are not part of the EMR.

My current thought, based on the discussion here, is: dump raw MiSeq data on our server, automate bcl2fastq (a cron job looking for complete runs every couple of hours), queue data processing, and finally run archival (either a weekly cron job or a manual step that creates a tar archive of the fastqs and InterOp folder data and then removes raw MiSeq data older than 2 months).
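The plan above could be sketched roughly as follows in Python. Several things here are assumptions made for illustration: that a finished run can be recognised by an `RTAComplete.txt` marker file in the run folder, that "2 months" means 60 days, that archives are named `<run name>.tar.gz`, and all of the function names. The `--runfolder-dir` and `--output-dir` options are bcl2fastq2 options, but check them against the version you have installed. Deletion defaults to a dry run.

```python
from __future__ import annotations

import shutil
import time
from pathlib import Path

COMPLETE_MARKER = "RTAComplete.txt"  # assumption: present once the run has finished
RETENTION_DAYS = 60                  # assumption: "older than 2 months"


def complete_runs(raw_root: Path) -> list[Path]:
    """Run folders under raw_root that contain the completion marker."""
    return sorted(
        d for d in raw_root.iterdir()
        if d.is_dir() and (d / COMPLETE_MARKER).is_file()
    )


def bcl2fastq_command(run_dir: Path, fastq_dir: Path) -> list[str]:
    """Command line for FASTQ generation, e.g. to pass to subprocess.run."""
    return ["bcl2fastq", "--runfolder-dir", str(run_dir),
            "--output-dir", str(fastq_dir)]


def purgeable_runs(raw_root: Path, archive_dir: Path,
                   now: float | None = None,
                   retention_days: int = RETENTION_DAYS) -> list[Path]:
    """Complete runs past the retention window that already have an archive."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    eligible = []
    for run_dir in complete_runs(raw_root):
        finished = (run_dir / COMPLETE_MARKER).stat().st_mtime
        archived = (archive_dir / f"{run_dir.name}.tar.gz").is_file()
        if finished < cutoff and archived:
            eligible.append(run_dir)
    return eligible


def purge(runs: list[Path], dry_run: bool = True) -> None:
    """Remove run folders; with dry_run=True only report what would go."""
    for run_dir in runs:
        if dry_run:
            print(f"would remove {run_dir}")
        else:
            shutil.rmtree(run_dir)
```

Requiring an existing archive before a run counts as purgeable is a deliberate choice in this sketch: a failed or skipped archival step then leaves the raw data in place instead of silently losing it.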

ADD REPLY • link 5.3 years ago by Robert Sicko ▴ 640

0

Our director has advised storing fastqs in case we need to reprocess

That is useful only for downstream analysis.

As @Ido suggested, you could keep a tar archive of the InterOp directory and the *.xml files. This allows a limited QC examination using Illumina's Sequencing Analysis Viewer software if a question about data quality ever arises. I would suggest storing this in a separate archive.
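A small QC-only archive along these lines might look like the Python sketch below. It assumes the *.xml files of interest (RunInfo.xml, runParameters.xml) sit at the top level of the run folder next to InterOp; the `_qc.tar.gz` suffix and the function names are made up for the example.

```python
from __future__ import annotations

import tarfile
from pathlib import Path


def qc_members(run_dir: Path) -> list[Path]:
    """Top-level *.xml files plus the InterOp directory, if present."""
    members = sorted(run_dir.glob("*.xml"))
    interop = run_dir / "InterOp"
    if interop.is_dir():
        members.append(interop)
    return members


def archive_qc(run_dir: Path, archive_dir: Path) -> Path:
    """Write <run name>_qc.tar.gz holding only the QC-relevant files."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    archive_path = archive_dir / f"{run_dir.name}_qc.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        for member in qc_members(run_dir):
            # Store paths relative to the parent so the run name is kept.
            arcname = member.relative_to(run_dir.parent).as_posix()
            tar.add(member, arcname=arcname)
    return archive_path
```

Keeping the run folder name as the top-level directory inside the archive means the extracted result has the same layout as a run folder with only these files left in it.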

ADD REPLY • link 5.3 years ago by GenoMax 150k

0

True, I edited my comment for clarity. Yes, we would only be saving the fastqs in case we need to call variants with an updated pipeline or something similar. Most of the reasons we've seen or that I can think of for reprocessing from BCL files (sample sheet error, process failure during fastq generation) would be discovered and fixed in the days to a week following MiSeq run completion, while we are validating the identified variants and summarizing run results. Thanks, that is helpful. Have you run into cases where the raw BCL files were necessary or helpful a couple of months after the final clinical results were complete?

ADD REPLY • link 5.3 years ago by Robert Sicko ▴ 640

1

We don't do any clinical sequencing. We do store compressed raw sequencing data folders (and fastqs) per local policy.

ADD REPLY • link 5.3 years ago by GenoMax 150k