I'm in a lab that does clinical and research NGS and we have daily clinical MiSeq runs. I want to start purging raw MiSeq results to save space.
We know we want to retain raw data for a while. I'm unsure whether I should just create a script that copies the fastqs to an archive directory, or a script that removes the BCLs, since they are the largest files. I don't know if there is value in retaining the subdirectory structure of each project and the other, smaller files (.xml, .log, .jpg, .locs, .txt; together they add up to ~25% of the size of the BCL files).
I'm also curious what NGS files/results everyone stores and for what period of time.
I've found some guidance here and here
But I'm curious to know what actually occurs in practice.
Thanks,
Bob
Our director has advised storing fastqs in case we need to rerun downstream analysis, but she was OK with removing raw data after 2-3 months. We are a screening lab, so we're in a weird space: we're not technically a diagnostic lab, but we're using molecular tests and validating more NGS assays that are screening but definitely knocking on diagnostics' door. All that to say, we retain a record of the Sanger-validated variants that we report to physicians, but the raw sequencing results are not part of the EMR. My current thought, based on the discussion here, is: dump raw MiSeq data on our server; automate bcl2fastq (a cron job that looks for complete runs every couple of hours); queue data processing; and finally run archival (either a weekly cron job or a manual step that creates a tar archive of the fastqs and InterOp folder data, then removes raw MiSeq data older than 2 months).
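For what it's worth, the archival step described above could be sketched roughly as follows. This is only a minimal sketch under several assumptions, not a tested production script: it assumes each run folder contains an `RTAComplete.txt` file once the instrument has finished writing (worth verifying on your own MiSeq output), that fastqs end in `.fastq.gz`, and that the 60-day cutoff stands in for "older than 2 months". The function names (`is_purgeable`, `archive_run`, `purge_candidates`) are made up for illustration. It deliberately only lists purge candidates and does not delete anything.

```python
import tarfile
import time
from pathlib import Path
from typing import List, Optional

# Assumption: this file appears in the run folder once the run is complete.
COMPLETE_MARKER = "RTAComplete.txt"


def is_purgeable(run_dir: Path, max_age_days: float,
                 now: Optional[float] = None) -> bool:
    """True if the run is complete and finished more than max_age_days ago."""
    marker = run_dir / COMPLETE_MARKER
    if not marker.is_file():
        return False  # incomplete (or unrecognised) run: never a candidate
    now = time.time() if now is None else now
    age_days = (now - marker.stat().st_mtime) / 86400
    return age_days > max_age_days


def archive_run(run_dir: Path, archive_dir: Path) -> Path:
    """Write a tar of the fastqs, top-level *.xml files and InterOp folder.

    Plain tar ("w") is used because fastq.gz files are already compressed.
    BCLs and other raw files are not included.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    out = archive_dir / f"{run_dir.name}.tar"
    keep = sorted(run_dir.rglob("*.fastq.gz")) + sorted(run_dir.glob("*.xml"))
    interop = run_dir / "InterOp"
    with tarfile.open(out, "w") as tar:
        for path in keep:
            rel = path.relative_to(run_dir).as_posix()
            tar.add(path, arcname=f"{run_dir.name}/{rel}")
        if interop.is_dir():
            tar.add(interop, arcname=f"{run_dir.name}/InterOp")
    return out


def purge_candidates(root: Path, max_age_days: float = 60) -> List[Path]:
    """List run folders under root that are complete and past the cutoff."""
    return sorted(d for d in root.iterdir()
                  if d.is_dir() and is_purgeable(d, max_age_days))
```

A weekly cron job could call `archive_run` on each folder returned by `purge_candidates`, check that the archive opens and contains the expected files, and only then remove the raw run folder as a separate, logged step.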
That is useful only for downstream analysis.
As @Ido had suggested, you could keep a tar archive of the InterOp directory and *.xml files. This allows for a limited QC examination using Illumina's Sequencing Analysis Viewer software if a question about data quality arises. I would suggest storing this in a separate archive.
True, I edited my comment for clarity. Yes, we would only be saving the fastqs in case we needed to variant call with an updated pipeline or something similar. Most of the reasons we've seen, or that I can think of, for reprocessing from bcl files (sample sheet error, process failure during fastq generation) would be discovered and fixed in the days to a week following MiSeq run completion, while we are validating the variants identified and summarizing run results.
Thanks, that is helpful. Have you run into cases where the raw bcl files were necessary or helpful a couple of months after final clinical results were complete?
We don't do any clinical sequencing. We do store compressed raw sequencing data folders (and fastqs) per local policy.