In our lab, members routinely generate NGS data, and we also receive sequences from other labs. The sequences are stored on a NAS server. However, there is no centralized scheme for storing them, so over time many sequences become "orphaned": it's not clear which experiment they belong to, when they were generated, or who generated them.
A low-tech approach (keeping a spreadsheet of the metadata) seems unenforceable and error-prone. It would be nice to have a tool that integrates with the NAS's file system and stores metadata about the sequences. Is there such a tool? And how do you manage your sequence libraries?
As Peter Cock suggested:
A nice idea would be to put your reads into an unsorted BAM instead of a standard FASTQ: you can store any amount of metadata in the BAM header, e.g. in @RG read-group fields or free-text @CO comment lines.
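For example, with pysam (assuming it's installed), something like the sketch below writes reads into an unmapped BAM whose header carries the run metadata. The sample name, date, and @CO fields are made-up placeholders, not a prescribed scheme:

    # Minimal sketch: write reads to an unsorted, unmapped BAM whose
    # header holds the experiment metadata. All tag values are placeholders.
    import pysam

    header = {
        "HD": {"VN": "1.6", "SO": "unsorted"},
        "RG": [{"ID": "run42", "SM": "sampleA", "DT": "2013-05-01",
                "PL": "ILLUMINA", "CN": "our_lab"}],
        # Free-text @CO lines can hold anything else, e.g. who generated the run.
        "CO": ["experiment:knockout_timecourse", "generated_by:jdoe"],
    }

    with pysam.AlignmentFile("reads.unmapped.bam", "wb", header=header) as out:
        # Stand-in for a real FASTQ parser: one hard-coded read.
        for name, seq, qual in [("read1", "ACGT", "IIII")]:
            a = pysam.AlignedSegment()
            a.query_name = name
            a.query_sequence = seq  # set sequence before qualities
            a.query_qualities = pysam.qualitystring_to_array(qual)
            a.flag = 4               # 0x4: segment unmapped
            a.reference_id = -1      # no reference sequence
            a.reference_start = -1
            a.set_tag("RG", "run42") # link each read to the read group above
            out.write(a)

Reading the metadata back is just a matter of opening the file with pysam.AlignmentFile("reads.unmapped.bam", "rb", check_sq=False) and inspecting its .header (check_sq=False is needed because an unmapped BAM has no @SQ lines).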
See also the reply from the GATK developers confirming that they use unmapped SAM/BAM in production and recommend it over FASTQ in their Best Practices documentation.
I agree. I haven't done this myself yet, as I'm just getting things up and running for both my own lab and a clinical NGS service, but I'm definitely planning on keeping unsorted SAM/BAM for long-term retention of the "raw" sequence data instead of the original FASTQ files. Being able to encode the metadata in the file itself is the main reason. I will also be running a samples database, though.
Hi guys. Yes, I agree that FASTQ could be replaced by BAM. I'm not doing it, since I don't want to add an extra step to the data processing (i.e. more "bureaucracy") and some tools expect FASTQ anyway, but I would happily have sequencers spit out BAMs instead of FASTQ. I think the Sanger Institute produces BAM instead of FASTQ by default. In any case, as Dan Gaston says, I would run a database.
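For the database part, even plain SQLite is enough to start with. A minimal sketch using only the Python standard library; the table layout and field names are just illustrative, not a recommended schema:

    # Minimal sketch of a "samples database": one table mapping each file
    # on the NAS to its metadata. Column names here are illustrative.
    import sqlite3

    con = sqlite3.connect("sequences.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS sequence_files (
            path         TEXT PRIMARY KEY,  -- location on the NAS
            sample       TEXT NOT NULL,
            experiment   TEXT,
            generated_on TEXT,              -- ISO date
            generated_by TEXT,
            md5          TEXT               -- checksum to spot moved/duplicated files
        )
    """)
    con.execute(
        "INSERT OR REPLACE INTO sequence_files VALUES (?, ?, ?, ?, ?, ?)",
        ("/nas/runs/run42/reads.unmapped.bam", "sampleA",
         "knockout_timecourse", "2013-05-01", "jdoe",
         "d41d8cd98f00b204e9800998ecf8427e"),
    )
    con.commit()

With something like this in place, "orphaned" files are simply NAS paths that have no row in the table, which a periodic scan can flag.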