I work in a lab that has done a lot of sequencing. Over time we have sequenced many samples on different platforms (Illumina, PacBio), with different read lengths (50-150bp), different library preparation methods (tagmentation, sonication), at different sequencing centers, and for different types of sequencing (RNA-Seq, WGS, small RNA, etc.).
At this point, we have terabytes of sequencing data and it is becoming unwieldy. I have a variety of questions about how to manage sequence data and how to use it.
- How do you organize your sequencing data?
- What file structure do you use?
- How do you retain metadata regarding the sequencing data?
- How do you back up your sequence data?
- How do FASTQs get fed into your bioinformatic pipelines?
It would be great to get a discussion going in this regard. I haven't seen too many questions on this subject (please direct me if I am mistaken!), but I imagine it is a problem many have to deal with.
Thanks!
This discussion can be broadly split into two categories.
The requirements for the two are going to be very different, so it may be best to say in your answer which category you are referring to.
Use NCBI SRA as your backup (in addition to other things: tape, disks, cloud, etc.)? Let NCBI take care of it. Even unpublished data can be uploaded under an embargo until the publication is out.
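Whichever backup target you pick, it helps to record a checksum for every FASTQ before copying it anywhere, so you can later confirm that the remote copy matches. Here is a minimal sketch of that idea in Python. The function names (`md5sum`, `build_manifest`, `write_manifest`), the file patterns, and the tab-separated manifest layout are my own assumptions for illustration, not a format that SRA or any other archive requires.

```python
import csv
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir, patterns=("*.fastq", "*.fastq.gz", "*.fq", "*.fq.gz")):
    """List relative path, size in bytes, and MD5 for every FASTQ under data_dir.

    The patterns are an assumption; adjust them to your own naming scheme.
    """
    root = Path(data_dir)
    files = sorted({p for pattern in patterns for p in root.rglob(pattern)})
    return [
        {
            "path": p.relative_to(root).as_posix(),
            "bytes": p.stat().st_size,
            "md5": md5sum(p),
        }
        for p in files
    ]


def write_manifest(rows, out_path):
    """Write the manifest as a tab-separated file (hypothetical layout)."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["path", "bytes", "md5"], delimiter="\t"
        )
        writer.writeheader()
        writer.writerows(rows)
```

Running `write_manifest(build_manifest("/path/to/sequencing"), "manifest.tsv")` (the path is a placeholder) gives one table you can keep next to the data and re-check after each copy to tape, disk, or cloud.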
See also this old post: How Do You Manage Your Files & Directories For Your Projects? (8 years ago?!)