Hi! I work with data compression, but I don't have much experience running genomic analysis pipelines like GATK, and I want to get to the bottom of a genomic data storage question. In one of the GATK pipelines that I ran, multiple BAM/SAM files were generated during pre-processing, and then several VCF files were generated during variant discovery. Most of these were only used in the immediately following step of the pipeline. Is it common practice to store all of these files in case a need arises to revisit some of the intermediate results? Any information about this is very much appreciated, thank you!
I generally use GNU Make, which deletes intermediate files automatically once the final targets are built, and that saves space. The pipeline can always be rerun to regenerate them. Keeping every intermediate file often just takes too much space. If I feel that some intermediate files are too important to lose, storing them in CRAM format is a good option. I also tend to keep the raw BAM produced by the aligner and delete all other intermediate files. Make can also be told not to delete chosen intermediate files (for example with the .PRECIOUS or .SECONDARY special targets).
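To make that concrete, here is a minimal sketch of what such a Makefile could look like. It is only an illustration under several assumptions that are not from the post above: a single sample with single-end reads in sample1.fastq.gz, a reference ref.fa that is already indexed for bwa, samtools and GATK, and bwa, samtools and gatk available on the PATH. The file names and the exact steps are hypothetical; your pipeline will differ. Recipe lines must start with a tab.

```makefile
# Hypothetical names; adjust to your own samples and reference.
REF := ref.fa

# Only the final outputs are named explicitly. Everything else is built
# through the chain of pattern rules below, so GNU Make treats it as
# intermediate and removes it once the goals are up to date.
.PHONY: all
all: sample1.vcf.gz sample1.dedup.cram

# Keep the raw BAM from the aligner even though it is an intermediate.
.PRECIOUS: %.raw.bam

# Align reads; the read group is needed later by GATK.
%.raw.bam: %.fastq.gz
	bwa mem -R '@RG\tID:$*\tSM:$*\tPL:ILLUMINA' $(REF) $< | samtools view -b -o $@ -

# Coordinate-sort (intermediate, deleted automatically).
%.sorted.bam: %.raw.bam
	samtools sort -o $@ $<

# Mark duplicates (intermediate, deleted automatically).
%.dedup.bam: %.sorted.bam
	gatk MarkDuplicates -I $< -O $@ -M $*.dup_metrics.txt

# Archive an important intermediate as CRAM before its BAM is removed.
%.cram: %.bam
	samtools view -C -T $(REF) -o $@ $<

# Call variants; the temporary BAM index is removed by hand.
%.vcf.gz: %.dedup.bam
	samtools index $<
	gatk HaplotypeCaller -R $(REF) -I $< -O $@
	rm -f $<.bai
```

Running `make` in that directory should build the VCF and the CRAM, then finish with an `rm` of sample1.sorted.bam and sample1.dedup.bam, while sample1.raw.bam is kept because its pattern is listed under .PRECIOUS. If you would rather keep every intermediate, a bare `.SECONDARY:` line with no prerequisites tells Make to treat all targets as secondary and delete nothing automatically.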