Storage of all SAM/BAM and VCF files generated in a pipeline
8.2 years ago

Hi! I work on data compression but don't have much experience running genomic analysis pipelines like GATK, and I want to get to the bottom of the genomic data storage issue. In one of the GATK pipelines I ran, multiple BAM/SAM files were generated during pre-processing, and then several VCF files were generated during variant discovery. Most of these were used only in the immediately following step of the pipeline. Is it common practice to store all of these files in case a need arises to revisit some of the intermediate results? Any information about this is much appreciated. Thank you!

pipeline sam bam vcf • 2.0k views

I generally use GNU Make, which deletes intermediate files and so saves space. The pipeline can always be rerun to regenerate the results. Saving every intermediate file sometimes just requires too much space. If I feel that some intermediate files are too important to lose, storing them in CRAM format is a good option. I also tend to keep the raw BAM produced by the aligner and delete all other intermediate files. Make can also be told not to delete chosen intermediate files.
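For anyone not using Make, the same policy (keep the aligner's BAM, drop the rest, optionally archive as CRAM) can be sketched in a few lines of Python. This is only an illustration, not how Make does it internally: the function names `clean_intermediates` and `cram_command` and all file names are made up, and the CRAM helper only builds a `samtools view -C -T <reference>` command line without running it.

```python
from pathlib import Path


def cram_command(bam, reference, cram):
    """Build (but do not run) a samtools command converting BAM to CRAM.

    Assumes samtools is installed; -C selects CRAM output and -T names
    the reference FASTA the CRAM is compressed against.
    """
    return ["samtools", "view", "-C", "-T", str(reference),
            "-o", str(cram), str(bam)]


def clean_intermediates(produced, keep):
    """Delete every produced file not listed in `keep`.

    Returns the list of paths that were actually deleted.
    """
    keep_set = {Path(p).resolve() for p in keep}
    deleted = []
    for path in map(Path, produced):
        if path.resolve() not in keep_set and path.exists():
            path.unlink()
            deleted.append(path)
    return deleted
```

In use, one would pass the list of files a run produced plus the short list worth keeping (for example the aligner's raw BAM), and hand the result of `cram_command(...)` to `subprocess.run` for anything that should be archived.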

8.2 years ago
igor 13k

That really depends on your computing infrastructure and how much data you will be processing. If space and compute resources are not a concern, keep everything.

The most important thing is that you keep the code and software used to generate the files. Note the versions of everything. When you make any changes, make sure the original versions are saved somewhere. That way, you can always generate all the files again.
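One way to note the versions is to write a small manifest next to the results on every run. The sketch below is a minimal example under some assumptions: the helper names `tool_version` and `write_manifest` are hypothetical, each tool is assumed to print its version when given a version flag and to exit with status 0, and the first line of output is taken as the version string.

```python
import json
import subprocess
import sys
from datetime import datetime, timezone


def tool_version(command):
    """Run a version command and return the first line it prints.

    Some tools print their version to stderr, so fall back to that.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


def write_manifest(tools, path):
    """Write a JSON manifest mapping tool names to reported versions."""
    manifest = {
        "created": datetime.now(timezone.utc).isoformat(),
        "tools": {name: tool_version(cmd) for name, cmd in tools.items()},
    }
    with open(path, "w") as handle:
        json.dump(manifest, handle, indent=2)
    return manifest


# Example call (the output file name is arbitrary):
# write_manifest({"python": [sys.executable, "--version"]}, "versions.json")
```

Keeping this file under version control together with the pipeline code gives a record of what produced each set of results.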
