I'm trying to store some runs generated by different Illumina sequencers. My idea is to compress them to keep all the files together (and reduce the size, but that is secondary)... however, because of the large number of files and because the total size can be enormous (~120 GB!), I'm having some problems with the compression.
Just out of curiosity, I guess all the files within the BCL folder are essential... but could I delete some of them? Which files are absolutely necessary for demultiplexing?
I just noticed that you are using pigz -0, which means no compression - but then you also have tar czvf, which also compresses the files.
So you see the conundrum here. You are probably starting with already compressed files (BCL files), then you compress those with tar z.
Then you pipe the output into another compressor, pigz -0 -p 32, which has no effect other than adding another layer of gzip on top of a gzip of an already compressed file. So the data has been run through gzip three times at this point. If anything, this process will make your archive larger than the original files.
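If you do want a compressed archive, compress the stream only once and let pigz do it at its default level. A minimal sketch, with a hypothetical run folder name and output file (adjust the thread count to your machine):

    # no 'z' flag on tar, so the stream is compressed exactly once, by pigz
    tar cf - run_folder/ | pigz -p 32 > run_folder.tar.gz

Dropping the z flag from tar is the key change; pigz at its default level then does all the compression, in parallel.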
Usually the BCL files are deleted after the conversion to FASTQ - once it has been established that the conversion has been successful.
In the past, people occasionally reran a conversion with different parameters to get better results, but nowadays you'd only rerun it if the conversion was done incorrectly - for example, because the sample sheet was wrong.
Thus, in general, one would not keep the BCL files around for the long term.
If you belong to a core facility then you may be required to store a copy of the run, depending on what kind of service agreement you provide to your customers for keeping the data available. So keeping a tar archive of the data folder (or at least the FASTQ files) would be the way to go. BCL files from newer/larger sequencers are already compressed and will not be amenable to much further compression, if any.
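Since the BCL/CBCL data is already compressed internally, a plain (uncompressed) tar is often the simplest way to keep a run together as one file. A rough sketch with hypothetical folder names:

    # bundle the whole run folder without recompressing its contents
    tar cf 230115_run.tar 230115_run/

    # or archive only the FASTQ files, if that is all you need to keep
    tar cf 230115_run_fastq.tar path/to/fastq_output/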
Which files are absolutely necessary for demultiplexing?
The entire raw data folder will be required if you wish to demultiplex again using Illumina's bcl-convert or bcl2fastq.
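For reference, demultiplexing from the raw data folder with bcl-convert looks roughly like the sketch below; the paths are placeholders, so check the options against the documentation for your installed version:

    # hypothetical paths; bcl-convert reads the run folder (RunInfo.xml and
    # the BaseCalls data) and writes FASTQ files to the output directory
    bcl-convert \
        --bcl-input-directory 230115_run \
        --sample-sheet 230115_run/SampleSheet.csv \
        --output-directory fastq_out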