I currently have 33 gzipped files in Beagle format. Their combined size is 55,871,611,132 bytes (~55 GB).
I am trying to concatenate them with the following command, which should take the header from a single file and then concatenate all files into one file with a single header.
cat <(zcat Chr1.beagle.gz | head -n 1) \
    <(zcat *.beagle.gz | grep -v -w marker) \
    | gzip > all.beagle.gz
However, the job has taken more than 24 hours at this point, and the current output file size is 207 GB. Should I be multithreading somehow, or is the piping method I'm using inefficient?
You could use pigz, which is a parallel gzip.
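For illustration, assuming pigz is installed on your system, the compression step could be swapped out like this (the -p value is only an example; set it to the number of cores your job actually has):

# same pipeline, but with parallel compression via pigz
cat <(zcat Chr1.beagle.gz | head -n 1) \
    <(zcat *.beagle.gz | grep -v -w marker) \
    | pigz -p 4 > all.beagle.gz

Note that only the compression side is parallelised; the zcat/grep side of the pipe remains single-threaded.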
That's what I was thinking, but does the syntax here look correct? I wonder if I should just wait it out until it finishes, but there's no way for me to assess progress, since the file is being catted and gzipped simultaneously, which prevents me from grepping it to check how far along it is.
How long has the process been running? What is the total size of the input Beagle files, and what is the current size of the concatenated file? How many processors can you use? If I remember correctly, pigz scales roughly linearly up to 4 threads, after which there is some performance degradation.
Taking all of the above into consideration, you can estimate whether it is worth restarting from scratch. Also, did you check whether the current compression is actually consuming CPU, or whether it has stalled?
When I run sacct -j jobid I get the following, suggesting it's using 1 CPU:
Some comments:
Your commands look OK. However, with so many zcat processes running in different subshells (for the process substitutions), the whole command can't finish until all the files are decompressed, the header lines are removed from each of them, and everything is merged and re-zipped. I suspect this may also be hogging your memory. Could you check the swap usage by running free -h? If that shows a lot of used swap, that might be why the processes are running so slowly.

gzip can compress by up to roughly 95%, which is why your output file is already 200+ GB: it has been collecting the uncompressed data from the files, and the final compression will only complete once all of the files have been decompressed.
A better strategy, if memory is the bottleneck, is to decompress the files sequentially and append them one at a time to all.beagle.gz. This will also preserve the order of the files in the final gzip. In fact, if you don't care about the extra header lines, there is a very neat trick by which you can simply cat the gzipped files into the final file; this works because concatenated gzip streams still form a valid gzip file. See here: https://riptutorial.com/bash/example/23063/concatenate-gzipped-files
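A minimal sketch of the sequential approach, assuming the header really is the first line of every file (tail -n +2 is used here instead of grep, as discussed further below):

# write the header once, compressed
zcat Chr1.beagle.gz | head -n 1 | gzip > all.beagle.gz
# append each file one at a time, with its header stripped;
# appending gzip output produces a multi-member gzip that zcat reads fine
for f in *.beagle.gz; do
    zcat "$f" | tail -n +2 | gzip >> all.beagle.gz
done

And if the repeated header lines don't matter, the trick from the link is simply cat *.beagle.gz > all.beagle.gz, with no decompression at all.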
Just wondering: are you sure you need to concatenate and compress all the gzipped files into such a HUGE gzip? If you are just looking to make an archive, tar is a better solution, and it allows you to pull only the files of interest out of the archive without extracting everything.

Another thing: grep -v -w marker will run on every line, compare it against "marker", and then decide whether to keep or drop the line. That can be very costly in files that probably have millions of lines. If you are sure that the header is the first line, there are easier solutions with sed/awk: https://stackoverflow.com/questions/7318497/omitting-the-first-line-from-any-linux-command-output/7318550
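For example, dropping the first line without a per-line word match could look like this (a sketch; note that applied to a single zcat *.beagle.gz stream it removes only the first file's header, so it belongs inside a per-file loop like the one sketched above):

# skip line 1, stream everything else
zcat Chr1.beagle.gz | tail -n +2
# equivalent with sed
zcat Chr1.beagle.gz | sed '1d'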