Concatenation and Gzip Compression of Beagle Files Slow
4.3 years ago
selplat21 ▴ 20

I currently have 33 gzipped files in beagle format. Their combined size is 55,871,611,132 bytes (~55 GB).

I am trying to concatenate them with the following command, which should take the header from a single file and then concatenate all the files into one file with a single header.

cat <(zcat Chr1.beagle.gz \
| head -n 1) <(zcat *.beagle.gz \
| grep -v -w marker) | gzip > all.beagle.gz

However, the job has been running for more than 24 hours and the output file has already reached 207 GB. Should I be multithreading somehow, or is the piping method I'm using inefficient?

sequencing next-gen unix bash • 2.5k views

You could use pigz, which is a parallel gzip.
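A minimal sketch of the pigz swap, with a fallback to plain gzip in case pigz isn't installed (the demo file name is made up):

```shell
# pigz is a drop-in replacement for gzip: its output is a standard
# gzip stream readable by zcat, and -p sets the thread count.
command -v pigz >/dev/null 2>&1 && gz="pigz -p 4" || gz="gzip"

printf 'marker\tallele1\tallele2\nchr1_100\t0\t1\n' > demo.beagle
$gz -c demo.beagle > demo.beagle.gz
zcat demo.beagle.gz | head -n 1    # prints the header line
```

Compression is usually the bottleneck in such pipelines, so parallelizing only that step can already give a large speedup.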


That's what I was thinking, but does the syntax here look correct? I wonder if I should just wait it out until it finishes, but there's no way for me to assess progress: the file is being catted and gzipped simultaneously, which stops me from grepping it to check how far along it is.
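One rough way to peek at progress, assuming the output is an ordinary gzip stream being appended to: zcat will decompress whatever has been flushed so far and only error out at the truncated end, so counting lines gives an approximate running total:

```shell
# Rough progress check on a gzip file that is still being written:
# zcat decompresses the flushed portion, then errors at the truncated
# end, so silence stderr and count whatever came out.
zcat all.beagle.gz 2>/dev/null | wc -l
```

Running this periodically shows whether the line count is still growing, i.e. whether the job is making progress or has stalled.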


How long has the process been running? What is the total size of the input Beagle files, and what is the current size of the concatenated file? How many processors can you use? If I remember correctly, pigz scales roughly linearly up to 4 threads, with some performance degradation beyond that.

You can estimate whether it is worth restarting from scratch by taking all of the above into consideration. Also, did you check whether the current compression is consuming CPU, or whether it has stalled?

  1. The process has been running for 27 hours.
  2. The summed size of the Beagle.gz files is 55 GB. I don't know what the total is uncompressed.
  3. The current size of the concatenated beagle.gz file is 217 GB.
  4. I can use 20 threads on this node, but I'm not sure how many threads zcat uses (probably one). I could use multiple nodes if need be, but I doubt it's necessary.

When I run sacct -j jobid I get the following, suggesting it's using one CPU:

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
6494632          beagle savio2_htc co_rosali+          1    RUNNING      0:0

Some comments:

  • your commands look OK. However, with so many zcat processes running in different subshells (for the process substitutions), the whole command can't finish until all the files are decompressed, the header lines removed from each of them, and everything merged and re-zipped. This can also hog your memory. Could you check the swap usage by running free -h? If that shows a lot of used swap, that might be why the processes are running so slowly.

  • gzip can compress text by up to ~95% - that may be why your final file already holds 200+ GB: it has been collecting the data from the decompressed files, and the job will only finish once all of the inputs have been decompressed and re-compressed.

  • A better strategy, if memory is the problem, is to decompress the files sequentially and append them one by one to all.beagle.gz. This also preserves the order of the files in the final gz. In fact, if you don't care about the extra header lines, there is a very neat trick: you can just cat the gzipped files into the final file

    • cat a.gz b.gz c.gz > abc.gz

    This works. See here: https://riptutorial.com/bash/example/23063/concatenate-gzipped-files
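The trick works because the gzip format allows multiple compressed members in a single file; zcat reads them back as one continuous text stream. A quick demonstration with throwaway files:

```shell
# gzip files concatenated byte-for-byte form a valid multi-member
# gzip stream, so zcat sees one continuous text file.
printf 'a\nb\n' | gzip > part1.gz
printf 'c\nd\n' | gzip > part2.gz
cat part1.gz part2.gz > whole.gz
zcat whole.gz    # a, b, c, d in order
```

Because no decompression or recompression happens at all, this runs at plain disk-copy speed.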

Just wondering, are you sure you need to concatenate and compress all the gzipped files into such a HUGE gzip? If you are just looking to make an archive, tar is a better solution, and it allows you to pull only the files of interest from the archive without untarring the whole thing.
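A sketch of the tar alternative, using made-up file names: the already-compressed inputs are stored as-is (no recompression), and single members can be listed or pulled out later:

```shell
# Build two tiny stand-in .gz files, archive them, and show that a
# single member can be listed and extracted on its own.
printf 'x\n' | gzip > Chr1.beagle.gz
printf 'y\n' | gzip > Chr2.beagle.gz
tar -cf all.tar Chr1.beagle.gz Chr2.beagle.gz
tar -tf all.tar                      # list the archive members
tar -xf all.tar Chr2.beagle.gz       # extract just one member
```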


Another thing: grep -v -w marker runs on every line to compare it against "marker" and then decide whether to keep or remove it. That can be very costly in files like these, which probably have millions of lines. If you are sure that the header is always the first line, there are easier solutions with sed/awk: https://stackoverflow.com/questions/7318497/omitting-the-first-line-from-any-linux-command-output/7318550
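For reference, three equivalent ways to drop only the first (header) line, shown on a made-up demo file:

```shell
# Each of these prints every line except the first.
printf 'marker\nrow1\nrow2\n' > demo.txt
tail -n +2 demo.txt    # start printing at line 2
sed '1d' demo.txt      # delete line 1
awk 'NR > 1' demo.txt  # print lines whose number is > 1
```

Unlike grep -v -w marker, none of these inspect the content of the data rows, so they cannot accidentally drop a data line that happens to contain the word "marker".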

4.3 years ago
h.mon 35k

Size of the summed Beagle.gz files is 55Gb. I don't know what the sum is ungzipped.

Something is wrong - I can't imagine how a sum of 55 GB of compressed files could turn into a single compressed file ~4 times the original size.

In fact, I think you should kill this process before you run out of disk space: you have entered an infinite loop, because all.beagle.gz is itself matched by the zcat *.beagle.gz glob.
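The trap is easy to see by checking what the glob expands to once the output file exists (throwaway files): the reader ends up chasing the writer, so the output grows without bound.

```shell
# Reproduce the self-inclusion: the redirection creates all.beagle.gz
# before zcat's glob is expanded, so the output file becomes an input.
printf 'x\n' | gzip > Chr1.beagle.gz
: > all.beagle.gz      # stands in for the output-file redirection
echo *.beagle.gz       # all.beagle.gz appears among the "inputs"
```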


Oh my gosh. You're right. This seems to work, thanks a bunch!

zcat Chr1.beagle.gz | head -n 1 | gzip > all.beagle.gz

# Strip the header from each file in parallel (tail -n +2 starts at
# line 2; tail -n -1 would instead keep only the last line)
for f in Chr*.beagle.gz ; do
    zcat "$f" | tail -n +2 | gzip > "$f.new" &
done
wait

for f in Chr*.beagle.gz.new ; do
    cat "$f" >> all.beagle.gz
done

You can probably use your original command, provided the zcat globbing doesn't include the output file name. This should work:

cat <(zcat Chr1.beagle.gz \
  | head -n 1) <(zcat Chr*.beagle.gz \
  | grep -v -w marker) | gzip > all.beagle.gz
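A quick sanity check of that command's logic on two tiny made-up files, written here with a command group (equivalent to the process-substitution form, but POSIX sh compatible):

```shell
# Two tiny Beagle-like inputs, each starting with a "marker" header.
printf 'marker\tA\nchr1_1\t0\n' | gzip > Chr1.beagle.gz
printf 'marker\tA\nchr2_1\t1\n' | gzip > Chr2.beagle.gz

# One header, then every non-header line; the Chr* glob cannot match
# the output file all.beagle.gz, so there is no feedback loop.
{ zcat Chr1.beagle.gz | head -n 1
  zcat Chr*.beagle.gz | grep -v -w marker
} | gzip > all.beagle.gz

zcat all.beagle.gz    # one header followed by the rows of both files
```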