Others have said you can do this with programming logic, and Pierre has shown how GNU Parallel could be used to process the blocks. I'd like to expand on the idea of using GNU Parallel to process your blocks, but without reading/writing the blocks to intermediary files. This way you can process your blocks in parallel and avoid the extra disk IO for an already large file.
Let's create your mock input file:
$ echo -e "seq_1\tchr1\t12
seq_1\tchr2\t34
seq_1\tchr3\t57
seq_3\tchr1\t34
seq_3\tchr1\t26
seq_3\tchr4\t47
seq_4\tchr9\t78
seq_5\tchr8\t90
seq_5\tchr7\t77" > input.txt
Now let's look at inserting a record separator at the start of each block - a block starts whenever the first column changes its value. We'll use awk to do this, with "----" as the record separator since we don't expect to find it within our file:
$ awk '$1 != previous{print "----"}{previous=$1}1' input.txt
----
seq_1 chr1 12
seq_1 chr2 34
seq_1 chr3 57
----
seq_3 chr1 34
seq_3 chr1 26
seq_3 chr4 47
----
seq_4 chr9 78
----
seq_5 chr8 90
seq_5 chr7 77
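For readers less familiar with awk, here is the same one-liner written out with comments (behaviour is identical; a shortened sample of the mock input is piped in to keep the example self-contained):

```shell
# Expanded form of the separator-inserting one-liner above.
printf 'seq_1\tchr1\t12\nseq_1\tchr2\t34\nseq_3\tchr1\t34\n' |
awk '
  $1 != previous { print "----" }  # column 1 changed since the last line: new block
  { previous = $1 }                # remember the current block key
  1                                # awk shorthand for { print $0 }: echo the line
'
```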
OK, we can now use GNU Parallel to process our blocks, using ---- as the record start. We also need to remove this separator line from each block before it reaches our command - GNU Parallel's --remove-rec-sep does this for us:
$ awk '$1 != previous{print "----"}{previous=$1}1' input.txt \
| parallel --gnu --keep-order --spreadstdin -N 1 --recstart '----\n' --remove-rec-sep cat
seq_1 chr1 12
seq_1 chr2 34
seq_1 chr3 57
seq_3 chr1 34
seq_3 chr1 26
seq_3 chr4 47
seq_4 chr9 78
seq_5 chr8 90
seq_5 chr7 77
Now we can plug in the my_cmd you want to run on each block, using something like:
$ awk '$1 != previous{print "----"}{previous=$1}1' input.txt \
| parallel --gnu --keep-order --spreadstdin -N 1 --recstart '----\n' --remove-rec-sep my_cmd
To test that GNU Parallel is processing each block separately, we can make the command an awk that prefixes each line with FNR - the line number within that job's input, which restarts at 1 for each block:
$ awk '$1 != previous{print "----"}{previous=$1}1' input.txt \
| parallel --gnu --keep-order --spreadstdin -N 1 --recstart '----\n' --remove-rec-sep "awk '{print FNR,\$0}'"
1 seq_1 chr1 12
2 seq_1 chr2 34
3 seq_1 chr3 57
1 seq_3 chr1 34
2 seq_3 chr1 26
3 seq_3 chr4 47
1 seq_4 chr9 78
1 seq_5 chr8 90
2 seq_5 chr7 77
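As a sanity check, the same per-block numbering can be produced serially with a single awk, no GNU Parallel required; if the parallel pipeline is splitting records correctly, the two outputs should agree (a shortened sample is piped in here to keep the example self-contained):

```shell
# Serial cross-check: reset the counter whenever the block key in
# column 1 changes, mimicking per-job FNR.
printf 'seq_1\tchr1\t12\nseq_1\tchr2\t34\nseq_3\tchr1\t34\nseq_3\tchr4\t47\n' |
awk '$1 != previous { n = 0 } { previous = $1; print ++n, $0 }'
```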
Some advice on using pipelines within GNU Parallel:
- Each GNU Parallel job can use more than 1 core if the command you run is itself a pipeline - reduce the number of jobs run in parallel to account for this.
- Quoting pipelines that mix single and double quotes can be a nightmare - consider moving the pipeline into a single function which you call from GNU Parallel.
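To sketch that second point (the function name `my_cmd` and its body are placeholders of mine, and `export -f` assumes bash), wrap the pipeline in a function so only the function name appears on the GNU Parallel command line:

```shell
# Hypothetical wrapper: the function body stands in for your real per-block pipeline.
my_cmd() {
  awk '{print FNR, $0}'
}
export -f my_cmd   # bash-only: exported functions are visible to the shells parallel spawns

printf 'seq_1\tchr1\t12\nseq_1\tchr2\t34\nseq_3\tchr1\t34\n' \
  | awk '$1 != previous{print "----"}{previous=$1}1' \
  | parallel --gnu --keep-order --spreadstdin -N 1 --recstart '----\n' --rrs my_cmd
```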
Instead of `sed -e "1d"` you can use `--rrs` (`--remove-rec-sep`).

Nathan's observation on cores is not quite true. It is true that they will be run as separate processes, but unless they use exactly the same amount of compute time, they will not use a full core each. So the best advice is to measure. E.g. try with `-j100%` and `-j50%` and see which is faster.

Thanks Ole, that's great info! I've updated my answer to use `--remove-rec-sep` instead of the separate `awk` command. Note, the doc describes `--remove-rec-sep`, `--rrs` and `--removerecsep` but no mention of `--remove-record-separators`.
This is a great explanation! And I think it's a pattern that would be nice to encapsulate somehow (though I guess copy-pasting that awk isn't too bad).