GNU parallel error: Command line too long
6.5 years ago
FGV ▴ 170

Dear all,

I've been using GNU parallel for a while and it works quite well. However, I recently needed to run some very long commands and parallel complained that the command line was too long:

parallel: Error: Command line too long (223235 >= 131049) at input 0: cat /tmp/10110507/adadev34gv_213...

It is a bit weird, since my shell seems to support commands longer than 131049:

$ getconf ARG_MAX
2621440

and had no trouble running a very long command:

$ perl -e 'print("true "."x"x10000000);' | bash
$ echo $?
0

Does anyone know why it is so low and/or how to change it? thanks,

gnu parallel • 9.1k views

Is this bioinformatics related?

Well, it might not seem so at first sight, but I am actually trying to concatenate several FASTA files and align them with MAFFT. :)

That's sufficient to be bioinformatics related, but since we are a "focussed" forum it would be best if you mention that application in your initial question to remove all doubt.

Can you print the output of parallel --max-line-length-allowed?

Sure, it matches with the error message:

$ parallel --max-line-length-allowed
131049

Can you build your command (or parts of it) programmatically, and echo that as a string to awk '{ print length($0) }' or wc -c etc.? This may help track down which command (or which parts) are too long for parallel.

As I said in another post, I have a script with all the commands that I pipe to parallel, and it is the first line that has 223188 characters. I also noticed that there are other lines that won't work either:

1020419
391943
223188
150854
146331

Strangely, I extracted the first command, turned it into an echo and piped it to bash, and it worked fine.... :/
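
The length check suggested above can be sketched as follows. This is only an illustration: the temporary command file and its two lines are made-up stand-ins for the real script, and the limit is hard-coded to the value reported by parallel --max-line-length-allowed in this thread.

```shell
# Hypothetical stand-in for the real command file: one short line, one very long line.
cmds=$(mktemp)
printf 'echo short\n' > "$cmds"
printf 'echo %s\n' "$(head -c 200000 /dev/zero | tr '\0' 'x')" >> "$cmds"

# Limit taken from the output of `parallel --max-line-length-allowed` quoted above.
limit=131049

# Print "line number: length" for every line that parallel would reject.
too_long=$(awk -v limit="$limit" 'length($0) >= limit { print NR ": " length($0) }' "$cmds")
printf '%s\n' "$too_long"

rm -f "$cmds"
```

Run against the real file of commands, this lists exactly which inputs need to be shortened or handled differently.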

6.4 years ago
FGV ▴ 170

I've been thinking a bit more about this and I think I've found a way around it using extglob.

Right now I have the commands like:

cat /long/path/file1 /long/path/file2 /long/path/file3 /long/path/file4 /long/path/file5 [...] /long/path/file10000 | mafft
cat /long/path/file10001 /long/path/file10002 /long/path/file10003 /long/path/file10004 /long/path/file10005 [...] /long/path/file20000 | mafft

But if I use extglob I could have them much shorter:

cat /long/path/file?(1|2|3|4|5 ... |10000) | mafft
cat /long/path/file?(10001|10002|10003|10004|10005 ... |20000) | mafft

So I tried running it through GNU parallel:

shopt -s extglob
. `which env_parallel.bash`
cat file_with_commands.sh | env_parallel [...]

and it seems to work. :)

thanks all for your help...

Presumably, sooner or later the cat line itself will be too long. A lot of commands accept either an argument or a file that contains the arguments, e.g. grep -f {file} or curl -F abc=@file. Sadly, cat doesn't seem to be one of them.

You could stack another parallel:

echo "parallel -k -a file_of_paths cat | mafft" | parallel

It would probably be fairly straightforward to make mafft take a file-of-paths parameter as well, and that would be more reliable!
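
Another way to feed cat from a file of paths is xargs, which reads the paths from stdin and splits them over as many cat invocations as needed. A minimal sketch, assuming one path per line and no whitespace or quotes in the paths; the three tiny FASTA files and the file name file_of_paths are invented for the demo:

```shell
# Hypothetical demo data: three small FASTA files and a list of their paths.
workdir=$(mktemp -d)
for i in 1 2 3; do
    printf '>seq%s\nACGT\n' "$i" > "$workdir/file$i"
done
printf '%s\n' "$workdir/file1" "$workdir/file2" "$workdir/file3" > "$workdir/file_of_paths"

# The command itself stays short no matter how many paths the list contains.
combined=$(xargs cat < "$workdir/file_of_paths")
printf '%s\n' "$combined"

rm -rf "$workdir"
```

With one list file per alignment, each line handed to parallel would be a short command of the same shape, with its output piped to mafft as in the original commands.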

6.5 years ago
ole.tange ★ 4.5k

GNU Parallel pessimistically assumes all characters have to be quoted. For this reason the max line length is half of what you would otherwise expect.

"I have a file with the commands to run (several thousand) and I pipe it to parallel. It seems one of these commands is way too big..."

A command line > 10000 chars, even a generated one, is highly unusual. GNU Parallel normally only hits that limit when copying a big environment (using env_parallel).

Try this to identify the long lines:

awk 'length($0) >= 100000' file_with_commands

If they cannot be written shorter, then you can use this workaround: Give each line on stdin to bash one by one:

cat file_with_commands | parallel --pipe -N1 bash

The biggest disadvantage is that --joblog will not make sense, but if you do not use that, then this solution should be OK.

ADD COMMENT
0
Entering edit mode

But even if GNU parallel assumes quotes, the maximum argument length is still quite low. According to getconf, I should be able to use 2'621'440 characters (see post above). Why is GNU parallel's limit 20 times lower than that?

It seems I have 5 commands with length greater than 100'000 characters. Is there any way to increase or disable this check? thanks,

The problem is in execve, which has a 128 KB limit. In other words: it is not the same limit as the one you see in getconf ARG_MAX.
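
The difference between the two limits can be probed with a small experiment. This is a sketch under assumptions, not a statement about parallel's internals: on Linux with 4 KiB pages the kernel caps each single argument string at 128 KiB (MAX_ARG_STRLEN), independently of ARG_MAX, and as far as I understand parallel hands the whole command to the shell as one such argument. On other systems or page sizes the first test may behave differently.

```shell
# One string of 200000 bytes, larger than 128 KiB (131072 bytes).
big=$(head -c 200000 /dev/zero | tr '\0' 'x')

# Passed as a single argument to an external program: on Linux with 4 KiB pages
# this is expected to fail with "Argument list too long".
if env true "$big" 2>/dev/null; then
    single_arg=accepted
else
    single_arg=rejected
fi

# The same bytes sent to bash on stdin: no argument string is involved at all.
printf 'true %s\n' "$big" | bash
via_stdin=$?

echo "single argument: $single_arg, via stdin exit status: $via_stdin"
```

This would explain why the very long command from the question runs fine when piped to bash but is refused when it has to travel as one argument.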

OK, does that mean that there is no way to increase the execve limit?

What about making GNU parallel more optimistic (and not assume all characters have to be quoted)? :) Would it be possible to have an option for this?

thanks,

I have found no way to increase the execve limit.

What about making it more optimistic? :)

Is the command itself too large, or is it the list of arguments/files that you're passing to it that is exceeding the limit?

It is the list of arguments that is too large.

Can you chunk your file list using split or similar? Or is it required for all of the arguments to be in that command?

Well, the command is basically a cat of several files that is then piped into MAFFT. I guess I could split the cat into several cats and pipe everything at the end, but that is a bit error-prone and I'd like to avoid it if possible.

Why not cat all the files beforehand, and pass the file either directly to MAFFT, or via STDIN (if you have your heart set on piping)?

A workaround for cating more files than the commandline can handle would be to build up a list of the files using find and then -exec, then simply tell it to append the files in the list. You can probably do this with xargs too if you want parallelisation of some form.

But it is exactly the cat that breaks the limit because I am doing it on several thousand files. I guess I could do the cat directly on the terminal (no parallel), and then use parallel to run all the alignments since these are the time intensive steps...

Yeah so your problem is not with parallel, it's with the Unix cli limit, so you need to be a little cleverer about how you're doing it.

Besides, concatenating 10,000 single-line files is the same as concatenating 10 files of 1,000 lines each.

I would use find to build up the list and do the concatenation so that you have a single file ready to go, if you don't want to do the chunking of files manually:

https://unix.stackexchange.com/questions/76418/concatenating-thousands-of-files-vs
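
A minimal sketch of the find-based concatenation, assuming the FASTA files of one alignment sit in a single directory; the directory name, the .fa extension and the output name combined.fa are all invented for the demo. find batches the paths itself, so no single command line ever has to hold the full list:

```shell
# Hypothetical demo data: five small FASTA files in one directory.
workdir=$(mktemp -d)
mkdir "$workdir/fasta"
for i in 1 2 3 4 5; do
    printf '>seq%s\nACGT\n' "$i" > "$workdir/fasta/file$i.fa"
done

# "-exec cat {} +" runs cat on as many files per call as the system allows.
# The output file lives outside the searched directory so it is never read back in.
find "$workdir/fasta" -type f -name '*.fa' -exec cat {} + > "$workdir/combined.fa"

n_records=$(grep -c '^>' "$workdir/combined.fa")
echo "records in combined file: $n_records"

rm -rf "$workdir"
```

Note that find does not guarantee any particular file order. The resulting single file can then be passed to the aligner in a short command that parallel accepts.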

Hmmm, I think it is a parallel issue (or rather execve), since I can run the commands directly on the terminal.

From what I understood, parallel uses execve to run the commands, and that has a much smaller buffer (apparently 20x smaller) than the terminal limit (seen as getconf ARG_MAX).

Perhaps you're right, but I think my point still stands. I think to expect a significant change in how parallel handles CLI args is wishful thinking (especially for something that is a little bit of an edge case), so you'd be better off coming up with a robust way to get around this. There are loads and loads of threads about the fastest/best way of concatenating large numbers of files etc, so I really would strongly advise you to just rethink your process before you get as far as parallel.

6.5 years ago

The first solution that comes to my mind is to put parts of it in a bash script, e.g. do_stuff.sh:

#!/bin/bash
INPUT=$1
VARIABLE=$2
OUTPUT=$3
# Quote the variables so that paths containing spaces survive word splitting
command_1 "$INPUT" | command_2 | command_3 "$VARIABLE" > "$OUTPUT"

and then use that script with parallel:

ls *.fastq | parallel -j 8 'do_stuff.sh {} foobar {.}_output'

Don't know if that fits what you are doing

That is actually what I am doing... I have a file with the commands to run (several thousand) and I pipe it to parallel. It seems one of these commands is way too big...
