Question

How do I use GNU parallel with two inputs?

2.1 years ago

fb143 ▴ 10

I am running the following shortbred script with GNU parallel. Each run should take two input files (a read pair) and write a single .tsv file, but I am getting separate _1.tsv and _2.tsv files and directories as output. How do I get output like this instead:

SRR059331_dir

SRR059331.tsv

SRR059339_dir

SRR059339.tsv

A bash for loop produces the expected results above, but it is very time-consuming. How do I convert it to GNU parallel? This is my bash script:

for i in test_files/*_1.fastq.gz; do
    # Sample name: strip the directory and the _1.fastq.gz suffix
    F=$(basename "$i" _1.fastq.gz)
    # -p avoids an error if the directory already exists
    mkdir -p test_out/"$F"_dir
    python shortbred/shortbred_quantify.py --markers markers.faa --wgs "$F"_1.fastq.gz "$F"_2.fastq.gz --results test_out/"$F".tsv --tmp test_out/"$F"_dir --usearch ./shortbred/usearch
done

This is my script so far for GNU parallel

#!/bin/bash

#Create a sub-directory for ShortBRED output
mkdir test_out

time parallel -j 10 \
"python shortbred/shortbred_quantify.py \
--markers markers.faa \
--wgs {1} {2} \
--results test_out/{1/.}.tsv \
--tmp test_out/{1/.}_dir \
--usearch shortbred/usearch" ::: test_files/fastq/*_1.fastq.gz :::+ test_files/fastq/*_2.fastq.gz

parallel GNU • 2.7k views

Updated 2.0 years ago by ole.tange ★ 4.5k • written 2.1 years ago by fb143 ▴ 10

Someone will probably provide an exact answer, but in the meantime see if the answer here helps: GNU parallel command with several multiple arguments

As noted, always try --dry-run first to see which commands parallel will run.

Reply • 2.1 years ago by GenoMax 150k

Have you determined which part of the process is time-consuming? Spawning multiple instances of your Python process may not be as economical as simply increasing the number of threads that usearch uses, for example.

Reply • 2.1 years ago by Joe 22k

score 5 · Answer 1 · 2023-04-03

2.1 years ago

Istvan Albert 102k

See the parallel manual on input sources:

https://www.gnu.org/software/parallel/parallel_tutorial.html#input-sources

for example:

parallel --link echo {1} and {2} ::: A B C ::: D E F

will print:

A and D
B and E
C and F

Written 2.1 years ago by Istvan Albert 102k

I think parallel --link is deprecated. I tried different methods, but they didn't work. I ended up concatenating the inputs into a single file and passing a single argument. Thank you both for your help.

Reply • 2.1 years ago by fb143 ▴ 10

--link is fully supported, but what you may be thinking of is :::+ which links two inputs, and which is newer. --link will link all inputs and is thus less flexible.

Compare:

parallel --link echo ::: a b c ::: d e f ::: g h i
parallel echo ::: a b c :::+ d e f ::: g h i
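
To make the difference concrete: --link combines the i-th value of every input source, while :::+ links only the source it introduces to the previous one, so the following ::: source is still combined with every linked pair. The sketch below is a plain-bash emulation of the argument tuples each command enumerates. It does not call GNU parallel, and the function names are made up for illustration; real parallel output may also appear in a different order unless -k is given.

```shell
#!/usr/bin/env bash
# Plain-bash emulation of the argument tuples; GNU parallel itself is not called.
# Function names are hypothetical and exist only for this illustration.

# Emulates: parallel --link echo ::: a b c ::: d e f ::: g h i
# All three sources are linked: the i-th values are combined -> 3 tuples.
emulate_link_all() {
  local a=(a b c) b=(d e f) c=(g h i) i
  for i in 0 1 2; do
    echo "${a[i]} ${b[i]} ${c[i]}"
  done
}

# Emulates: parallel echo ::: a b c :::+ d e f ::: g h i
# Only the first two sources are linked; each linked pair is then
# combined with every value of the third source -> 9 tuples.
emulate_link_first_two() {
  local a=(a b c) b=(d e f) c=(g h i) i x
  for i in 0 1 2; do
    for x in "${c[@]}"; do
      echo "${a[i]} ${b[i]} $x"
    done
  done
}

emulate_link_all
echo "---"
emulate_link_first_two
```

The first function prints 3 lines (a d g, b e h, c f i); the second prints 9, beginning with a d g, a d h, a d i.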

Reply • 2.0 years ago by ole.tange ★ 4.5k

What makes you think that the feature is deprecated?

It is listed on the help page I linked above (and below):

https://www.gnu.org/software/parallel/parallel_tutorial.html#input-sources

It is an essential and very handy feature, so it seems unlikely that it would be removed.

Reply • 2.1 years ago by Istvan Albert 102k

score 3 · Answer 2 · 2023-04-08

For readability I would always use a function:

doit() {
  python shortbred/shortbred_quantify.py \
    --markers markers.faa \
    --wgs "$1" "$2" \
    --results test_out/"$3".tsv \
    --tmp test_out/"$3"_dir \
    --usearch shortbred/usearch
}
export -f doit

time parallel -j10 --plus doit {} {/_1.fastq/_2.fastq} {/.} ::: test_files/fastq/*_1.fastq.gz
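
Note that {/.} removes the directory and only the last extension, so for test_files/fastq/SRR059331_1.fastq.gz it yields SRR059331_1.fastq rather than SRR059331. One possible way to get the SRR059331.tsv naming from the question is to pass only the _1 file and derive everything else inside the function with bash parameter expansion. A minimal sketch, assuming the paired files always end in _1.fastq.gz and _2.fastq.gz; the helper names sample_name and mate_path are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: derive the mate file and the sample name from the _1 path alone.
# Assumes every pair is named <sample>_1.fastq.gz / <sample>_2.fastq.gz.

# Strip the directory and the _1.fastq.gz suffix:
# test_files/fastq/SRR059331_1.fastq.gz -> SRR059331
sample_name() {
  local base=${1##*/}
  printf '%s\n' "${base%_1.fastq.gz}"
}

# Swap the _1 suffix for _2, keeping the directory:
# test_files/fastq/SRR059331_1.fastq.gz -> test_files/fastq/SRR059331_2.fastq.gz
mate_path() {
  printf '%s\n' "${1%_1.fastq.gz}_2.fastq.gz"
}

doit() {
  local r1=$1 r2 sample
  r2=$(mate_path "$r1")
  sample=$(sample_name "$r1")
  # Create the per-sample tmp directory, as the original for loop did
  mkdir -p "test_out/${sample}_dir"
  python shortbred/shortbred_quantify.py \
    --markers markers.faa \
    --wgs "$r1" "$r2" \
    --results "test_out/${sample}.tsv" \
    --tmp "test_out/${sample}_dir" \
    --usearch shortbred/usearch
}

# Export the functions so the shells started by GNU parallel can see them
export -f sample_name mate_path doit
```

With these defined, something like parallel -j10 --dry-run doit {} ::: test_files/fastq/*_1.fastq.gz should print the commands that would be run; dropping --dry-run executes them.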