I am running GNU parallel on the ShortBRED script below. The script should take two input files and write one .tsv file per sample, but instead I am getting _1.tsv and _2.tsv files and directories as output. How do I get output like this:
SRR059331_dir
SRR059331.tsv
SRR059339_dir
SRR059339.tsv
A bash for loop produces the expected results above, but it is very time-consuming. How do I convert this bash script to GNU parallel? This is my bash script:
for i in test_files/*_1.fastq.gz; do
F=$(basename "$i" _1.fastq.gz);
mkdir -p test_out/"$F"_dir;
python shortbred/shortbred_quantify.py --markers markers.faa --wgs "$F"_1.fastq.gz "$F"_2.fastq.gz --results test_out/"$F".tsv --tmp test_out/"$F"_dir --usearch ./shortbred/usearch;
done
This is my GNU parallel script so far:
#!/bin/bash
#Create a sub-directory for ShortBRED output
mkdir test_out
time parallel -j 10 \
"python shortbred/shortbred_quantify.py \
--markers markers.faa \
--wgs {1} {2} \
--results test_out/{1/.}.tsv \
--tmp test_out/{1/.}_dir \
--usearch shortbred/usearch" ::: test_files/fastq/*_1.fastq.gz :::+ test_files/fastq/*_2.fastq.gz
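A probable cause of the unexpected names: {1/.} removes the directory and only the last extension (.gz), so the _1.fastq part of the first file's name stays in the --results and --tmp paths. The snippet below is a minimal sketch that imitates this with plain shell parameter expansion; the sample path is just an example taken from the IDs above, and no parallel call is involved.

```shell
# Example path only; any *_1.fastq.gz file behaves the same way.
path=test_files/fastq/SRR059331_1.fastq.gz

base=${path##*/}             # directory removed, like {1/}
one_ext=${base%.*}           # last extension removed, like {1/.}
sample=${base%_1.fastq.gz}   # whole _1.fastq.gz suffix removed (the wanted name)

echo "$one_ext"   # SRR059331_1.fastq
echo "$sample"    # SRR059331
```

So the output names need the whole _1.fastq.gz suffix stripped, not just one extension.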
Someone will probably provide an exact answer, but see if the answer here helps in the meantime: GNU parallel command with several multiple arguments
As noted, always try --dry-run to see what parallel will use.
Have you determined what the time-consuming part of the process is? Spawning multiple instances of your python process may not be as economical as simply increasing the number of threads that usearch is using, for example.
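One way to get SRR059331.tsv and SRR059331_dir is to move the naming logic into a bash function and let parallel call it once per read pair. This is only a sketch under assumptions: quantify_pair, run_all and the DRY_RUN variable are hypothetical names introduced here, the paths are copied from the question, and it is untested against ShortBRED itself.

```shell
#!/bin/bash
# Hypothetical helper: runs ShortBRED for one read pair.
# The sample name is the first file's basename without _1.fastq.gz.
quantify_pair() {
    local r1=$1 r2=$2
    local sample
    sample=$(basename "$r1" _1.fastq.gz)
    local cmd=(python shortbred/shortbred_quantify.py
        --markers markers.faa
        --wgs "$r1" "$r2"
        --results "test_out/${sample}.tsv"
        --tmp "test_out/${sample}_dir"
        --usearch shortbred/usearch)
    if [ -n "${DRY_RUN:-}" ]; then
        # Print the command instead of running it.
        printf '%s\n' "${cmd[*]}"
    else
        mkdir -p "test_out/${sample}_dir" && "${cmd[@]}"
    fi
}
# Make the function visible to the shells that parallel starts.
export -f quantify_pair

# Hypothetical wrapper: first argument is the number of jobs (default 10).
run_all() {
    parallel -j "${1:-10}" quantify_pair {1} {2} \
        ::: test_files/fastq/*_1.fastq.gz :::+ test_files/fastq/*_2.fastq.gz
}
```

Running DRY_RUN=1 quantify_pair on one pair of files prints the full python command, which serves the same purpose as --dry-run. Calling run_all 4 would then start four jobs at a time; a lower job count may be the better choice if usearch is already using several threads per job.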