I am using a Python script provided by the DEXSeq package to count exons. I have to run the same script on 50 BAM files in my directory. Currently I am doing this with a for loop, iterating over the files one by one, but this step takes too long. Is there an easy way to run the same Python script in parallel for each file separately, so that I don't have to wait for each file to finish? I know this should be possible in bash, but I don't have any experience with it.
I am currently using the following code, and each file takes around 1 hour to finish:
#!/bin/bash
#$ -cwd
#$ -o $HOME/exonCount.out
#$ -e $HOME/exonCount.err
#$ -V
#$ -q all.q
for i in $( ls -v /mnt/RNA_seq/*.bam )
do
x="$(basename $i | cut -d'.' -f1 )"
pathToFiles=$i
#run python code to count exons
python3.8 /mnt/python_scripts/dexseq_count.py -p yes -r pos -s no -f bam \
/mnt/gff/gencodev26_DEXSeq.gff \
$pathToFiles \
/mnt/dexseq/${x}_ExonCount.out
done
P.S. I can run it on more cores.
Thanks in advance!
GNU Parallel - Parallelize Serial Command Line Programs Without Changing Them
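If GNU Parallel is available on the machine (or a node) with enough cores, the serial loop can be handed to it almost unchanged. A minimal sketch, assuming 8 simultaneous jobs (-j 8) and that each BAM file name has a single extension, so GNU Parallel's {/.} replacement (basename with the extension removed) gives the same sample name as the cut -d'.' -f1 used above:

# {} expands to each BAM path, {/.} to its basename without the .bam extension
parallel -j 8 \
    python3.8 /mnt/python_scripts/dexseq_count.py -p yes -r pos -s no -f bam \
    /mnt/gff/gencodev26_DEXSeq.gff {} /mnt/dexseq/{/.}_ExonCount.out \
    ::: /mnt/RNA_seq/*.bam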
I did not realize you were on a cluster. SLURM offers job arrays, and the scheduler you use probably has something similar. That is probably the preferred approach here.
Yes indeed, but I have not been able to use an SGE array job successfully yet.
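For reference, an SGE array job replaces the loop with a single submission that spawns one task per file. A minimal sketch, assuming there are exactly 50 BAM files and that ls -v lists them in a stable order; the -t 1-50 range and the queue name are placeholders to adjust ($TASK_ID in the -o/-e paths keeps a separate log per task):

#!/bin/bash
#$ -cwd
#$ -V
#$ -q all.q
#$ -t 1-50
#$ -o $HOME/exonCount.$TASK_ID.out
#$ -e $HOME/exonCount.$TASK_ID.err

# Build the file list and pick the BAM belonging to this task's index
FILES=( $( ls -v /mnt/RNA_seq/*.bam ) )
BAM="${FILES[$((SGE_TASK_ID - 1))]}"
x="$(basename "$BAM" | cut -d'.' -f1)"

python3.8 /mnt/python_scripts/dexseq_count.py -p yes -r pos -s no -f bam \
    /mnt/gff/gencodev26_DEXSeq.gff \
    "$BAM" \
    /mnt/dexseq/${x}_ExonCount.out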
Looks like you are using SGE. So the trick here would be to use the for loop to submit an independent SGE job for each BAM file. You should be able to build a qsub command with the necessary parameters to do so. The jobs would start in parallel (to the extent allowed for your account in terms of resources; the rest would pend and then complete over time).
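A minimal submission-loop sketch along those lines, assuming qsub's -b y (submit the command directly instead of copying a script) is permitted on your cluster; the job name and per-sample log paths are placeholders to adapt, and the queue and file locations are reused from the question:

#!/bin/bash
# Submit one independent SGE job per BAM file; they run side by side as the queue allows
for i in $( ls -v /mnt/RNA_seq/*.bam )
do
    x="$(basename "$i" | cut -d'.' -f1)"
    qsub -cwd -V -q all.q \
         -N "exonCount_${x}" \
         -o "$HOME/${x}_exonCount.out" \
         -e "$HOME/${x}_exonCount.err" \
         -b y python3.8 /mnt/python_scripts/dexseq_count.py -p yes -r pos -s no -f bam \
         /mnt/gff/gencodev26_DEXSeq.gff \
         "$i" \
         /mnt/dexseq/${x}_ExonCount.out
done

After submitting, qstat should show the ~50 jobs running or pending next to each other.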