Executing a Python script in parallel for multiple files in a directory
2.4 years ago
osiemen ▴ 30

I am using a Python script provided by the DEXSeq package to count exons. I have to run the same script on 50 BAM files in my directory. Currently I do this with a for loop that processes the files one by one, but this takes too long. Is there an easy way to run the same Python script in parallel on each file separately, so that I don't have to wait for one file to finish before the next starts? I know this should be possible in bash, but I don't have any experience with it.

I am currently using the following code; each file takes around 1 hour to finish:

#!/bin/bash
#$ -cwd
#$ -o $HOME/exonCount.out
#$ -e $HOME/exonCount.err
#$ -V
#$ -q all.q

for i in /mnt/RNA_seq/*.bam
do
  # sample name: file name up to the first dot
  x="$(basename "$i" | cut -d'.' -f1)"
  pathToFiles="$i"
  # run python code to count exons
  python3.8 /mnt/python_scripts/dexseq_count.py -p yes -r pos -s no -f bam \
    /mnt/gff/gencodev26_DEXSeq.gff \
    "$pathToFiles" \
    "/mnt/dexseq/${x}_ExonCount.out"
done

P.S. I can run it on more cores.

Thanks in advance!

DEXSeq RNA-Seq bash parallel • 1.7k views

I did not realize you were on a cluster. SLURM offers job arrays, and the scheduler you use probably has something similar. That is probably the preferred approach here.


Yes indeed, but I have not been able to use an SGE array job successfully yet.


Looks like you are using SGE. The trick here would be to use the for loop to submit an independent SGE job for each BAM file. You should be able to build a qsub command with the necessary parameters to do so. The jobs would start in parallel, up to the resources allowed for your account; the rest would stay pending and run as slots free up.
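A minimal sketch of that submit-in-a-loop idea, under some assumptions: `count_one.sh` is a hypothetical wrapper script (not from this thread) that would run the `dexseq_count.py` command on the BAM path passed as `$1`, and the `exon_` job-name prefix is made up. The queue and directory come from the question. By default the loop only prints the `qsub` commands so they can be checked first.

```shell
#!/bin/bash
# Sketch: submit one independent SGE job per BAM file.
# count_one.sh is a hypothetical wrapper that runs dexseq_count.py on "$1".
# DRY_RUN=1 (the default here) only prints the qsub commands; DRY_RUN=0 submits them.
bam_dir="${BAM_DIR:-/mnt/RNA_seq}"
dry_run="${DRY_RUN:-1}"

submit_jobs() {
    local bam sample
    local -a cmd
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue                 # glob matched nothing
        sample="$(basename "$bam" .bam)"
        # -N must not start with a digit, hence the assumed "exon_" prefix
        cmd=(qsub -cwd -V -q all.q -N "exon_${sample}" count_one.sh "$bam")
        if [ "$dry_run" = 1 ]; then
            printf '%s\n' "${cmd[*]}"             # show what would be submitted
        else
            "${cmd[@]}"                           # actually submit the job
        fi
    done
}

submit_jobs
```

Running it as `DRY_RUN=0 bash submit_all.sh` (the script name is arbitrary) would hand each file to the scheduler as its own job.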

2.4 years ago

If you want to distribute the jobs across multiple nodes, you can use SGE array jobs. Set the number of tasks to the number of files you wish to process, e.g. #$ -t 1-10 for 10 files, and use the task ID as an index to look up the BAM file name in a list.

ls /mnt/RNA_seq/*.bam > files.list

Example SGE script.

#!/bin/bash
#$ -N test
#$ -cwd
#$ -t 1-10
#$ -e logs/test.err
#$ -o logs/test.out

# Get the n-th BAM file name / path
bam=$(awk 'NR==$SGE_TASK_ID' files.list)

To run the job on a single node with multiple files processed in parallel, use GNU Parallel as suggested by ATpoint.
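A rough sketch of the single-node GNU Parallel variant, assuming GNU Parallel is installed and reusing the paths from the question. The number of simultaneous jobs (4) is an arbitrary assumption, and `{/.}` removes only the last extension, which differs slightly from the `cut -d'.' -f1` in the original loop when a file name contains several dots.

```shell
#!/bin/bash
# Sketch: run dexseq_count.py on several BAM files at once on one node.
# Paths are taken from the question; JOBS=4 is an arbitrary assumption.
bam_dir="${BAM_DIR:-/mnt/RNA_seq}"
out_dir="${OUT_DIR:-/mnt/dexseq}"
jobs="${JOBS:-4}"

count_all() {
    # {}   = full path of one BAM file
    # {/.} = its basename with the last extension removed (sample.bam -> sample)
    # Extra arguments (for example --dry-run) are passed through to parallel.
    parallel "$@" -j "$jobs" \
        python3.8 /mnt/python_scripts/dexseq_count.py -p yes -r pos -s no -f bam \
        /mnt/gff/gencodev26_DEXSeq.gff {} "$out_dir"/{/.}_ExonCount.out \
        ::: "$bam_dir"/*.bam
}

# Only print the commands here; call count_all without --dry-run to execute them.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    count_all --dry-run
fi
```

Inside an SGE job, the value of `-j` should not exceed the number of slots actually requested for that job.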


Hi, thanks for the input!

I actually tried the following based on your code, but it doesn't seem to work:

#!/bin/bash
#$ -N test
#$ -cwd
#$ -t 1-3
#$ -e $HOME/test.err
#$ -o $HOME/test.out
#$ -q all.q@bla

#Get n th bam file name / path
BAM=$( awk 'NR==$SGE_TASK_ID' /mnt/home1/project/bamFiles.list )
#run python code to count exons
python3.8 /mnt/DEXSeq/python_scripts/dexseq_count.py -p yes -r pos -s no -f bam \
/mnt/nochr_gencodev29.gff \
$BAM  \
/mnt/xomics/osmana/dexseq/humandata/countData/RNA.$SGE_TASK_ID

Here it seems that dexseq_count.py doesn't receive all the correct parameters, and I guess it is because of $SGE_TASK_ID.

The error: .../python_scripts/dexseq_count.py: Error: Please provide three arguments

Is it because of how I use the awk variable, or because of $BAM? Providing just a file name without $BAM or $SGE_TASK_ID seems to work just fine.


Can you try the following?

BAM=$(awk -v "line=$SGE_TASK_ID" 'NR==line {print $1}' /mnt/home1/project/bamFiles.list)
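A small demonstration of why the `-v` form behaves differently, using a throwaway list with made-up file names. Inside single quotes the shell does not expand `$SGE_TASK_ID`, so awk sees an unset awk variable named `SGE_TASK_ID`, treats `$SGE_TASK_ID` as `$0`, and compares the line number with the whole line. No line matches, `$BAM` ends up empty, and the Python script receives only two arguments.

```shell
#!/bin/bash
# Demo on a temporary list; the file names below are made up.
list="$(mktemp)"
printf '%s\n' /data/a.bam /data/b.bam /data/c.bam > "$list"
SGE_TASK_ID=2   # set by SGE inside a real array task; set by hand for this demo

# Single quotes: the shell passes $SGE_TASK_ID to awk literally.
# awk evaluates it as $0 (unset variable -> field 0), so NR==$0 never matches.
broken="$(awk 'NR==$SGE_TASK_ID' "$list")"

# -v copies the shell value into the awk variable "line" before the program runs.
fixed="$(awk -v "line=$SGE_TASK_ID" 'NR==line {print $1}' "$list")"

echo "broken='$broken' fixed='$fixed'"   # prints: broken='' fixed='/data/b.bam'
rm -f "$list"
```

Quoting `"$BAM"` in the python3.8 command and checking that it is non-empty before calling the script would make this kind of failure easier to spot.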
