Question

Run Hundreds Of Bwa Commands Without Waiting

13.9 years ago

Bioscientist ★ 1.7k

Hi guys, I'm analyzing some high-coverage trio data, so I need to run BWA on hundreds of fastq.gz files. Obviously I should write a script for this rather than typing in hundreds of commands one by one and waiting for each to finish. But as a beginner without coding experience, I don't know how to do it.

For example, I just put

bwa aln -t 24 index file1>1.sam
bwa aln -t 24 index file2>2.sam
bwa aln -t 24 index file3>3.sam
...............
..............

into the script and ran it, and it doesn't work at all. I know I must be missing something, say, the path to the fastq files.

Can anyone give me a pattern for a script that runs multiple jobs like this? Thanks.

bwa • 8.0k views

score 10 · Answer 1 · 2011-06-15

13.9 years ago

Farhat ★ 2.9k

GNU parallel could also help with something like this. It can be handled in a single line, with multiprocessing, using something like this (untested):

parallel bwa aln -t 24 index {} ">" {.}.sam ::: file*

score 8 · Answer 2 · 2011-06-15

13.9 years ago

Aleksandr Levchuk 3.2k

Programming boils down to 2 organization things: variables and functions; and 2 action things: if-statements and for-loops.

All you need here is a for-loop and a variable (let's name it "i").

In Bash the code would be:

for i in `seq -w 3 111`; do
   echo "bwa aln -t 24 index file${i} > ${i}.sam"
done

Output:

bwa aln -t 24 index file003 > 003.sam
bwa aln -t 24 index file004 > 004.sam
bwa aln -t 24 index file005 > 005.sam
...
bwa aln -t 24 index file109 > 109.sam
bwa aln -t 24 index file110 > 110.sam
bwa aln -t 24 index file111 > 111.sam

To put this into a script and run it, do this:

# Generate Script
for i in `seq -w 3 111`; do
   echo "bwa aln -t 24 index file${i} > ${i}.sam"
done > my_script.sssh

# Make script executable
chmod +x my_script.sssh

# Run script
./my_script.sssh

Why not then use seq -w 0010022 0010077 and "SRR$i"? I can tell that you haven't tried answering my "what happens when" questions.

13.9 years ago by Aleksandr Levchuk 3.2k

What happens when you run seq -w 3 111 in Bash by itself? What happens when you run it without the -w?

13.9 years ago by Aleksandr Levchuk 3.2k

Thanks guys, but the names of the fastq files here are not numbers like 1, 2, 3, 4... but things like SRR0010022, so seq -w 3 111 doesn't really work.

13.9 years ago by Bioscientist ★ 1.7k
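
For file names that are not plain counters, one option is to loop over the files themselves and derive each output name from the input name. A minimal sketch, assuming the inputs sit in the current directory and end in .fastq.gz (the SRR*.fastq.gz pattern and the make_commands name are illustrative, not from the thread). Like the answer above, it only prints the commands so they can be checked first:

```shell
#!/bin/sh
# Print one bwa command per input file, naming the output after the input.
# Assumptions: inputs match SRR*.fastq.gz in the current directory;
# "index" is the placeholder index name used in the question.
make_commands() {
    for fq in SRR*.fastq.gz; do
        [ -e "$fq" ] || continue      # no match: the pattern stays literal, skip it
        base=${fq%.fastq.gz}          # SRR0010022.fastq.gz -> SRR0010022
        echo "bwa aln -t 24 index $fq > $base.sam"
    done
}

make_commands
```

Redirecting the output to a file (make_commands > my_script.sssh) gives a script that can be made executable and run the same way as the generated script above.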

score 6 · Answer 3 · 2011-06-15

You probably don't want to run hundreds of alignment jobs serially on a single-CPU machine, but neither do you want to launch hundreds of simultaneous jobs which will compete for CPU, and worse, exhaust memory. I tend to use a makefile to control parallelism. That is, I'd have a Makefile with a target like

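# Pattern rule: build any X.sam from the matching X.txt.
# The recipe line must start with a tab character, not spaces.
%.sam: %.txt
	bwa aln -t 24 index $< > $@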

I can then do something like

make -j 8 file-{3..111}.sam

which will issue the relevant bwa commands in parallel, limited to eight simultaneous jobs. Another advantage is of course that make won't re-create existing files.

If you want to do this in shell, it's a bit less flexible, but you can get away with

for a in {0..9}; do
  for b in {0..9}; do
     bwa aln -t 24 index file-$a$b.txt > file-$a$b.sam &
  done
  wait
done

where the inner loop spawns jobs in the background (due to &) and wait pauses until each batch of ten jobs has finished before launching the next ten.
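
The same batching idea works for arbitrary file names if you count jobs instead of nesting loops. A rough sketch, not tested on real data: run_batched and align_one are names invented here, align_one only prints the command (swap in the real bwa call once the output looks right), and the batch size of 4 is arbitrary. It also skips inputs whose output already exists, which imitates the make behaviour mentioned above:

```shell
#!/bin/sh
# align_one is a hypothetical stand-in for the real job: it only prints the
# command. Replace the echo with the actual bwa call when ready.
align_one() {
    echo "bwa aln -t 24 index $1 > ${1%.txt}.sam"
}

# run_batched N FILE...: start up to N jobs in the background, wait for the
# whole batch to finish, then start the next batch.
run_batched() {
    batch_size=$1
    shift
    running=0
    for f in "$@"; do
        [ -e "$f" ] || continue               # unmatched pattern or missing input
        [ -e "${f%.txt}.sam" ] && continue    # output already there: skip, like make
        align_one "$f" &
        running=$((running + 1))
        if [ "$running" -ge "$batch_size" ]; then
            wait                              # pause until this batch is done
            running=0
        fi
    done
    wait                                      # wait for the last, partial batch
}

run_batched 4 file-*.txt
```

As with the nested loops, a batch is only as fast as its slowest job, so some CPUs may sit idle near the end of each batch.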

score 2 · Answer 4 · 2011-06-15

A variation on Aleksandr's approach:

for f in {3..111}; do
  i=$(printf "%03d" "$f")
  bwa aln -t 24 index file$i > $i.sam
done
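
The printf "%03d" call is what keeps the numbers three digits wide, which matches what seq -w produces. A quick way to see the difference, as a sketch that assumes a seq supporting -w (GNU coreutils has it); the pad3 helper name is made up for the example:

```shell
#!/bin/sh
# pad3 is a hypothetical helper: zero-pad a number to three digits.
pad3() {
    printf '%03d' "$1"
}

seq 8 11               # plain counting, one per line: 8 9 10 11
seq -w 8 11            # equal width, one per line:    08 09 10 11
echo "file$(pad3 8)"   # prints file008
```

Pass pad3 unpadded numbers: printf reads a leading 0 as octal, so an argument like 08 is rejected.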

If you end each command with &, all the processes run in the background simultaneously, which may overwhelm your server:

for f in {3..111}; do
  i=$(printf "%03d" "$f")
  bwa aln -t 24 index file$i > $i.sam &
done

score 1 · Answer 5 · 2011-06-15

13.9 years ago

Sean Davis 27k

Think about using a simple batch system such as SLURM, or the slightly more complicated Sun Grid Engine, for your machine(s) if you are getting into second-generation sequencing analysis, even on a single machine. It is quite liberating to simply throw jobs into a queue and let the batch system deal with the consequences. Naming jobs, deleting them, controlling resource utilization (fewer jobs running during the day, for example), and tracking job progress are all benefits. Of course, you pay a price in added complexity, but we have found it to be worth it for our small group.
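
As a rough illustration of what throwing jobs into a queue can look like with SLURM, the sketch below writes one small job script per input file and prints the sbatch command that would submit it. Everything specific here is an assumption for the example: the SRR*.fastq.gz naming, the 8 CPUs per job, the write_job helper and the .job.sh file names. Your cluster may need extra options such as a partition or a memory request:

```shell
#!/bin/sh
# write_job is a hypothetical helper: create a SLURM job script for one input.
# The #SBATCH lines name the job, request 8 CPUs and set the log file.
write_job() {
    fq=$1
    base=${fq%.fastq.gz}
    cat > "$base.job.sh" <<EOF
#!/bin/sh
#SBATCH --job-name=bwa_$base
#SBATCH --cpus-per-task=8
#SBATCH --output=$base.log
bwa aln -t 8 index $fq > $base.sam
EOF
}

for fq in SRR*.fastq.gz; do
    [ -e "$fq" ] || continue               # nothing matched the pattern
    write_job "$fq"
    echo "sbatch ${fq%.fastq.gz}.job.sh"   # printed, not run: drop the echo to submit
done
```

Matching -t to --cpus-per-task keeps each bwa job within the CPUs the scheduler has reserved for it.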