Question

Bwa Option That Stops After N Minutes

3

Entering edit mode

13.4 years ago

2184687-1231-83- ★ 5.1k

I would like to know if it's possible, or if not, to propose it, the option of running bwa alignment for N minutes from a list of input sequences, that would produce output up until the point where N minutes have been used for the computation of the alignment of the initial fraction of query sequences. For example:

bwa bwasw -s 10 -N 120 target.prefix query.fa

This command will process 10 sequences at a time in query.fa and produce their output alignments up until the point where about 120 minutes have been used in the calculation. So a fraction of the sequences at the beginning of query.fa will be processed if they take longer, and the rest of the sequences won't be processed.

It may be that this can already be achieved by clever piping in and out of the input data. Any ideas?

Edit: the doalarm solution doesn't seem to work for me, since it returns no sequences if it's stopped before finishing. See below:

I run it here with different $sec values, and from 190seconds onwards, it gives all results for the 556 query sequences:

~/src/doalarm/doalarm-0.1.7/doalarm $sec \
~/src/bwa/latest/bwa/bwa bwasw -z 1000 -s 10 \
reference_genome.fa.gz querysequences.fa > /tmp/ex.$sec.sam

  0  /tmp/ex.100.sam
  0  /tmp/ex.110.sam
  0  /tmp/ex.120.sam
  0  /tmp/ex.130.sam
  0  /tmp/ex.140.sam
  0  /tmp/ex.150.sam
  0  /tmp/ex.160.sam
  0  /tmp/ex.170.sam
  0  /tmp/ex.180.sam
  0  /tmp/ex.182.sam
  0  /tmp/ex.184.sam
  0  /tmp/ex.186.sam
 142765  /tmp/ex.190.sam
 142765  /tmp/ex.200.sam

bwa feature • 6.0k views

ADD COMMENT • link updated 13.4 years ago by Damian Kao 16k • written 13.4 years ago by 2184687-1231-83- ★ 5.1k

0

Entering edit mode

I am confused,why did you put a bounty? did the alarm version by DK not work? Did you test it?

ADD REPLY • link 13.4 years ago by Michael 55k

0

Entering edit mode

@Michael Dondrup: It doesn't seem to work for me, see edit.

ADD REPLY • link 13.4 years ago by 2184687-1231-83- ★ 5.1k

0

Entering edit mode

I think what we have seen until now is that, bwa (unlike e.g. samtools) is not unix pipe (|) friendly in the sense that you cannot really chunk the processing.

ADD REPLY • link 13.4 years ago by Michael 55k

0

Entering edit mode

avilella, how variable is the time needed to map a sequence? can you make an empirical table which tells you that, say, mapping x sequences takes less than y minutes 95% of the times?

ADD REPLY • link 13.4 years ago by xenophiliuslovegood ▴ 180

score 6 · Answer 1 · 2011-12-11

I would not try to interrupt BWA, but rather create a script to feed the sequences to BWA for n minutes:

Create a Perl script that writes sequences to STDOUT for n minutes e.g. using Perl alarm as other answers have suggested. When you do this catch the signal or use some other way to ensure that the last sequence is written completely.

Create a FIFO using mkfifo as a query input for BWA.

mkfifo /tmp/my_query_fifo
./my_query_producer.pl -sec 7200 > /tmp/my_query_fifo

Elsewhere

bwa bwasw my_reference_prefix /tmp/my_query_fifo

When the producer exits, BWA will exit cleanly after writing its results.

Okay, here are my tests. First of all note that the bwasw chunksize parameter is not settable from the command line in bwa-0.6.1. When you see smaller chunks than 10^7 in the following output, it's because I've added this line to bwtsw2main.c:

case 'C': opt->chunk_size = atoi(optarg); break;

This gives an extra option -C <INT> to set chunk size. This is just for my covenience - the test still works on vanilla bwa, but you will see different numbers.

In a shell:

mkfifo wibble
bwa bwasw -C 10000 c1215.fasta wibble > out.sam

In my Emacs running sbcl under Slime:

(asdf:load-system :cl-genomic)
(in-package :bio-sequence-user)

(defparameter *f* (open "/Users/keith/wibble" :direction :output
                        :if-exists :append))

(with-seq-input (seqi "/Users/keith/c1215_reads_1.fastq" :fastq)
  (loop
     while (has-more-p seqi)
     do (write-fastq-sequence (next seqi) *f*))))

In the first shell I now see:

[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...
[bsw2_aln] read 286 sequences/pairs (10010 bp)...

and so on. I can repeat the Lisp command several times, each time getting more alignments. Finally I can do

(close *f*)

and in the first shell I see

[bsw2_aln] read 276 sequences/pairs (9660 bp)...
[main] Version: 0.6.1-r104
[main] CMD: bwa bwasw -C 10000 c1215.fasta wibble
[main] Real time: 691.224 sec; CPU: 1.208 sec

that is the incomplete last chunk being processed.

Istvan Albert · Answer 2 · 2011-12-07

3

Entering edit mode

13.4 years ago

Damian Kao 16k

You could do it with a shell command. Here is shell scripts that'll send a kill command after set amount of time: http://www.cyberciti.biz/faq/shell-scripting-run-command-under-alarmclock/

Why exactly do you want to do this though? Are you renting out time on someone else's server? There really is no guarantee that sequences that take longer to process will be at the beginning. I guess you can try to sort your sequence so longest sequences are at the beginning of the file. But even that is no guarantee depending on complexity of the sequence.

ADD COMMENT • link updated 12.6 years ago by Istvan Albert 102k • written 13.4 years ago by Damian Kao 16k

1

Entering edit mode

I was first thinking about kill but that is not a good idea. Killing the process might end in incomplete write operation of the output, e.g. files not closed properly, missing output, etc.

ADD REPLY • link 13.4 years ago by Michael 55k

1

Entering edit mode

No, it might be worth a try (+1). alarm in perl actually sends a SIGALARM after timeout, so all depends on bwa interpreting it correctly (closing, flushing buffers), which is not a kill signal. One could achive the same by using kill -14 <procid> on the commandline i suppose.

ADD REPLY • link 13.4 years ago by Michael 55k

0

Entering edit mode

Yeah that's true. Perhaps bin the reads into smaller files and run them one after another. Kill the entire process after 120 minutes. Then remove the last bin of reads because it potentially can contain incomplete files. Then merge the other alignment bams.

ADD REPLY • link 13.4 years ago by Damian Kao 16k

score 1 · Answer 3 · 2011-12-07

1

Entering edit mode

13.4 years ago

Michael 55k

I could imagine the following script:

split the input file into chunks of appropriate size (e.g 1000-10000 sequences)
run the chunks through bwa in a loop
if time is used up, exit and concatenate the output

This is not very efficient though I think, and might have other effects (e.g. multiple SAM headers which need to be removed). So I would also question why you need this feature. Could it be an alternative to use the -t parameter to allow for parallel threads to decrease wall-clock time instead?

ADD COMMENT • link 13.4 years ago by Michael 55k

1

Entering edit mode

Agreed, try the alarm way then. If bwa is C and includes POSIX.h it could work.

ADD REPLY • link 13.4 years ago by Michael 55k

0

Entering edit mode

Thanks @Michael Dondrup for the suggestion. I don't have hard data on this, but I think bwa bwasw would take some initial time in loading the target indexed genome each time, so as you say, it would not be efficient if this is done over and over for each chunk. Maybe someone has some numbers on how long it takes to load up the human genome up until the point where the first sequence is queried...

ADD REPLY • link 13.4 years ago by 2184687-1231-83- ★ 5.1k

score 1 · Answer 4 · 2011-12-11

As reading the other peoples comments I agree that the reason WHY you want to do this is rather unclear. A proper/better fitting solution would come up if you could slightly elaborate a bit more on its exact purpose.

My go with this would be (assuming fastq but for fasta it would go as well with some mods) to chunck the input fastq file in lines of 4 (one sequence and quality) and pipe them to align using bwa capturing its output with an incremental output file name. Loop over the complete input file and merge all sam files afterwards. If one alignment fails/halts or takes for ever you only miss the latest sequence in processing. You could even use the bash timing functions to limit it to time.

Setting a 'trap { cmd } EXIT' in bash with the consequtive command/script (cmd) to merge and cleanup/move files will help in any (remote) recovery. If wanted, you can regain speed by speeding-up the process using bash 'PARALLEL' function (you can get it from gnu.org and its really powerfull). Parallelisation power is what you will loose by processing all sequences sequentially I guess.

My 2ct

score 1 · Answer 5 · 2012-09-27

If you're piping output to bwasw ... in a read-producer/consumer model, you should also use the unix "buffer" command.

We ran into this issue while developing a pipeline in which we were generating a continuous stream of long reads, but we wanted them to be alignment-filtered in real-time. bwasw is great for this, but it has to be buffered for paired-end reads... since it a "chunk" of reads from read1 and another "chunk" from read2.

EXAMPLE of using a buffer to accelerate pipes:

mkfifo buffer.in buffer.out
buffer -i buffer.in -o buffer.out &
head -1000 query.fq > buffer.in
bwa bwasw -s 10 -N 120 hg19.prefix buffer.out > output.250.sequences