Question

bwa mem runs slowly the first time

0

10.2 years ago

Hatem Elshazly ▴ 60

Hi there,

I'm using bwa mem for alignment but I noticed this behavior which I don't fully understand:

I created large AWS EC2 instances (60 GB RAM and 32 cores), installed bwa, downloaded the human reference (~3 GB), and indexed it. The bwa command is very straightforward: bwa mem human_ref.fasta input.fq

What happens is that the first time I use the command, it takes a long time. It doesn't output anything to stdout or stderr, but I noticed it is loading something into RAM (I think it's the reference index). After it loads 5 GB or so, bwa runs fast enough given the small input size (a few megabytes). This scenario only happens the first time I run the command on the machine; later runs don't take as long and finish reasonably fast.

Is this normal? Why doesn't bwa take such a long time after the first run?

Any help is appreciated.

Thanks,
Shazly

ec2 bwa-mem • 7.4k views

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 10.2 years ago by Hatem Elshazly ▴ 60

3

I think you are right about the loading of the reference index into memory. Also, I seem to recall that the system bwa uses for memory mapping of the index allows it to be reused in subsequent runs. That's why things get faster after the first run. If you load your memory with something else between two runs, your second run should be slow, too.
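
If the speed-up does come from the operating system keeping the index files in memory after the first read, one way to test it is to read the index files once before the first alignment and see whether that run is then fast as well. A minimal sketch, assuming the index files sit next to the reference and share its name as a prefix; `warm_index` is a hypothetical helper name, not part of bwa:

```shell
#!/usr/bin/env bash
# warm_index is a hypothetical helper, not a bwa command.
# It reads every file whose name starts with "<reference>." so the OS can
# keep the contents in its file cache, and prints how many files it read.
warm_index() {
    local ref="$1"
    local count=0
    local f
    for f in "$ref".*; do
        # Skip the unexpanded glob pattern when nothing matches.
        [ -f "$f" ] || continue
        cat "$f" > /dev/null
        count=$((count + 1))
    done
    echo "$count"
}

# Example (file names taken from the question):
#   warm_index human_ref.fasta
#   time bwa mem human_ref.fasta input.fq > out.sam
```

Whether the files actually stay cached depends on how much free memory the machine has, so this is a test of the hypothesis rather than a guaranteed fix.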

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 10.2 years ago by thackl ★ 3.0k

Ram · Answer 1 · 2015-05-18

0

10.2 years ago

donfreed ★ 1.6k

Before aligning reads, bwa must generate an index file (an FMD-index of the reference genome). The first time you run the command, the index is generated but subsequent runs can use the previously generated index file.

ADD COMMENT • link 10.2 years ago by donfreed ★ 1.6k

0

Thanks for the reply, but is this file saved in a tmp directory or something? I didn't find it in either the reference directory or the working directory.

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 10.2 years ago by Hatem Elshazly ▴ 60

0

The index files should be in the same directory as the reference. They should have the same base as the reference, but should also have additional extensions.

For example, if your genome is human_g1k_v37.fasta, bwa will generate human_g1k_v37.fasta.bwt and additional files.
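
To check whether the index files are really there, something like the sketch below could be used. It assumes the classic `bwa index` layout, which writes five files with the extensions .amb, .ann, .bwt, .pac and .sa next to the reference; `check_bwa_index` is a hypothetical helper name, not a bwa command:

```shell
#!/usr/bin/env bash
# check_bwa_index is a hypothetical helper, not a bwa command.
# It prints each expected index file that is missing next to the reference
# and returns non-zero if at least one is missing.
check_bwa_index() {
    local ref="$1"
    local rc=0
    local ext
    for ext in amb ann bwt pac sa; do
        if [ ! -f "$ref.$ext" ]; then
            echo "missing: $ref.$ext"
            rc=1
        fi
    done
    return "$rc"
}

# Example:
#   check_bwa_index human_g1k_v37.fasta || echo "run 'bwa index' first"
```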

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 10.2 years ago by donfreed ★ 1.6k

0

Actually I incorrectly assumed BWA would generate the index files automatically if they are not present. I just checked and it will not, so I have no idea why BWA would run more slowly for the first run and more quickly on subsequent runs.

ADD REPLY • link 10.2 years ago by donfreed ★ 1.6k

0

Your data is probably cached somewhere after it is read for the first time. The second time you reference the data, it is retrieved from the cache rather than from your hard disk, which is faster.

ADD REPLY • link 5.1 years ago by karel • 0

Ram · Answer 2 · 2015-05-19

0

10.2 years ago

dariober 15k

Hi, my tests seem to confirm what the OP describes and what @thackl suggests in their comment:

Aligning a dummy sequence file with one read to the mouse reference genome:

# First run:
time bwa mem /lustre/.../Mus_musculus_NCBI_v37/mmu.fa test.fa
...
real    0m8.466s
user    0m0.127s
sys    0m6.410s

# Second run
time bwa mem /lustre/.../Mus_musculus_NCBI_v37/mmu.fa test.fa
...
real    0m2.282s
user    0m0.116s
sys    0m2.129s

# Third run:
time bwa mem /lustre/.../Mus_musculus_NCBI_v37/mmu.fa test.fa
...
real    0m2.169s
user    0m0.100s
sys    0m2.041s

I tried on a couple of different nodes and the picture stays the same: the first run is ~4x slower than the following runs.
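
For anyone who wants to repeat the comparison, here is a rough sketch of a loop that runs the same command several times and prints the wall-clock time of each run. `time_runs` is a hypothetical helper name; it only has whole-second resolution and discards the command's output, so it is meant for spotting a slow first run, not for benchmarking:

```shell
#!/usr/bin/env bash
# time_runs is a hypothetical helper: it runs a command N times and prints
# the elapsed wall-clock seconds of each run (whole seconds only).
time_runs() {
    local n="$1"
    shift
    local i start end
    for i in $(seq 1 "$n"); do
        start=$(date +%s)
        # Output of the command is discarded; only the timing is reported.
        "$@" > /dev/null 2>&1
        end=$(date +%s)
        echo "run $i: $((end - start))s"
    done
}

# Example (reference path is a placeholder):
#   time_runs 3 bwa mem ref.fa test.fa
```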

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 10.2 years ago by dariober 15k
