Question

Bfast And Mpileup - How Time Consuming?

1

14.0 years ago

Travis ★ 2.9k

Hi all,

I need to make an approximate prediction (in terms of days) on how long it would take to:

Align 150 samples each consisting of 10 million 100bp Illumina paired end reads to the human genome and run mpileup for each.

The machine I have at my disposal has 256 GB of RAM and 4x6 core 3500MHz processors for a total of 24 cores. The storage is a locally attached RAID5 SCSI.

Apologies if any pertinent details are missing - I am having to calculate these things in the absence of any practical experience!

Thanks in advance.

samtools snp hardware • 5.9k views

ADD COMMENT • link 14.0 years ago by Travis ★ 2.9k

2

Any chance you can run 1 sample to use as a baseline? The good news is that you have plenty of memory and SCSI storage. A couple of tips that can boost speed: 1. Load your indexes and reference into memory. 2. Create a RAID0 filesystem for bfast to use as scratch space.

ADD REPLY • link 14.0 years ago by Drio ▴ 920

1

Are those 150 barcoded, or do you have to process them one by one? I won't be able to give you timings, but the answer will surely help anyone who can. Also, knowing the processors' speed and the type of storage you have will definitely be of great help.

ADD REPLY • link 14.0 years ago by Jorge Amigo 14k

1

My suggestion would be to run 1 million reads from one of your samples through your pipeline, then multiply by 1500 to get a ballpark estimate. Besides giving timing estimates, I find this helps with debugging any issues, so that the actual run can be hands-off.
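A minimal sketch of that approach, assuming plain-text FASTQ input with exactly 4 lines per read. The function names and file paths are illustrative, not part of any tool mentioned in this thread, and the first reads in a file are not a random sample, so treat the result as a ballpark only.

```python
from itertools import islice

def head_fastq(src_path, dst_path, n_reads):
    """Copy the first n_reads FASTQ records (4 lines each) to dst_path.

    Returns the number of records actually written.
    """
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        while written < n_reads:
            record = list(islice(src, 4))
            if len(record) < 4:  # end of file, or a truncated last record
                break
            dst.writelines(record)
            written += 1
    return written

def extrapolate_hours(test_hours, test_reads, total_reads):
    """Scale a measured run time linearly to the full data set."""
    return test_hours * total_reads / test_reads

# Hypothetical example: if the 1-million-read test took 2 hours,
# 150 samples x 10 million reads would take about 3000 hours at the same rate.
print(extrapolate_hours(2.0, 1_000_000, 150 * 10_000_000))
```

For paired-end data the same number of records would need to be taken from both files so that mates stay in sync.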

ADD REPLY • link 14.0 years ago by Brad Chapman 9.7k

0

Thanks guys. I have added the further detail to the question also.

ADD REPLY • link 14.0 years ago by Travis ★ 2.9k

0

Also - I cannot perform any tests at the minute as I am awaiting admin permission to deploy new software on our server!

ADD REPLY • link 14.0 years ago by Travis ★ 2.9k

0

Thanks for the tips, but regarding testing one sample, the answer is above your question :)

ADD REPLY • link 14.0 years ago by Travis ★ 2.9k

0

Have you decided on bfast, or might you use another aligner? The thing is that bfast can work in different modes of precision, e.g. depending on the number of genome indices used. This makes a big difference in the memory used and therefore in the number of processes that can be run in parallel on your machine.

ADD REPLY • link 14.0 years ago by Sophia ▴ 300

0

I had basically decided to go with BFast for this one. The purpose is SNP/Indel discovery so fair accuracy is required.

ADD REPLY • link 14.0 years ago by Travis ★ 2.9k

score 5 · Answer 1 · 2011-05-10

5

14.0 years ago

Sophia ▴ 300

I ran several samples of 16 million reads each (SOLiD single end) using bfast with one index of hg19:

Using 4 processors on a machine with 24 GB RAM, each sample took approx. 5 hours from reads to sam/bam. I was using GATK for the variant calls, though. Mpileup takes less than half an hour for one of these samples. In my experience, the limiting factor in bfast is RAM. Please take into consideration that for each instance of alignment that is run using bfast and its indices, at least 12 GB of RAM should be made available.

I ran one sample using 4 indices (1 primary and 3 secondaries), which took approx. 8-9 hours on the same machine. Using more than one index does not proportionally increase processing time, since secondary indices are only used to align reads that did not align to the primary index, and it does not use more RAM since indices are used sequentially.
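As a rough illustration of how these figures constrain the 24-core, 256 GB machine from the question, the sketch below assumes each bfast instance needs 12 GB of RAM and 4 cores, and (a large assumption) that the 5 hours per sample measured on SOLiD single-end data with one index carries over unchanged. The function names are made up for this example.

```python
def max_parallel_instances(total_ram_gb, ram_per_instance_gb,
                           total_cores, cores_per_instance):
    """Number of alignment instances that fit, limited by RAM or by cores."""
    by_ram = total_ram_gb // ram_per_instance_gb
    by_cores = total_cores // cores_per_instance
    return min(by_ram, by_cores)

def batch_wall_hours(n_samples, hours_per_sample, instances):
    """Wall-clock hours if samples are processed in rounds of `instances`."""
    rounds = -(-n_samples // instances)  # ceiling division
    return rounds * hours_per_sample

instances = max_parallel_instances(256, 12, 24, 4)
print(instances)                            # RAM allows 21, cores allow 6
print(batch_wall_hours(150, 5, instances))  # hours for 150 samples
```

Under these assumptions the core count, not RAM, is the limit on this machine: 6 instances at a time, 25 rounds, about 125 hours for alignment alone.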

ADD COMMENT • link 14.0 years ago by Sophia ▴ 300

0

I was planning to use the 10 indexes recommended for the human genome in the supporting documentation. Having 10x the RAM you used and 6x the processors should hopefully make this achievable in reasonable time!

ADD REPLY • link 14.0 years ago by Travis ★ 2.9k

0

One more thing: make sure to use the -U option in the bfast postprocess step. Otherwise this step will run for days.

ADD REPLY • link 14.0 years ago by Sophia ▴ 300

score 4 · Answer 2 · 2011-05-10

4

14.0 years ago

Nilshomer ▴ 100

I would also suggest that, since you have so much RAM, you use the "-l" option with "bfast match" to load all the indices into memory at once, instead of having to process each serially and merge temporary files. Post back your results!

ADD COMMENT • link 14.0 years ago by Nilshomer ▴ 100

0

Will do Nils. Are there any other settings you would recommend for a study of this type?

ADD REPLY • link 14.0 years ago by Travis ★ 2.9k

score 1 · Answer 3 · 2011-05-10

1

14.0 years ago

Istvan Albert 102k

Well, here is a guesstimate: 1 million reads on a single core might take between 1 and 3 hours to process.

ADD COMMENT • link 14.0 years ago by Istvan Albert 102k

0

So assuming the upper limit of 3 hours per million reads, the 1,500 million reads need about 4,500 core-hours; dividing by 24 cores gives roughly 188 hours, or approx 8 days.
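The arithmetic behind that figure, as a small sketch. It assumes 150 samples of 10 million reads each (1,500 million reads in total) and perfect scaling across all 24 cores, which is optimistic; the function name is illustrative.

```python
def estimated_days(reads_millions, hours_per_million_per_core, cores):
    """Wall-clock days, assuming the work splits evenly across all cores."""
    core_hours = reads_millions * hours_per_million_per_core
    return core_hours / cores / 24  # 24 hours per day

total_millions = 150 * 10  # 150 samples x 10 million reads
print(estimated_days(total_millions, 1, 24))  # lower bound of the guesstimate
print(estimated_days(total_millions, 3, 24))  # upper bound, about 7.8 days
```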

ADD REPLY • link 14.0 years ago by Travis ★ 2.9k

0

I am actually curious to see how this turns out in practice - guesstimating can be notoriously off.

ADD REPLY • link 14.0 years ago by Istvan Albert 102k

score 1 · Answer 4 · 2011-05-19

To update on this benchmark:

I generated one million 100bp paired end reads with bgeneratereads and aligned them to the human genome with bfast. I used all ten recommended indexes, each stored in RAM. I ran the test once with 16 cores and once with 24 cores. There was no other server load when the tests were run.

With 16 cores the test took 80 minutes. This included approx 23 minutes to read the indexes into memory.

With 24 cores it actually took longer: 95 minutes (including the same 23 minutes to read the indexes into RAM).

The command I used was

    bfast match -f hg19.fa -r ../TestReads/bgen.reads -l -k 22 -K 100 -M 500 -n 16 > aligned.bmf
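A rough extrapolation from the 16-core numbers above, as a sketch only. It assumes the match time grows linearly with the number of reads, that the 23-minute index load is paid once per sample, that samples run one after another, and it covers the `bfast match` step alone (no local alignment, postprocessing or mpileup). Simulated reads may also behave differently from real data. The function name is illustrative.

```python
def match_minutes(reads_millions, load_min=23.0, total_min_per_million=80.0):
    """Estimated 'bfast match' minutes for one sample on 16 cores.

    Splits the measured 80 minutes into a fixed index-load cost and a
    per-million-reads cost, then scales only the latter.
    """
    per_million = total_min_per_million - load_min  # 57 minutes of matching
    return load_min + per_million * reads_millions

per_sample = match_minutes(10)           # one 10-million-read sample
all_samples_days = 150 * per_sample / 60 / 24
print(per_sample)        # minutes per sample
print(all_samples_days)  # days for 150 samples, run serially
```

Under these assumptions each sample needs close to 10 hours of matching and the full set about two months, well above the earlier 8-day guess, so any saving from tuning the parameters or the thread count would matter.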

score 0 · Answer 5 · 2011-05-18

0

14.0 years ago

Travis ★ 2.9k

Nils - I will soon be running the test (indexes should be generated tomorrow).

I just wanted to re-check whether there are any other alignment parameters you recommend?

Also, I just want to confirm that with my 24 core machine, the most cores I can utilize is 16 (4 to the power of 2)?

ADD COMMENT • link 14.0 years ago by Travis ★ 2.9k

0

The number of threads must be a power of 2 only for the bfast index command. For other commands, you can use e.g. 20 or 24.

ADD REPLY • link 14.0 years ago by Sophia ▴ 300

0

Thanks Sophia - I found this out myself a few minutes ago when I ran with 24. Unfortunately it actually seems to run slower with 24 cores than it did with 16!!!

ADD REPLY • link 14.0 years ago by Travis ★ 2.9k

score 0 · Answer 6 · 2011-05-24

0

14.0 years ago

Travis ★ 2.9k

BUMP

Does anyone have any suggestions as to why the benchmark might be slower with more threads?

ADD COMMENT • link 14.0 years ago by Travis ★ 2.9k