Bfast And Mpileup - How Time Consuming?
13.6 years ago
Travis ★ 2.8k

Hi all,

I need to make an approximate prediction (in terms of days) on how long it would take to:

Align 150 samples each consisting of 10 million 100bp Illumina paired end reads to the human genome and run mpileup for each.

The machine I have at my disposal has 256 GB of RAM and 4x6 core 3500MHz processors for a total of 24 cores. The storage is a locally attached RAID5 SCSI.

Apologies if any pertinent details are missing - I am having to calculate these things in the absence of any practical experience!

Thanks in advance.

samtools snp hardware • 5.5k views

Any chance you can run one sample to use as a baseline? The good news is that you have plenty of memory and SCSI storage. A couple of tips that can boost speed: 1. Load your indexes and reference into memory. 2. Create a RAID0 filesystem for bfast to use as scratch space.


Are those 150 barcoded, or do you have to process them one by one? I won't be able to give you timings, but knowing this will surely help anyone who can. Also, knowing the processors' speed and the type of storage you have will definitely be a great help too.


My suggestion would be to run 1 million reads of one of your samples through your pipeline, then multiply by 1500 to get a ballpark estimate. In addition to giving timing estimates, I find this helps with debugging any issues, so that the actual run can be hands off.
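
One way to carve out such a test set is to copy the first N records from each FASTQ file. This is only a minimal sketch: it assumes uncompressed FASTQ with exactly four lines per read, and the file names in the usage comment are hypothetical.

```python
from itertools import islice


def take_first_reads(src_path, dst_path, n_reads):
    """Copy the first n_reads FASTQ records from src_path to dst_path.

    Assumes an uncompressed FASTQ file with exactly 4 lines per record.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(islice(src, 4 * n_reads))


# Hypothetical usage for a paired-end sample: use the same n_reads for
# both mate files so that the pairs stay in sync.
# take_first_reads("sample_1.fastq", "subset_1.fastq", 1_000_000)
# take_first_reads("sample_2.fastq", "subset_2.fastq", 1_000_000)
```

Timing the pipeline on the subset and scaling up assumes run time grows roughly linearly with the number of reads, which is worth checking with a second, larger subset.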


Thanks guys. I have added the further detail to the question also.


Also - I cannot perform any tests at the minute as I am awaiting admin permission to deploy new software on our server!


Thanks for the tips but RE the testing one sample, the answer is above your question :)


Have you decided on bfast, or might you use another aligner? The thing is that bfast can work in different modes of precision, e.g. depending on the number of genome indices used. This would make a big difference in terms of memory used, and therefore in the number of processes that can be run in parallel on your machine.


I had basically decided to go with BFast for this one. The purpose is SNP/Indel discovery so fair accuracy is required.

13.6 years ago
Sophia ▴ 300

I ran several samples of 16 million reads each (SOLiD single end) using bfast with one index of hg19:

Using 4 processors on a machine with 24 GB RAM, each sample took approx. 5 hours from reads to SAM/BAM. I was using GATK for the variant calls, though. Mpileup takes less than half an hour for one of these samples. In my experience, the limiting factor in bfast is RAM. Please take into consideration that for each instance of alignment that is run using bfast and its indices, at least 12 GB of RAM should be made available.

I ran one sample using 4 indices (1 primary and 3 secondary), which took approx. 8-9 hours on the same machine. Using more than one index does not proportionally increase processing time, since secondary indices are only used to align reads that did not align to the primary index, and it does not use more RAM since indices are used sequentially.
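
To put rough numbers on the RAM limit, here is a minimal sketch of how many bfast instances could run side by side on the machine from the question (256 GB, 24 cores), taking the 12 GB and 4 processors per instance from my runs. Treating my 5 hours per sample as if it carried over to 10 million paired-end Illumina reads is a strong assumption, so the final figure is only an order-of-magnitude guide, and the per-instance RAM would need replacing for a different index setup.

```python
import math


def max_concurrent_instances(total_ram_gb, ram_per_instance_gb,
                             total_cores, cores_per_instance):
    """Number of instances that fit, limited by whichever runs out first."""
    by_ram = total_ram_gb // ram_per_instance_gb
    by_cores = total_cores // cores_per_instance
    return min(by_ram, by_cores)


instances = max_concurrent_instances(256, 12, 24, 4)
rounds = math.ceil(150 / instances)   # batches needed for 150 samples
hours = rounds * 5                    # ASSUMPTION: 5 h per sample, as in my runs

print(instances)  # 6 (cores are the limit here, not RAM)
print(rounds)     # 25
print(hours)      # 125
```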


I was planning to use the 10 indexes recommended for the human genome in the supporting documentation. Having 10x the RAM you used and 6x the processors should hopefully make this achievable in reasonable time!


One more thing: make sure to use the -U option in the bfast postprocess step. Otherwise this step will run for days.

13.6 years ago
Nilshomer ▴ 100

I would also suggest that, since you have so much RAM, you could use the "-l" option with "bfast match" to load all the indices into memory at once, instead of having to process each serially and merge temporary files. Post back your results!


Will do Nils. Are there any other settings you would recommend for a study of this type?

13.6 years ago

Well, here is a guesstimate: 1 million reads on a single core might take between 1 and 3 hours to process.


So assuming the upper limit of 3 hours per million reads, the 1500 million reads come to about 4500 core-hours; dividing by 24 cores gives roughly 188 hours, i.e. approx 8 days.
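
Spelling that arithmetic out for both ends of the guesstimate, as a minimal sketch: it assumes the work divides perfectly across all 24 cores, which is optimistic, and the function name is just for illustration.

```python
def wall_clock_days(total_million_reads, core_hours_per_million, cores):
    """Days of wall-clock time, assuming perfect scaling across cores."""
    core_hours = total_million_reads * core_hours_per_million
    return core_hours / cores / 24


total_million_reads = 150 * 10  # 150 samples x 10 million reads each

for rate in (1, 3):  # core-hours per million reads, from the guesstimate
    print(rate, round(wall_clock_days(total_million_reads, rate, 24), 1))
# 1 2.6
# 3 7.8
```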


I am actually curious to see how this turns out in practice - guesstimating can be notoriously off.

13.6 years ago
Travis ★ 2.8k

To update on this benchmark:

I generated one million 100bp paired end reads with bgeneratereads and aligned them to the human genome with bfast. I used all ten recommended indexes, each stored in RAM. I ran the test once with 16 cores and once with 24 cores. There was no other server load when the tests were run.

With 16 cores the test took 80 minutes. This included approx 23 minutes to read the indexes into memory.

With 24 cores it actually took longer: 95 minutes (including the same 23 minutes to read the indexes into RAM).

The command I used was

    bfast match -f hg19.fa -r ../TestReads/bgen.reads -l -k 22 -K 100 -M 500 -n 16 > aligned.bmf
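
For what it is worth, here is a rough extrapolation of these timings to the full job, as a minimal sketch. It assumes that the match time (total minus the 23 minutes of index loading) scales linearly with the number of reads, that the indexes are loaded once per sample, and that samples run one after another. It covers only the `bfast match` step timed above, not the later alignment steps or mpileup.

```python
def projected_days(bench_total_min, samples=150, million_reads_per_sample=10,
                   index_load_min=23):
    """Extrapolate a 1-million-read benchmark to the whole data set.

    ASSUMPTIONS: linear scaling with read count, one index load per
    sample, samples processed sequentially.
    """
    match_min_per_million = bench_total_min - index_load_min
    per_sample_min = index_load_min + million_reads_per_sample * match_min_per_million
    return samples * per_sample_min / (60 * 24)


print(round(projected_days(80), 1))  # 16-core timing -> 61.8
print(round(projected_days(95), 1))  # 24-core timing -> 77.4
```

If those assumptions hold even approximately, the match step alone would take far longer than the earlier guesstimates in this thread.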
13.6 years ago
Travis ★ 2.8k

Nils - I will soon be running the test (indexes should be generated tomorrow).

I just wanted to re-check whether there are any other alignment parameters you recommend?

Also, I just want to confirm that with my 24 core machine, the most cores I can utilize is 16 (4 to the power of 2)?


The number of threads must be a power of 2 only for the bfast index command. For other commands, you can use e.g. 20 or 24.
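
If you want to pick the thread count for the index step automatically, here is a minimal sketch that finds the largest power of 2 not exceeding the number of cores. The helper name is made up for illustration, and the power-of-2 rule itself is as described above rather than something the code checks against bfast.

```python
def largest_power_of_two_at_most(n):
    """Return the largest power of 2 that is <= n (n must be >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n.bit_length() - 1)


print(largest_power_of_two_at_most(24))  # 16
```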


Thanks Sophia - I found this out myself a few minutes ago when I ran with 24. Unfortunately it actually seems to run slower with 24 cores than it did with 16!!!

13.6 years ago
Travis ★ 2.8k

BUMP

Does anyone have any suggestions as to why the benchmark might be slower with more threads?
