Hi all,
I need to make an approximate prediction (in terms of days) on how long it would take to:
Align 150 samples, each consisting of 10 million 100 bp Illumina paired-end reads, to the human genome and run mpileup on each.
The machine I have at my disposal has 256 GB of RAM and four 6-core 3500 MHz processors, for a total of 24 cores. The storage is a locally attached RAID5 SCSI array.
Apologies if any pertinent details are missing - I am having to calculate these things in the absence of any practical experience!
Thanks in advance.
Any chance you can run 1 sample to use as a baseline? The good news is that you have plenty of memory and SCSI storage. A couple of tips that can boost the speed: 1. Load your indexes and reference into memory. 2. Create a RAID0 filesystem for BFAST to use as scratch space.
Are those 150 samples barcoded, or do you have to process them one by one? I won't be able to give you timings myself, but knowing this will surely help anyone who can. Also, knowing the processors' speed and the type of storage you have would definitely be a great help.
My suggestion would be to run 1 million reads from one of your samples through your pipeline, then multiply the time by 1500 to get a ballpark estimate (150 samples x 10 million reads = 1.5 billion reads). Besides giving you timing estimates, I find this helps with debugging any issues, so that the actual run can be hands-off.
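To make that extrapolation concrete, here is a minimal sketch in Python. It assumes runtime scales linearly with the number of reads and that samples are processed one after another. The function name and the 600-second pilot time are hypothetical placeholders, not measurements.

```python
SECONDS_PER_DAY = 86_400

def estimate_total_days(pilot_seconds, pilot_reads=1_000_000,
                        reads_per_sample=10_000_000, n_samples=150):
    """Scale a pilot run's wall-clock time up to the full data set.

    Assumes runtime grows linearly with read count (an approximation;
    fixed costs such as index loading are not modelled separately).
    """
    scale = (reads_per_sample * n_samples) / pilot_reads  # 1500 with the defaults
    return pilot_seconds * scale / SECONDS_PER_DAY

# Hypothetical pilot: 1 million reads took 600 s end to end.
print(round(estimate_total_days(600), 2))  # 10.42
```

If several samples can run side by side, dividing the result by the number of concurrent jobs gives a rough lower bound, although I/O contention will usually make the real figure worse.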
Thanks guys. I have also added the further detail to the question.
Also - I cannot perform any tests at the minute as I am awaiting admin permission to deploy new software on our server!
Thanks for the tips, but regarding testing one sample, the answer is just above your question :)
Have you decided on BFAST, or might you use another aligner? The thing is that BFAST can work in different modes of precision, e.g. depending on the number of genome indexes used. This makes a big difference to the memory used, and therefore to the number of processes that can be run in parallel on your machine.
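As a rough illustration of that memory/parallelism trade-off, the sketch below computes how many alignment jobs fit on the machine at once. The per-job memory figures are hypothetical placeholders, not measured BFAST requirements, and the function name is made up for this example.

```python
def max_parallel_jobs(total_ram_gb, ram_per_job_gb, total_cores, threads_per_job=1):
    """Return how many jobs can run at once, limited by RAM and by cores."""
    by_memory = int(total_ram_gb // ram_per_job_gb)  # jobs that fit in RAM
    by_cores = total_cores // threads_per_job        # jobs that fit on the CPUs
    return min(by_memory, by_cores)

# 256 GB RAM and 24 cores as in the question; per-job memory is a placeholder.
print(max_parallel_jobs(256, 20, 24))  # 12: memory is the limit
print(max_parallel_jobs(256, 8, 24))   # 24: cores are the limit
```

Plugging in the memory footprint observed in a pilot run shows whether RAM or core count caps the number of concurrent processes.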
I had basically decided to go with BFAST for this one. The purpose is SNP/indel discovery, so reasonably high accuracy is required.