The [bfast] short-read aligner has been receiving a number of positive reviews.
We're giving it an initial try and are currently stuck on indexing the genomes. For mm9, we are using the 10 masks suggested in the documentation and running each with:
bfast index -d 1 -n 4 -f mm9.fa -A 0 -w 14 -m <mask> -i <num>
The full Python fabric script with all the commands is here.
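Since the fabric script drives bfast from Python, the per-mask indexing loop can be sketched roughly as follows. The mask strings shown are placeholders, not the real masks from the bfast documentation, and the helper names are mine:

```python
import subprocess

# Placeholder masks -- substitute the 10 masks recommended in the
# bfast documentation for mm9.
MASKS = ["1111111111111111111111", "111110100111110011111111111"]

def build_index_cmd(fasta, mask, num):
    """Build the bfast index command line for one mask."""
    return ["bfast", "index", "-d", "1", "-n", "4", "-f", fasta,
            "-A", "0", "-w", "14", "-m", mask, "-i", str(num)]

def index_genome(fasta="mm9.fa", masks=MASKS):
    """Run bfast index once per mask, numbering indexes from 1."""
    for i, mask in enumerate(masks, start=1):
        subprocess.check_call(build_index_cmd(fasta, mask, i))
```

Each mask gets its own index number via `-i`, matching the one-command-per-mask pattern above.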
Each mask takes about 5 hours to process and uses 12 GB of space, so a full index works out to 120 GB and 50 hours of processing time. Ideally, we'd index the 10+ genomes we use and share the indexes on Amazon, but that's over a terabyte of space and roughly 3 weeks of processing time, and double that if we make colorspace indexes available as well.
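The back-of-the-envelope arithmetic behind those figures can be written out as a small helper. This assumes the per-mask numbers from the post (about 5 hours and 12 GB per mask, 10 masks per genome); the function name is mine:

```python
# Per-mask costs observed in this run (assumptions from the post).
HOURS_PER_MASK = 5
GB_PER_MASK = 12
MASKS_PER_GENOME = 10

def index_cost(n_genomes, colorspace=False):
    """Return (total GB, total hours) for indexing n genomes.

    Building colorspace indexes alongside nucleotide-space ones
    doubles both figures.
    """
    factor = 2 if colorspace else 1
    gb = n_genomes * MASKS_PER_GENOME * GB_PER_MASK * factor
    hours = n_genomes * MASKS_PER_GENOME * HOURS_PER_MASK * factor
    return gb, hours
```

One genome comes out to (120 GB, 50 hours); ten genomes reach (1200 GB, 500 hours), i.e. about three weeks of wall-clock time if run serially.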
Is there a best practice everyone is using for improving the bfast space/time constraints? Can I get reasonable results with a smaller subset of masks? Am I missing parameters to improve processing time and compression? Any other tips from experienced bfast users?
Thanks, Istvan, for confirming that this is the expected behavior. You're right that this probably isn't the right application for bfast. I'll give SHRiMP a try.