Question

Best Practices For Genome Indexing With Bfast

14.4 years ago

Brad Chapman 9.7k

The bfast short read aligner has been receiving a number of positive reviews.

We're giving it an initial try and are currently stuck on indexing the genomes. For mm9, we are using the 10 masks suggested in the documentation and running each with:

bfast index -d 1 -n 4 -f mm9.fa -A 0 -w 14 -m <mask> -i <num>

The full Python fabric script with all the commands is here.
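As a rough illustration of the loop over masks (a minimal sketch, not the fabric script itself), the per-mask commands could be generated like this. The flags are copied from the command above; the function name and the two mask strings are placeholders made up for the example and should be replaced with the masks from the bfast documentation.

```python
import shlex


def build_index_commands(fasta, masks, space_flag="0"):
    """Return one `bfast index` argument list per mask, numbered from 1.

    The fixed flag values (-d 1, -n 4, -w 14) are copied from the command
    in the post; only the mask and the index number vary.
    """
    commands = []
    for num, mask in enumerate(masks, start=1):
        commands.append([
            "bfast", "index",
            "-d", "1",
            "-n", "4",
            "-f", fasta,
            "-A", space_flag,
            "-w", "14",
            "-m", mask,
            "-i", str(num),
        ])
    return commands


# Placeholder masks for illustration only (hypothetical values).
EXAMPLE_MASKS = ["1111111111111111111111", "111110100111110011111111111"]

for cmd in build_index_commands("mm9.fa", EXAMPLE_MASKS):
    # Print the shell-quoted command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Each argument list can be handed to `subprocess.run(cmd, check=True)` or to a job scheduler, one job per mask.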

Each mask takes about 5 hours to process and uses 12 GB of disk space, so a single genome needs 120 GB and 50 hours of processing time to index. Ideally, we'd index the 10+ genomes we use and share the results on Amazon, but that comes to more than a terabyte of space and about 3 weeks of processing time, and double that if we make colorspace indexes available as well.
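The arithmetic behind those totals can be written out as a small sketch. The per-mask figures come from the paragraph above; the assumptions are that masks are processed one after another and that "10+" genomes is taken as exactly 10.

```python
HOURS_PER_MASK = 5   # observed time per mask, from the post
GB_PER_MASK = 12     # observed disk usage per mask, from the post
MASKS = 10           # masks suggested in the documentation
GENOMES = 10         # "10+" in the post; 10 is the lower bound


def index_cost(genomes, masks=MASKS, colorspace=False):
    """Return (hours, gigabytes) for building all indexes serially."""
    factor = 2 if colorspace else 1  # colorspace indexes double the work
    n_indexes = genomes * masks * factor
    return n_indexes * HOURS_PER_MASK, n_indexes * GB_PER_MASK


hours, gb = index_cost(GENOMES)
print(f"{hours} h (~{hours / 24:.1f} days), {gb} GB")
# prints: 500 h (~20.8 days), 1200 GB
```

With `colorspace=True` the same call gives 1000 hours and 2400 GB.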

Is there a best practice everyone is using for improving the bfast space/time constraints? Can I get reasonable results with a smaller subset of masks? Am I missing parameters to improve processing time and compression? Any other tips from experienced bfast users?

alignment short aligner • 4.4k views

updated 6.2 years ago by Ram 44k • written 14.4 years ago by Brad Chapman 9.7k

Ram · Answer 1 · 2010-06-25

The 5 hour run time per index sounds about right.

It seems that bfast is strongly optimized for mapping speed, at the cost of an expensive one-time indexing step. That suits use cases where a single genome is indexed once and a lot of data is then mapped against it.

Your use case does not seem to fit the strengths of bfast, so a different method might be better suited.

For example, SHRiMP is a tool that does not need to create an index. It works equally well for color-space and letter-space data, although it is a lot slower than bfast. But if you have access to the cloud, you have access to lots of CPUs, so you could split your problem into hundreds of pieces, and for that SHRiMP might work out very well.
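Splitting the problem could look roughly like the sketch below, which cuts a stream of reads into fixed-size chunks that can each be mapped on a separate CPU. It assumes the reads are in FASTQ with exactly four lines per record, which the thread does not state; `chunk_fastq` is a name made up for this example.

```python
from itertools import islice


def chunk_fastq(lines, reads_per_chunk):
    """Yield lists of lines, each holding up to reads_per_chunk records.

    Assumes every record is exactly four lines (no wrapped sequences).
    """
    it = iter(lines)
    while True:
        chunk = list(islice(it, 4 * reads_per_chunk))
        if not chunk:
            return
        yield chunk


# Three toy reads, held in memory for illustration.
example = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "TTGA", "+", "IIII",
    "@r3", "GGCC", "+", "IIII",
]

chunks = list(chunk_fastq(example, reads_per_chunk=2))
print([len(c) // 4 for c in chunks])  # reads per chunk: [2, 1]
```

For real data, pass an open file handle instead of the list and write each chunk to its own file before submitting one mapping job per file.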

Ram · Answer 2 · 2010-12-02

BFast is very flexible, so you can tailor the indexes to your use case. If the reads are long, try specifying your own, longer seeds. A few long seeds take less time and space to index than a larger number of shorter seeds.

There is a utility in the butil/ directory of the source tree called btestindexes that will find the "optimal" set of indexes to use given specified constraints for accuracy, mismatches, and key size/width.

That said, I agree with @Istvan that it's always good to consider other options such as bwa, bowtie (if you don't care about indels), or gsnap.

Actually, if your 10+ genomes are just different mm9 individuals and can be represented as a SNP table, then they can be stored in a single gsnap index (see the section titled "SNP-tolerant alignment in GSNAP" in the README).