I need to run an all-by-all BLASTp on a large dataset of ~2 million protein sequences.
I see that folks have taken two routes in the past. Some related posts are here at "Correct Method To Blast All-Vs-All With Ncbiblast & How To Speed It Up?" and elsewhere at http://seqanswers.com/forums/showthread.php?t=5752, among others.
Route 1: Split the input file into smaller chunks and run BLAST on each chunk
Route 2: Use a comparable tool such as the open-source mpiBLAST
Are these the only practical routes for large BLAST runs or are there other related / unrelated ways to go about it?
And finally:
Route 3: Is both splitting the input files AND using mpiBLAST a sound idea? If not, why not?
Thanks for your answers
I moved this from the forum since there is a clear question. In my opinion, use route 1. I did not see any major improvements with mpiBLAST, and it is more difficult to configure and use. Splitting the input and running BLAST on the chunks in parallel should be easy to implement on any system.
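For what it's worth, here is a rough Python sketch of route 1, not a tested pipeline. It assumes the BLAST database is built once from the full protein set (e.g. with makeblastdb -in all_proteins.fasta -dbtype prot -out all_proteins), so that each query chunk is still searched against everything. The file names, the chunk naming scheme, the chunk size and the helper functions are all made up for illustration; the blastp commands are only built here, not executed, so you can hand them to whatever scheduler or job array you have.

```python
import itertools
from pathlib import Path


def read_fasta_records(path):
    """Yield each FASTA record (header line plus sequence lines) as one string."""
    record = []
    with open(path) as handle:
        for line in handle:
            # A new header closes the record collected so far.
            if line.startswith(">") and record:
                yield "".join(record)
                record = []
            record.append(line)
    if record:
        yield "".join(record)


def split_fasta(path, out_dir, records_per_chunk):
    """Write consecutive groups of records to numbered chunk files; return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    records = read_fasta_records(path)
    chunk_paths = []
    while True:
        # Pull the next group of records without loading the whole file.
        chunk = list(itertools.islice(records, records_per_chunk))
        if not chunk:
            break
        # Hypothetical naming scheme: chunk_00000.fasta, chunk_00001.fasta, ...
        chunk_path = out_dir / f"chunk_{len(chunk_paths):05d}.fasta"
        chunk_path.write_text("".join(chunk))
        chunk_paths.append(chunk_path)
    return chunk_paths


def blastp_command(chunk_path, db_name, threads=1):
    """Build (but do not run) a blastp command for one chunk against the full database."""
    return [
        "blastp",
        "-query", str(chunk_path),
        "-db", db_name,              # assumed: database made from ALL sequences
        "-outfmt", "6",              # tabular output, easy to concatenate later
        "-num_threads", str(threads),
        "-out", str(chunk_path.with_suffix(".tsv")),
    ]
```

You would then do something like chunks = split_fasta("all_proteins.fasta", "chunks", 10000), submit one blastp_command(chunk, "all_proteins") per chunk as a separate job, and concatenate the resulting .tsv files at the end. The right chunk size depends on your cluster and is worth testing on a small subset first.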