8.5 years ago
sor.hub.lennart
I'd like to know if you've any idea on how to BLAST hundreds of thousands of 10-mers to BLAST human reference genome. I've tried blast+ with short reads:
blastn -task "blastn-short" -query original2.fa -db database/GRCh38_full_analysis_set_plus_decoy_hla.fa
but standalone blast is simply too slow. I need a much faster algorithm, preferably in parallel, on the cloud.
I've tried ruffus as well, but it seems that ruffus needs to create a separate segment file for each k-mer, which isn't practical at this scale. See http://www.ruffus.org.uk/examples/bioinformatics/part1_code.html#examples-bioinformatics-part1-code
Any ideas?
What exactly are you trying to do? It may be easier to enumerate all 10-mers from the human genome and then compare them to your list.
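The enumerate-and-compare idea above can be sketched roughly as follows. This is only a minimal illustration, assuming exact matches on the forward strand and a sequence that fits in memory as a string; the function name `find_kmer_hits` and its parameters are made up for this example and do not come from any library.

```python
from collections import defaultdict


def find_kmer_hits(genome_seq, queries, k=10):
    """Return {kmer: [0-based start positions]} for query k-mers found in genome_seq.

    Hypothetical helper: exact matching only, forward strand only.
    Queries whose length is not k are ignored.
    """
    # Put the query k-mers in a set so each lookup is O(1) on average.
    wanted = {q.upper() for q in queries if len(q) == k}
    hits = defaultdict(list)
    seq = genome_seq.upper()
    # Slide a window of length k along the sequence and test membership.
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        if window in wanted:
            hits[window].append(start)
    return dict(hits)


if __name__ == "__main__":
    toy_genome = "ACGTACGTACGTTT"
    toy_queries = ["ACGT", "GTTT", "CCCC"]
    print(find_kmer_hits(toy_genome, toy_queries, k=4))
```

For a real genome you would run this per chromosome (reading sequences from the FASTA file) and probably also look up the reverse complement of each window; both are left out here to keep the sketch short.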
Perhaps I missed an earlier, less detailed version of the post so forgive me for asking, but what is unclear? From the post, "how to BLAST hundreds of thousands of 10-mers to BLAST [sic] human reference genome." sor.hub.lennart even showed the exact command and explained alternatives they have tried. I don't see how that could be more clear. This is not the approach I would use for the task, but the question itself and the task seem pretty obvious (simply comparing 10-mers to a reference).
My intent in asking that question was to see if @sor.hub.lennart is trying to find a solution for a problem that is not included in the original post (the method/commands being employed are clearly described). If there is indeed a larger question then there may be a different tool/software besides blast that would be the right answer.
If the task is as straightforward as it is written in the original post, then there is no good option but to parallelize the search by brute force: break the query into tens or hundreds of files and run the BLAST jobs in parallel. As far as I know, there are no free, truly parallel implementations of blast+ (I believe there may be one commercial parallel implementation, and there may be other options I do not know about).
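As a rough sketch of the split-and-run-in-parallel approach described above: the helper below distributes FASTA records across a fixed number of chunks, and a second helper builds one blastn command per chunk using the same task and database as the original post. The names `split_fasta_records` and `blast_command`, the chunk file names, and the choice of tabular output (`-outfmt 6`) are assumptions made for illustration.

```python
def split_fasta_records(lines, n_chunks):
    """Distribute FASTA records round-robin over n_chunks lists of lines.

    Hypothetical helper: a record is a '>' header line plus the sequence
    lines that follow it. Blank lines and text before the first header are skipped.
    """
    chunks = [[] for _ in range(n_chunks)]
    record_index = -1
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            record_index += 1
        if record_index < 0:
            continue  # nothing to attach this line to yet
        chunks[record_index % n_chunks].append(line)
    return chunks


def blast_command(query_path, db_path, out_path):
    """Build the argument list for one blastn-short job (the job is not run here)."""
    return [
        "blastn", "-task", "blastn-short",
        "-query", query_path,
        "-db", db_path,
        "-outfmt", "6",  # tabular output; an assumption, not from the original post
        "-out", out_path,
    ]


if __name__ == "__main__":
    records = [">q1", "ACGTACGTAC", ">q2", "TTTTGGGGCC", ">q3", "GATTACAGAT"]
    db = "database/GRCh38_full_analysis_set_plus_decoy_hla.fa"
    for i, chunk in enumerate(split_fasta_records(records, 2)):
        # In practice, write `chunk` to chunk_{i}.fa before launching the job.
        print(len(chunk) // 2, "records:", " ".join(
            blast_command(f"chunk_{i}.fa", db, f"chunk_{i}.tsv")))
```

Each command list could then be handed to `subprocess.run` from a `concurrent.futures.ThreadPoolExecutor`, or submitted as separate jobs on a cluster or cloud batch system, with the per-chunk outputs concatenated afterwards.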
Thanks for the response, I was just curious if I missed something. Quite often there is a direct answer to the technical question, but it is not the correct approach for the biological question. This one seemed pretty cut-and-dried though (on the surface, it is about solving a large-scale matching problem), so I had to see if there was something I was missing. If BLAST-like output is critical for the analysis or pipeline, I'd try to solve the problem of parallelizing the task; otherwise I'd use a k-mer based approach with an appropriate threshold. I'll add that as an answer if there's not more to it. Cheers.