Hi,
I have a long vector of AA sequences (~10,000) and I need to compute the score of the global sequence alignment between all possible pairs. This means ~10 million pairs, which shrink down to ~5 million because of redundancy.
My question is whether there's a tool/package (preferably R, but Python will also work) that does this efficiently?
What have you tried so far? Was pairwiseAlignment (Biostrings) too slow?

R's Biostrings pairwiseAlignment is what I've tried so far, but it is impractical for the scale of operations I'm looking at: ~5M pairwise alignments.
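(For reference, the serial approach described here would look roughly like the sketch below; the toy sequences, the BLOSUM62 matrix, and the default gap penalties are placeholders rather than the poster's actual data or settings.)

    ## Sketch: brute-force scoring of every unique pair with a global
    ## (Needleman-Wunsch) alignment in Biostrings. Toy data only.
    library(Biostrings)

    seqs <- AAStringSet(c(s1 = "MKTAYIAKQR", s2 = "MKTAHIAKQR", s3 = "MPTAYIAKQL"))

    pairs <- combn(seq_along(seqs), 2)   # all unique index pairs (i < j)

    scores <- apply(pairs, 2, function(ij)
      pairwiseAlignment(seqs[[ij[1]]], seqs[[ij[2]]],
                        type = "global",
                        substitutionMatrix = "BLOSUM62",
                        scoreOnly = TRUE))   # return only the score, skip the traceback

    result <- data.frame(i = pairs[1, ], j = pairs[2, ], score = scores)

scoreOnly = TRUE avoids building the full alignment object, which already helps, but with ~5M calls the per-pair overhead can still dominate.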
OK, in this case, I think you have two options:

1) Parallelization - if you have access to a multi-core machine, or a computer cluster, then you can divide the work into multiple processes or jobs. The exact way to do this will depend on the infrastructure on which you work; a single-machine sketch follows this list.
2) Give up on some of the alignments. The question is - do you actually need all pairwise alignments? What if you just run an "all vs. all" BLAST, so each sequence finds its top matches? You didn't say what you're trying to do exactly, so I don't know if this is an option.
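(A minimal sketch of option 1 on a single multi-core machine, using the base parallel package to spread the pairs over cores; mclapply relies on forking, so on Windows a parLapply cluster would be needed instead. The toy data and object names are placeholders.)

    ## Sketch of option 1: split the unique pairs across local cores with
    ## the base 'parallel' package. The real AAStringSet would hold the
    ## ~10,000 sequences.
    library(Biostrings)
    library(parallel)

    seqs <- AAStringSet(c(s1 = "MKTAYIAKQR", s2 = "MKTAHIAKQR",
                          s3 = "MPTAYIAKQL", s4 = "MKSAYIAKQR"))

    pairs     <- combn(seq_along(seqs), 2)     # 2 x n_pairs matrix of indices
    pair_list <- split(pairs, col(pairs))      # one c(i, j) vector per pair

    score_pair <- function(ij)
      pairwiseAlignment(seqs[[ij[1]]], seqs[[ij[2]]],
                        type = "global",
                        substitutionMatrix = "BLOSUM62",
                        scoreOnly = TRUE)

    n_cores <- max(1L, detectCores() - 1L)     # leave one core free
    scores  <- unlist(mclapply(pair_list, score_pair, mc.cores = n_cores))

    result <- data.frame(i = pairs[1, ], j = pairs[2, ], score = scores)

On a cluster the same split applies: chunk pair_list into a few hundred pieces, submit one job per chunk, and concatenate the per-chunk score tables afterwards.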
Parallelization is the answer, along with hashing. There are a couple of Python packages that might be suitable for this (e.g., scirpy and tcrdist), but I need to see how to use them for my purpose.
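(One way to read the "hashing" part - this is an interpretation, not necessarily what the poster meant - is to collapse duplicate sequences first, so each distinct pair of sequences is aligned only once and the scores for the original entries are looked up afterwards. Toy data again.)

    ## Sketch of deduplication before aligning: only distinct sequences are
    ## compared, and each original entry maps back to its unique representative.
    library(Biostrings)

    raw  <- c("MKTAYIAKQR", "MKTAYIAKQR", "MPTAYIAKQL")   # note the duplicate
    uniq <- unique(raw)
    idx  <- match(raw, uniq)            # original entry -> unique sequence id

    useqs <- AAStringSet(uniq)
    pairs <- combn(seq_along(useqs), 2)

    scores <- apply(pairs, 2, function(ij)
      pairwiseAlignment(useqs[[ij[1]]], useqs[[ij[2]]],
                        type = "global",
                        substitutionMatrix = "BLOSUM62",
                        scoreOnly = TRUE))

    ## Symmetric score matrix over unique sequences; the score between the
    ## original entries a and b is then score_mat[idx[a], idx[b]].
    score_mat <- matrix(NA_real_, length(uniq), length(uniq))
    score_mat[t(pairs)]                      <- scores
    score_mat[t(pairs[2:1, , drop = FALSE])] <- scores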