Hi,
I have a long vector of AA sequences (~10,000) and I need to compute the score of the global sequence alignment between all possible pairs. This means ~10 million pairs, which shrink down to ~5 million because of redundancy.
My question is whether there's a tool/package (preferably R, but Python will also work) that does this efficiently?
What have you tried so far? Was pairwiseAlignment (Biostrings) too slow?

R's Biostrings pairwiseAlignment is what I've tried so far, but it is impractical for the scale of operations I'm looking at: ~5M pairwise alignments.
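(For reference, the serial approach described here would look roughly like the sketch below; the toy sequences, the BLOSUM62 matrix, and the default gap penalties are placeholders rather than the poster's actual data or settings.)

    ## Sketch: brute-force scoring of every unique pair with a global
    ## (Needleman-Wunsch) alignment in Biostrings. Toy data only.
    library(Biostrings)

    seqs <- AAStringSet(c(s1 = "MKTAYIAKQR", s2 = "MKTAHIAKQR", s3 = "MPTAYIAKQL"))

    pairs <- combn(seq_along(seqs), 2)   # all unique index pairs (i < j)

    scores <- apply(pairs, 2, function(ij)
      pairwiseAlignment(seqs[[ij[1]]], seqs[[ij[2]]],
                        type = "global",
                        substitutionMatrix = "BLOSUM62",
                        scoreOnly = TRUE))   # return only the score, skip the traceback

    result <- data.frame(i = pairs[1, ], j = pairs[2, ], score = scores)

scoreOnly = TRUE avoids building the full alignment object, which already helps, but with ~5M calls the per-pair overhead can still dominate.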
OK, in this case, I think you have two options:

1) Parallelization - if you have access to a multi-core machine, or a computer cluster, then you can divide the work into multiple processes or jobs. The exact way to do this will depend on the infrastructure on which you work; a single-machine sketch follows this list.
2) Give up on some of the alignments. The question is - do you actually need all pairwise alignments? What if you just run an "all vs. all" BLAST, so each sequence finds its top matches? You didn't say what you're trying to do exactly, so I don't know if this is an option.
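(A minimal sketch of option 1 on a single multi-core machine, using the base parallel package to spread the pairs over cores; mclapply relies on forking, so on Windows a parLapply cluster would be needed instead. The toy data and object names are placeholders.)

    ## Sketch of option 1: split the unique pairs across local cores with
    ## the base 'parallel' package. The real AAStringSet would hold the
    ## ~10,000 sequences.
    library(Biostrings)
    library(parallel)

    seqs <- AAStringSet(c(s1 = "MKTAYIAKQR", s2 = "MKTAHIAKQR",
                          s3 = "MPTAYIAKQL", s4 = "MKSAYIAKQR"))

    pairs     <- combn(seq_along(seqs), 2)     # 2 x n_pairs matrix of indices
    pair_list <- split(pairs, col(pairs))      # one c(i, j) vector per pair

    score_pair <- function(ij)
      pairwiseAlignment(seqs[[ij[1]]], seqs[[ij[2]]],
                        type = "global",
                        substitutionMatrix = "BLOSUM62",
                        scoreOnly = TRUE)

    n_cores <- max(1L, detectCores() - 1L)     # leave one core free
    scores  <- unlist(mclapply(pair_list, score_pair, mc.cores = n_cores))

    result <- data.frame(i = pairs[1, ], j = pairs[2, ], score = scores)

On a cluster the same split applies: chunk pair_list into a few hundred pieces, submit one job per chunk, and concatenate the per-chunk score tables afterwards.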
Parallelization is the answer, along with hashing. There are a couple of Python packages that might be suitable for this (e.g., scirpy and tcrdist), but I need to see how to use them for my purpose.
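(One way to read the "hashing" part - this is an interpretation, not necessarily what the poster meant - is to collapse duplicate sequences first, so each distinct pair of sequences is aligned only once and the scores for the original entries are looked up afterwards. Toy data again.)

    ## Sketch of deduplication before aligning: only distinct sequences are
    ## compared, and each original entry maps back to its unique representative.
    library(Biostrings)

    raw  <- c("MKTAYIAKQR", "MKTAYIAKQR", "MPTAYIAKQL")   # note the duplicate
    uniq <- unique(raw)
    idx  <- match(raw, uniq)            # original entry -> unique sequence id

    useqs <- AAStringSet(uniq)
    pairs <- combn(seq_along(useqs), 2)

    scores <- apply(pairs, 2, function(ij)
      pairwiseAlignment(useqs[[ij[1]]], useqs[[ij[2]]],
                        type = "global",
                        substitutionMatrix = "BLOSUM62",
                        scoreOnly = TRUE))

    ## Symmetric score matrix over unique sequences; the score between the
    ## original entries a and b is then score_mat[idx[a], idx[b]].
    score_mat <- matrix(NA_real_, length(uniq), length(uniq))
    score_mat[t(pairs)]                      <- scores
    score_mat[t(pairs[2:1, , drop = FALSE])] <- scores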