Question

Massive Pairwise Comparison Using Biopython

1

Entering edit mode

13.5 years ago

Peter ▴ 10

Hi,

I have a data-set of ~7500 sequences, avg. length ~1700 bases. I need to perform pairwise analysis on the entire set. I have a biopython script to perform this analysis in parallel. My understanding is that the comparison will need to run on an MPI cluster. What are my options for doing this and where could I run the job?

Thanks,

Peter

biopython pairwise • 13k views

ADD COMMENT • link updated 10.8 years ago by Biostar 20 • written 13.5 years ago by Peter ▴ 10

3

Entering edit mode

What tool do you want to use to do the comparisons, or is it something you have written yourself?

ADD REPLY • link 13.5 years ago by Sean Davis 27k

0

Entering edit mode

Hi Peter, I actually work on exactly the same problem. I use mpi4py library and perform pairwise alignment using muscle, clustalw or mafft. My code runs on the cluster, though 2000 sequences takes about 5 hours on 40 cores. Yes you can also use pairwise2 from biopython. If you are still interested please me an email at mubrejnev@gmail.com, so that we can chat about it. Brejnev

ADD REPLY • link 11.6 years ago by mubrejnev • 0

score 5 · Answer 1 · 2011-09-11

As others have commented, you did not mention what types of "pairwise comparison" you are doing [EDIT: just notice that this is a very old question. Anyway...]. I do not know biopython enough about if the part you use is based on native machine code or written in Python. If for your task a pure python module takes most of CPU time, you should seek solutions in C/C++ or Java, which can be 10-100X faster for compute intensive tasks such as Smith-Waterman alignment. If you are actually using Smith-Waterman for comparison, SSE2-based implementation can give you a further 30X speed up, or >1000X speed up over a pure python implementation.

Think about that: you can use your laptop to finish a task that requires a moderate cluster.

EDIT: To support my claim, I did a naive benchmark. The take-home message is biopython might be 10000X slower than a good SW implementation. Sometimes choosing the right algorithm and/or implementation can make a huge difference.

The detailed result is here: [?] Input: human mitochondrial DNA (MT.fa) vs. 2000 150bp reads simulated from it (q.fa) =================================================================================================== Program Time (s) Mem (MB) Command-line

cross_match 0.28 104.280 cross_match q.fa MT.fa -matrix mat-blast # [not SW; time/2] ksw-s1 1.25 0.720 ksw -fs1 MT.fa q.fa # [no aln; query up to 255bp] ksw-s2 2.05 0.720 ksw -fs2 MT.fa q.fa # [SSE2; no aln; 2nd best] exonerate 4.64 14.736 exonerate -m affine:local -S n -d EDNABLAST -o -7 -e -2 q.fa MT.fa # [not SW; time/2] bwa-stdsw 29.66 0.976 bwa stdsw -f MT.fa q.fa # [buggy in rare cases] swat 45.58 5.264 swat q.fa MT.fa -M mat-blast -N 1 -raw # [custom matrix] water 144.52 24.696 water -as MT.fa -bs q.fa -da EDNABLAST -gapo 7 -gape 2 -ou 1 biopython ~22752.8 ~620.784 # [able to output multi opt aln but disabled] =================================================================================================== [?]

Here are some comments on each implementation:

cross_match from phrap-080730. Cross_match is not a standard SW implementation, but for intra-species alignment, it should give identical alignment to SW most of time. Cross_match aligns both strands. I halve its CPU time as others are configured to align the forward strand only. Cross_match also finds all the high-score hits, while others do not.
swat from phrap. This is largely believed to be the most efficient SW implementation before the invention of SSE2-based ones.
exonerate, non-exhaustive mode (not a standard SW).
bwa-stdsw. This is a component of BWA. It uses tricks inspired by swat. Stdsw implements a linear-space SW, but I think it is still buggy in very rare cases. The alignment may be slightly different from the optimal. This is actually the first data set where stdsw shows better performance than swat; swat is always faster for other data sets.
water from EMBOSS.
Biopython. By default, this implementation does more than others in this benchmark in that it finds all the optimal alignments having the same alignment score. I switched off this feature. The core routines are implemented in C so far as I can tell.
ksw. My experimental SSE2-based implementation. It computes score, the end coordinates of the optimal alignment and the 2nd best score. It does not give the final alignment or the start coordinates.

Biopython script

from Bio import pairwise2
from Bio import SeqIO
import sys
for t in SeqIO.parse(sys.argv[1], "fasta"):
    for q in SeqIO.parse(sys.argv[2], "fasta"):
        a = pairwise2.align.localms(t.seq, q.seq, 1, -3, -7, -2, one_alignment_only=1, score_only=1)
        print t.id, "\t", q.id, "\t", a

score 1 · Answer 2 · 2011-05-21

1

Entering edit mode

13.5 years ago

Adam ▴ 10

Amazon offers cluster computing services, here

It can get a little costly, so plan carefully. Good luck.

ADD COMMENT • link 13.5 years ago by Adam ▴ 10

1

Entering edit mode

Depending on what "pairwise analysis" is, this might not be needed. Anyhow, it's hard to judge from the question.

ADD REPLY • link 13.5 years ago by Michael Schubert ★ 7.1k

score 0 · Answer 3 · 2011-09-10

I have done something like this using PiCloud. I ran pairwise global alignment using the needle program in Emboss (controlled within Biopython) with about a million pairs of protein sequences.

Given a working script and a preconfigured environment to run it, you can invoke their cloud.map() function to achieve massive parallelization.

You may still have to adapt your function a little bit and you have to familiarize yourself with the system but it is a lot less hassle compared to doing this directly in Amazon EC2.

You may want to check them out: http://www.picloud.com/ and http://docs.picloud.com/