Question

Parallelization of Pairwise2 on Dataframe Rows?

2.9 years ago

ngarber ▴ 60

I have a dataframe in which each row contains a short sequence from a human protein and the full sequence of the homologous mouse protein. I'm using pairwise2 to align them and extract the mouse equivalent of the short human sequence.

Unfortunately, pairwise2 is very slow, so I would like to parallelize this process to speed it up, possibly with Dask or another multiprocessing platform. How would I do that for a multi-line operation as follows?

for i in np.arange(len(data_df)): 
    human_motif = data_df.at[i, "Human_Motif"]
    mouse_sequence = data_df.at[i, "Mouse_Sequence"]

    gap_start_penalty = -15
    gap_extend_penalty = -15
    alignments = pairwise2.align.globalxs(human_motif, mouse_sequence, gap_start_penalty, gap_extend_penalty)

    best_alignment_human = alignments[0][0]
    best_alignment_mouse = alignments[0][1]

    #Find index for when the gapless aligned human motif starts
    for j, char in enumerate(best_alignment_human): 
        if char != "-": 
            aligned_motif_start = j
            break

    mouse_motif = mouse_sequence[aligned_motif_start : aligned_motif_start + len(human_motif)]

    data_df.at[i, "Mouse_Motif"] = mouse_motif

What's the best way to parallelize this?

BioPython parallelization BLAST Python pairwise2 • 778 views

updated 2.9 years ago by zorbax ▴ 650 • written 2.9 years ago by ngarber ▴ 60

score 5 · Accepted Answer · 2022-09-02

You can use Pool from the multiprocessing module to spread the alignments across several processes:

import pandas as pd
from Bio import pairwise2
from multiprocessing import Pool

THREADS = 8  # number of worker processes (Pool starts processes, not threads)

def pairwise_alignment(df):
    alignments = []
    # Walk the two columns together, one (motif, sequence) pair per row
    for human_motif, mouse_sequence in zip(df["Human_Motif"], df["Mouse_Sequence"]):

        gap_start_penalty = -15
        gap_extend_penalty = -15
        result = pairwise2.align.globalxs(human_motif, mouse_sequence, gap_start_penalty, gap_extend_penalty)
        alignments.append(result)
    return alignments


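if __name__ == "__main__":
    # pool.map iterates over its second argument, and iterating over a dataframe
    # yields its column names, so split data_df into chunks of rows first.
    chunk_size = max(1, -(-len(data_df) // THREADS))  # ceiling division
    chunks = [data_df.iloc[i:i + chunk_size] for i in range(0, len(data_df), chunk_size)]

    with Pool(processes=THREADS) as pool:
        # One list of alignments per chunk, in the original row order
        pool_results = pool.map(pairwise_alignment, chunks)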
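
The pool only returns raw alignments, so each one still has to be turned into a mouse motif. Below is a minimal sketch of that step, assuming you pass it the two gapped strings of the best alignment (alignments[0][0] and alignments[0][1] in the question). The helper name extract_mouse_motif is made up for illustration. It slices the aligned mouse string rather than the original mouse_sequence, which differs from the question's code: if the alignment puts gaps into the mouse sequence, alignment columns and positions in mouse_sequence no longer line up.

```python
def extract_mouse_motif(aligned_human, aligned_mouse):
    """Return the mouse residues aligned to the span of the human motif.

    Both arguments are the gapped strings of a single alignment,
    so they must have the same length.
    """
    if len(aligned_human) != len(aligned_mouse):
        raise ValueError("aligned sequences must have the same length")

    # Alignment columns where the human motif has a residue, not a gap
    motif_columns = [i for i, char in enumerate(aligned_human) if char != "-"]
    if not motif_columns:
        return ""

    start, end = motif_columns[0], motif_columns[-1] + 1

    # Slice the *aligned* mouse string so the coordinates match, then drop gaps
    return aligned_mouse[start:end].replace("-", "")


# Hand-written toy alignment, not real protein data
aligned_human = "---PKL-R--"
aligned_mouse = "MSTPRLARQE"
print(extract_mouse_motif(aligned_human, aligned_mouse))  # prints PRLAR
```

Calling this inside the worker function and returning only the motif string, instead of the full list of alignments, should also reduce the amount of data sent back from each process.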