2.2 years ago
ngarber
I have a dataframe that contains, for each row, a short sequence from a human protein, as well as the homologous full sequence from a mouse protein. I'm using pairwise2 to align them and extract the mouse equivalent to the short sequence in the human protein.
Unfortunately, pairwise2 is very slow, so I would like to parallelize this process to speed it up, possibly with Dask or another multiprocessing library. How would I do that when each row needs several steps, as in the loop below?
for i in np.arange(len(data_df)):
    human_motif = data_df.at[i, "Human_Motif"]
    mouse_sequence = data_df.at[i, "Mouse_Sequence"]
    gap_start_penalty = -15
    gap_extend_penalty = -15
    alignments = pairwise2.align.globalxs(human_motif, mouse_sequence, gap_start_penalty, gap_extend_penalty)
    best_alignment_human = alignments[0][0]
    best_alignment_mouse = alignments[0][1]
    # Find index for when the gapless aligned human motif starts
    for j, char in enumerate(best_alignment_human):
        if char != "-":
            aligned_motif_start = j
            break
    mouse_motif = mouse_sequence[aligned_motif_start : aligned_motif_start + len(human_motif)]
    data_df.at[i, "Mouse_Motif"] = mouse_motif
What's the best way to parallelize this?
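One possible direction, as a minimal sketch rather than a definitive answer: since each row is independent, the loop body can be moved into a function that takes one (human motif, mouse sequence) pair, and that function can be mapped over all rows with a pool of worker processes. The sketch below uses the standard library's concurrent.futures instead of Dask. The function names (first_non_gap, find_mouse_motif, parallel_mouse_motifs) and the chunksize default are assumptions made for illustration; the alignment call and slicing logic are copied from the loop above, with one_alignment_only=True added because only the first alignment is used.

```python
from concurrent.futures import ProcessPoolExecutor


def first_non_gap(aligned_seq):
    """Return the index of the first non-gap character, or None if all gaps."""
    for j, char in enumerate(aligned_seq):
        if char != "-":
            return j
    return None


def find_mouse_motif(pair):
    """Process one (human_motif, mouse_sequence) pair, mirroring the loop body."""
    # Imported inside the function so each worker process loads Biopython itself.
    from Bio import pairwise2

    human_motif, mouse_sequence = pair
    # Same gap-open and gap-extend penalties (-15, -15) as in the question.
    alignments = pairwise2.align.globalxs(
        human_motif, mouse_sequence, -15, -15, one_alignment_only=True
    )
    start = first_non_gap(alignments[0][0])
    return mouse_sequence[start : start + len(human_motif)]


def parallel_mouse_motifs(pairs, max_workers=None, chunksize=100):
    """Map find_mouse_motif over all pairs using a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with the input rows.
        return list(pool.map(find_mouse_motif, pairs, chunksize=chunksize))


# Hypothetical usage with the dataframe from the question:
#
# if __name__ == "__main__":
#     pairs = list(zip(data_df["Human_Motif"], data_df["Mouse_Sequence"]))
#     data_df["Mouse_Motif"] = parallel_mouse_motifs(pairs)
```

The call should sit under an if __name__ == "__main__": guard, because on platforms that start workers by spawning a fresh interpreter the script is re-imported in every worker. Whether this actually helps depends on the number of rows and the sequence lengths, since process startup and pickling add overhead; a larger chunksize usually reduces that overhead when there are many short alignments.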