Hi! I know this question is very Python-heavy and I have already posted it on Stack Overflow, but since it's a bioinformatics problem, I wanted to ask the biologists for suggestions. I have a Python 3 script that uses subprocess.call to run MACSE to align CDSs in about 2,300 input files in a directory, and there are two output files for each CDS file. I have these two outputs (nucleotide and AA alignment files) going into two different directories. As smart as MACSE is, it is slow, and the script can take more than a month to finish aligning all 2,300 CDS files. I would like to learn how to multiprocess my script so several files can be processed at the same time using multiple cores. None of the input and output files should interact with each other, so each core will be doing its own independent task. I have been reading about the multiprocessing library in Python, but it might be too advanced for me to understand. Below is the script if anyone has suggestions. Thanks so much!
The Script:
import os
import subprocess
import argparse

parser = argparse.ArgumentParser(description="Script aligns CDS files.")
parser.add_argument('--root', default="~/testing_macse/", help="PATH to input dir.")
parser.add_argument('--align_NT_dir', default="~/testing_macse/NT_aligned/", help="PATH to the output directory for NT aligned CDS orthogroup files.")
parser.add_argument('--align_AA_dir', default="~/testing_macse/AA_aligned/", help="PATH to the output directory for AA aligned CDS orthogroup files.")
args = parser.parse_args()

def runMACSE(input_file, NT_output_file, AA_output_file):
    MACSE_command = "java -jar ~/bin/MACSE/macse_v1.01b.jar "
    MACSE_command += "-prog alignSequences "
    MACSE_command += "-seq {0} -out_NT {1} -out_AA {2}".format(input_file, NT_output_file, AA_output_file)
    # print(MACSE_command)
    subprocess.call(MACSE_command, shell=True)

# Expand "~" explicitly: os.listdir and os.makedirs do not do it themselves.
Orig_file_dir = os.path.expanduser(args.root)
NT_align_file_dir = os.path.expanduser(args.align_NT_dir)
AA_align_file_dir = os.path.expanduser(args.align_AA_dir)

# exist_ok=True creates whichever directory is missing and does not fail if one already exists.
os.makedirs(NT_align_file_dir, exist_ok=True)
os.makedirs(AA_align_file_dir, exist_ok=True)

# Align every .fa file in the input directory, one after another.
for currentFile in os.listdir(Orig_file_dir):
    if currentFile.endswith(".fa"):
        runMACSE(os.path.join(Orig_file_dir, currentFile),
                 os.path.join(NT_align_file_dir, currentFile[:-3] + "_NT_aligned.fa"),
                 os.path.join(AA_align_file_dir, currentFile[:-3] + "_AA_aligned.fa"))
You might consider a workflow system like snakemake for this type of thing, particularly if you have access to a compute cluster.
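If a workflow system is more than you need, the standard library may be enough here, since every MACSE run is independent of the others. Below is a minimal sketch, not a tested pipeline: it assumes the same jar path and file naming as the script in the question, and the helper names (`build_command`, `collect_jobs`, `run_all`) and the default of 4 workers are made up for illustration. It uses a thread pool rather than the multiprocessing module because each worker only launches an external Java process and waits for it, so the heavy lifting already happens outside Python.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Assumption: same jar location as in the question's script.
MACSE_JAR = os.path.expanduser("~/bin/MACSE/macse_v1.01b.jar")


def build_command(input_file, nt_output_file, aa_output_file):
    """Return the MACSE call as an argument list, so no shell is needed."""
    return [
        "java", "-jar", MACSE_JAR,
        "-prog", "alignSequences",
        "-seq", input_file,
        "-out_NT", nt_output_file,
        "-out_AA", aa_output_file,
    ]


def collect_jobs(root, nt_dir, aa_dir):
    """Build one command per .fa file found directly in root."""
    jobs = []
    for name in sorted(os.listdir(root)):
        if name.endswith(".fa"):
            stem = name[:-3]  # drop the ".fa" suffix
            jobs.append(build_command(
                os.path.join(root, name),
                os.path.join(nt_dir, stem + "_NT_aligned.fa"),
                os.path.join(aa_dir, stem + "_AA_aligned.fa"),
            ))
    return jobs


def run_all(jobs, workers=4, runner=subprocess.call):
    """Run the commands with at most `workers` active at once.

    Returns the exit codes in the same order as `jobs`.
    `runner` is only a parameter so the function can be tried without MACSE.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, jobs))
```

After creating the two output directories as in the original script, something like `run_all(collect_jobs(root, nt_dir, aa_dir), workers=8)` would keep up to eight alignments running at a time. The worker count is a guess you would tune to the number of cores and the memory each Java process needs. Checking the returned exit codes for non-zero values is a simple way to spot files that failed.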