I'm looking to run a number of FASTA files through local BLAST simultaneously as part of a pipeline. I'm using Biopython to read in my input file and parse it for a specified number of sequences, e.g. 1000; if the file is larger, I batch it out into segments of 1000. I'm now looking for a way to run each batch file through local BLAST rather than one at a time, and then concatenate all the output files I receive for post-BLAST parsing by E-value.
def batch_iterator(iterator, batch_size):
    """Generator function for splitting large files into batches."""
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch
from Bio import SeqIO

# Count the sequences in the input file
counter = 0
for record in SeqIO.parse(Input_file, "fasta"):
    counter += 1

if counter > 10:  # If the input file has more than 10 seqs, batch it out
    record_iter = SeqIO.parse(Input_file, "fasta")
    for i, batch in enumerate(batch_iterator(record_iter, 10)):
        filename = "batch_%i.fasta" % (i + 1)
        with open(filename, "w") as handle:
            count = SeqIO.write(batch, handle, "fasta")
        print("Wrote %i records to %s" % (count, filename))
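For reference, the post-BLAST step I have in mind is filtering the concatenated tabular output by E-value. With -outfmt 6 the E-value is the 11th column, so a simple filter might look like this (the threshold is just an example):

```python
def filter_by_evalue(lines, max_evalue=1e-5):
    """Keep outfmt 6 rows whose E-value (11th column) is <= max_evalue."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            kept.append(line)
    return kept
```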
What would be the best way to automate this so I grab all my batch files and run them through local BLAST? Would I have to call
./blastp -db a_database -query queryfile.fasta -out blastoutput.tsv -outfmt 6
for each individual file using os.system (or similar) in my script, or is there a simpler way?
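Concretely, the one-at-a-time version I'm imagining looks something like this (the database name and the batch_*.fasta pattern are placeholders from my script above, and blastp is assumed to be on the PATH):

```python
import glob
import subprocess

def blastp_command(queryfile, database):
    """Build the blastp argument list for one batch file (tabular output)."""
    outfile = queryfile.rsplit(".", 1)[0] + ".tsv"
    return ["blastp", "-db", database, "-query", queryfile,
            "-out", outfile, "-outfmt", "6"]

# One blastp call per batch file, run sequentially
for queryfile in sorted(glob.glob("batch_*.fasta")):
    subprocess.run(blastp_command(queryfile, "a_database"), check=True)
```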
Is there any specific reason you want to batch this analysis (e.g. to run it in parallel on a compute cluster)? Otherwise it will be more efficient to run BLAST with one big input file.
I found this, in case it helps: https://gif.biotech.iastate.edu/running-blast-jobs-parallel
That link is actually where I started my query; I'm just wondering if there's a way to do it in Python rather than bash.
From 2008 (so it may be outdated): http://bpbio.blogspot.fr/2008/02/parallel-blasts-using-pythons-pp-module.html
You could use the multiprocessing Python module. Alternatively, use subprocess to pass blast commands to GNU Parallel at the command line. How you batch up the files before invoking either of these would be entirely up to you in the Python script.
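A rough sketch of the multiprocessing route, reusing the batch files from the question (a_database, the batch_*.fasta pattern, and the worker count are placeholders; blastp is assumed to be on the PATH):

```python
import glob
import subprocess
from multiprocessing import Pool

def output_name(queryfile):
    """Derive the per-batch output filename (batch_1.fasta -> batch_1.tsv)."""
    return queryfile.rsplit(".", 1)[0] + ".tsv"

def run_blastp(queryfile, database="a_database"):
    """Run one blastp job on a batch file; return its tabular output filename."""
    outfile = output_name(queryfile)
    subprocess.run(
        ["blastp", "-db", database, "-query", queryfile,
         "-out", outfile, "-outfmt", "6"],
        check=True,
    )
    return outfile

def blast_batches(pattern="batch_*.fasta", workers=4):
    """Run blastp on every batch file with `workers` concurrent processes,
    then concatenate the tabular outputs for post-BLAST E-value parsing."""
    batch_files = sorted(glob.glob(pattern))
    with Pool(processes=workers) as pool:
        outputs = pool.map(run_blastp, batch_files)
    with open("blast_all.tsv", "w") as combined:
        for out in outputs:
            with open(out) as fh:
                combined.write(fh.read())
```

Calling blast_batches() then replaces the per-file loop; pool.map blocks until every batch has finished, so blast_all.tsv is complete before the E-value parsing step starts.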