Hi, I want to parse a big FASTA file (253,721 sequences) in order to extract one sequence per species present in my list (which contains 86,859 different species).
I wrote some code that works fine using Biopython's SeqIO and a regexp, but with this many sequences and species it would take days to run.
I started looking for a way to parallelize the work using multiprocessing in Python, but I can't really connect the examples (for example here) with my code.
Here is my non-parallelized code:
<pre>
#!/usr/bin/env python
from Bio import Entrez
from Bio import SeqIO
import subprocess
import itertools
import re

def read_names(file_to_open):
    name = []
    names = open(file_to_open, "r")
    for i in names:
        name.append(i.rstrip('\n'))
    name = set(name)
    names.close()
    return name

names = read_names("specie_to_ncbi.txt")
filename = "all_its.fa"
lowsize = 300
highsize = 1000

seqs = []
seq_ids = []

# first pass: count the records so progress can be reported
j = 0
for seq_record in SeqIO.parse(filename, "fasta"):
    j += 1
print "there are " + str(j) + " sequences in the fasta file"

# second pass: keep one sequence of acceptable length per species
i = 0
for seq_record in SeqIO.parse(filename, "fasta"):
    i += 1
    numb = str(round((float(i) / j) * 100, 2))  # float() avoids integer division
    print numb + " % of the sequences already computed"
    longueur = len(seq_record.seq)
    if lowsize < longueur < highsize:
        ident = seq_record.description
        for specie in names:
            if specie in str(ident):
                myseq = str(seq_record.seq)
                seq_ids.append(str(seq_record.id) + "|" + str(specie))
                seqs.append(myseq)
                names.remove(specie)  # each species is extracted only once
                break

ITS = open("very_simple_db_ITS.fa", "w")
for i, s in zip(seq_ids, seqs):
    ITS.write(">" + i + "\n")
    ITS.write(s + "\n")
ITS.close()
</pre>
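As an aside on this non-parallel version: most of its time goes into the inner loop over the ~86,859 names for every record. If the species name sits at a predictable position in the description (an assumption on my part; here I guess it is the 2nd and 3rd whitespace-separated fields, as in ">gi123 Homo sapiens ITS region"), a set lookup makes each record O(1) instead of a scan over all species. A minimal sketch with a hypothetical helper:
<pre>
from Bio import SeqIO

def extract_one_per_species(fasta_file, species, lowsize=300, highsize=1000):
    # species is a set of binomial names; membership testing is O(1) per record
    found = {}
    for rec in SeqIO.parse(fasta_file, "fasta"):
        if not (lowsize < len(rec.seq) < highsize):
            continue
        fields = rec.description.split()
        candidate = " ".join(fields[1:3])  # genus + epithet guess; adjust to your headers
        if candidate in species and candidate not in found:
            found[candidate] = (rec.id, str(rec.seq))
    return found
</pre>
Dropping the per-record species loop this way is usually a much bigger win than parallelizing the original quadratic scan.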
---
And here is my failed attempt to parallelize it using multiprocessing:
---
<pre>
#!/usr/bin/env python
from Bio import Entrez
from Bio import SeqIO
from multiprocessing import Pool
import subprocess
import itertools
import re

def read_names(file_to_open):
    name = []
    names = open(file_to_open, "r")
    for i in names:
        name.append(i.rstrip('\n'))
    name = set(name)
    names.close()
    return name

names = read_names("head_specie_to_ncbi.txt")
filename = "head_all_its.fa"
lowsize = 300
highsize = 1000

def calc(seq_record, specie_names):
    dictio = {}
    longueur = len(seq_record.seq)
    if lowsize < longueur < highsize:
        ident = seq_record.description
        for specie in specie_names:
            if specie in str(ident):
                myseq = str(seq_record.seq)
                seq_ids.append(str(seq_record.id) + "|" + str(specie))  # NB: seq_ids is never defined here
                dictio[str(ident)] = myseq
                break
    return dictio

if __name__ == '__main__':
    pool = Pool()
    # this is the call that fails
    mydict = pool.map(calc(SeqIO.parse(filename, "fasta"), names))
    print str(mydict)
</pre>
Edit: I reduced the computing time to 20 minutes by replacing the regex with Python's "word in string" membership test, but I'm still wondering about the multiprocessing...
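For reference, that speedup is expected: a plain membership test skips the regex machinery entirely. A quick way to see the difference on your own machine (the header string here is made up):
<pre>
import re
import timeit

ident = "gi|12345 Homo sapiens internal transcribed spacer"
specie = "Homo sapiens"

# plain substring membership, what the edit switched to
t_in = timeit.timeit(lambda: specie in ident, number=100000)
# regex search; slower even with the re module's internal pattern cache
t_re = timeit.timeit(lambda: re.search(specie, ident), number=100000)
print("in: %.4fs   re.search: %.4fs" % (t_in, t_re))
</pre>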
You might mention why it failed and with what exception. From what I see, you are passing a function call to the map method, which expects an uncalled function and a sequence over which to map that function.
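To make that concrete: pool.map wants the function itself plus an iterable, and since map passes exactly one item per call, the species set has to be bound separately, for instance with functools.partial. A minimal sketch of what the call could look like, reusing calc, names, and filename from the script above (untested against your data):
<pre>
from functools import partial
from multiprocessing import Pool
from Bio import SeqIO

if __name__ == '__main__':
    pool = Pool()
    records = SeqIO.parse(filename, "fasta")
    # pass calc itself, uncalled; partial pre-binds the species set
    results = pool.map(partial(calc, specie_names=names), records)
    pool.close()
    pool.join()
</pre>
Each record gets pickled over to a worker process, so if the parsing is I/O-bound the gain may be modest; pool.imap with a chunksize argument can reduce that overhead.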
Also be aware that 'bee' in 'beetle' will return True, as the str.__contains__ method does substring matching, and you probably are interested in word boundaries.
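For example, with re.escape and \b word boundaries (assuming the names never start or end with punctuation):
<pre>
import re

print('bee' in 'beetle')                         # True: plain substring match
pattern = re.compile(r'\b' + re.escape('bee') + r'\b')
print(bool(pattern.search('beetle')))            # False: word boundary respected
print(bool(pattern.search('the bee flew off')))  # True
</pre>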