This question is correctly answered, but I just want to make a performance point about file I/O.
Closing and opening files is much more expensive than even large writes, particularly in the middle of repeated writes. It costs time both for what it causes - flushing of all buffers (in the I/O library and the file system) and fsyncs - and for what it prevents, namely buffering and write coalescence. It's often advantageous to move such operations outside of tight loops, particularly if you're going to write to the same file again later (which isn't the case in this particular use, but often can be). The difference in performance shows up even on a laptop, and the results are even more dire on a large shared file system as one might see on a cluster. So consider the following toy example:
#!/usr/bin/env python
import os
import os.path

filename = "test.out"
teststring = "This is a test of file I/O.\n"
nwrites = 100

def with_closes():
    # Re-open and close the file around every single write
    if os.path.isfile(filename):
        os.remove(filename)
    for _ in range(nwrites):
        of = open(filename, "a")
        of.write(teststring)
        of.close()

def without_closes():
    # Open once, write repeatedly, close once at the end
    if os.path.isfile(filename):
        os.remove(filename)
    of = open(filename, "a")
    for _ in range(nwrites):
        of.write(teststring)
    of.close()

if __name__ == '__main__':
    import timeit
    print("With opens/closes interleaved:")
    print(timeit.timeit("__main__.with_closes()",
                        setup="import __main__",
                        number=50))
    print("Without opens/closes interleaved:")
    print(timeit.timeit("__main__.without_closes()",
                        setup="import __main__",
                        number=50))
Running this gives:
$ ./timings.py
With opens/closes interleaved:
9.02220511436
Without opens/closes interleaved:
0.398212909698
That is, interleaving the closes slowed things down by more than a factor of 20. (It's actually a bit worse than that, because the file deletion is included in both timings.) The penalty is worst for frequent small writes to the same file.
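As an aside, the idiomatic way to hoist the open and close out of the loop is a with block, which still opens and closes the file exactly once but also guarantees the close happens even if an exception is raised mid-loop. A minimal sketch, reusing the filename, teststring and nwrites variables from the toy script above:

def without_closes_with_block():
    # One open, many writes, one close - same as without_closes(),
    # but the with block closes the file automatically on exit.
    if os.path.isfile(filename):
        os.remove(filename)
    with open(filename, "a") as of:
        for _ in range(nwrites):
            of.write(teststring)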
For an assembled reference genome it's not as big an issue, since there are a comparatively small number of large sequences and each write is unique (that is, once we write to a file and close it, we're done with it), but still, every little bit helps. Since the number of files written is guaranteed to be modest, we can simply open each one as it's needed and close them all when everything's done:
#!/usr/bin/env python
from Bio import SeqIO
import sys

def filelookup(chrom, file_dict):
    # Open a new output file the first time we see this sequence ID,
    # then reuse the cached handle on later lookups
    if chrom not in file_dict:
        file_dict[chrom] = open("%s.fa" % chrom, "w")
    return file_dict[chrom]

def closefiles(file_dict):
    for open_file in file_dict.values():
        open_file.close()

if len(sys.argv) != 2:
    sys.exit("Usage: %s file.fa" % sys.argv[0])

files = {}
f = SeqIO.parse(sys.argv[1], "fasta")
for rec in f:
    of = filelookup(rec.id, files)
    SeqIO.write(rec, of, "fasta")
closefiles(files)
Running a per-record open/close version (open-close.py) against the handle-caching version above (cache-files.py) gives:
$ time ./open-close.py Homo_sapiens_assembly19.fasta
real 1m49.739s
user 1m8.400s
sys 0m4.212s
$ time ./cache-files.py Homo_sapiens_assembly19.fasta
real 1m19.610s
user 1m7.556s
sys 0m3.324s
So even in this case (with few large writes, and only one write per file) we save about 30% of the time (and note that user+sys time now comes pretty close to equalling real time; the remainder is time spent waiting for I/O).
If this is only done once, of course, it doesn't really matter, but if it becomes a regular part of a pipeline it starts to add up. And in the more general case, where you're splitting a file into separate components and there _will_ be multiple writes to the same output file, it can make a much more substantial difference; see the sketch below.
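To make that general case concrete, here's a toy sketch (hypothetical file names and format, not part of the original question) that splits a tab-delimited file on its first column, caching one open handle per key rather than reopening the output file for every line:

#!/usr/bin/env python
import sys

def split_by_first_column(infile):
    # Cache one output handle per key; each handle receives many writes
    # before everything is closed once at the end.
    handles = {}
    try:
        with open(infile) as f:
            for line in f:
                key = line.split("\t", 1)[0]
                if key not in handles:
                    handles[key] = open("%s.txt" % key, "w")
                handles[key].write(line)
    finally:
        for h in handles.values():
            h.close()

if __name__ == '__main__':
    split_by_first_column(sys.argv[1])

(If the number of distinct keys were very large you'd eventually hit the operating system's open-file limit and have to close and reopen handles, but for a modest number of outputs the simple cache is the win.)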
Isn't this question the same as https://www.biostars.org/p/105388/?
Showing similar posts dynamically while a question is being typed, as Stack Overflow does, would help avoid duplicate questions.