I have 1000 FASTA files containing simulated reads, and I want to split each of these 1000 files into separate files (one per chromosome) for some downstream analysis. The header lines look like this: >chr1:startpos:endpos.
I wrote this code (https://gist.github.com/ethanagbaker/6e40c58127b7ca8b9242) in Python, and it works, but it is very slow (i.e., it has been running for more than 24 hours on a cluster). This seems like it should be a really quick task. Is there a better way to do this?
Or you can use the Fasta class and write your own script to do the same thing. Your code is slow because it opens a bunch of files inside a loop, and then opens (the same files?) again and reads them in full to get the sequences. Indexed file access will be orders of magnitude faster.
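A minimal sketch of the indexed approach, assuming the Fasta class here is pyfaidx.Fasta (the input and output file names are placeholders):

    # Sketch only: assumes pyfaidx's Fasta class; file names are placeholders.
    from pyfaidx import Fasta

    fa = Fasta("simulated_reads.fa")   # builds a .fai index once; random access after that
    handles = {}                       # one open handle per chromosome
    for name in fa.keys():             # record names look like chr1:startpos:endpos
        chrom = name.split(":")[0]
        if chrom not in handles:
            handles[chrom] = open(chrom + ".fa", "w")
        handles[chrom].write(">%s\n%s\n" % (name, str(fa[name])))
    for h in handles.values():
        h.close()

The index is built on the first pass and reused afterwards, so repeated runs over the same file never re-read the sequences.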
There are three reasons why your Python code runs so slowly (a sketch addressing all three follows the list):
1. You unnecessarily open and close each output file millions of times. Opening and closing files is very expensive. Most likely, you'll reduce the running time greatly if you open all the output files once at the beginning. I suggest keeping them in a dictionary, e.g. d = {(name1, chrom1): open(pwd + name1 + "_" + chrom1 + ".fa", 'w'), ...}. Then, in the for loop where you iterate over FASTA records, look up output = d[(name, chrom)] and write the record to that file with output.write(record.format('fasta')).
2. You're calling SeqIO.write(...) many times instead of once. Instead, use output.write(record.format('fasta')).
3. Parsing FASTA with Biopython checks for some input errors, and this takes time; it does more than you need in your case.
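A minimal sketch combining the three fixes; the input file name and the per-chromosome output naming are placeholders, and plain Python string handling stands in for Biopython to skip the extra checks:

    # Sketch only: opens each output file once, writes records with plain
    # file writes, and parses FASTA without Biopython.
    handles = {}

    def get_handle(chrom):
        # Open each per-chromosome output file exactly once, then reuse it.
        if chrom not in handles:
            handles[chrom] = open(chrom + ".fa", "w")
        return handles[chrom]

    with open("simulated_reads.fa") as fh:
        out = None
        for line in fh:
            if line.startswith(">"):
                # Header like >chr1:startpos:endpos -> chromosome is the first field.
                chrom = line[1:].split(":")[0]
                out = get_handle(chrom)
            out.write(line)       # header and sequence lines go to the same handle

    for h in handles.values():
        h.close()

This streams the input once, so the running time is bounded by a single read of each file plus the writes.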
Maybe you'd like to try the solutions in these posts: How To Split A Multiple Fasta and How to convert a long fasta-file into many separate single fasta sequences.
faSplit from Jim Kent @ UCSC. The link is for Linux, but an OS X version is available along with the source.
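For example, something like the following invocation (input and output names are placeholders) splits a file into one output per sequence name; since your records are named chr1:startpos:endpos, you would still need to group the results by chromosome prefix:

    # One output file per record name, written into outdir/
    faSplit byname simulated_reads.fa outdir/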