How can efficiently iterate, from Python, over long FASTQ records, and write them to file if some condition matches? E.g. if I want to go through the file, check the read ID for some property, and if it matches, serialize that entire entry to a new FASTQ file.
BioPython is very very slow on my system. I'd like to do this from Python but would not mind installing a C/C++ library with a Py interface.
To clarify the BioPython issue:
I open my FASTQ file (which is several GB in size), iterate through the records:
fastq_parser = SeqIO.parse(fastq_filename, "fastq")
for fastq_rec in fastq_parser:
# if some condition is met then write the FASTQ record to file
# ...
SeqIO.write(fastq_rec, output_file, "fastq")
Perhaps it'd be faster if I write all the records at the end? But then I'd have to accumulate them in memory... in any case, this is very slow.
EDIT: after profiling the culprit was SeqIO.write(). I believe SeqIO.parse() is slow too... but it's unbelievable how slow it all is, given the simplicity of FASTQ. It's too bad because I realy like the BIoPython interface. But the solution was to roll my own... thanks to all who posted.
Any recommendations?
thanks.
Heng Li is right to recommend learning about Python's generator functions - a slowdown is you calling SeqIO.write(...) many times instead of once (I see you say you profiled this). Heng Li is wrong about Biopython being slow because it copes with multi-line FASTQ, that may make a slight difference but is dwarfed by the object creation penalty of the unified SeqRecord based cross-file-format API (Bio.SeqIO). Biopython's low level string parser is pretty competitive.
Biopython is slow because it parses multi-line fastq and supposedly does error checking. It does more than you need. If you prefer the biopython interface, learn how to implement it (generator function). This will help you in the long run.
Biopython is slow because it parses multi-line fastq and supposedly checks input errors. It does more than what you need. If you prefer the biopython interface, learn how to implement it (generator function). This will help you in the long run.
Heng Li is right to recommend learning about Python's generator functions - the main slowdown is you calling SeqIO.write(...) many times instead of once. Heng Li is wrong about Biopython being slow because it copes with multi-line FASTQ, that may make a slight difference but is dwarfed by the object creation penalty of the unified SeqRecord based cross-file-format API (Bio.SeqIO). Biopython's low level string parser is pretty competitive.
yeah I tend to avoid biopython parsing for simple file types not just because it can be slow. I find myself a lot of times having to help other analyze stuff on their computers that doesn't have biopython installed. And I really think it's useful for beginners to learn simple parsing methods.
@Peter: parsing four-line fastq is over twice as fast. "Twice" is not slight difference.