Efficiently Iterating Over Fastq Records From Python
5
4
Entering edit mode
13.1 years ago
User 9996 ▴ 840

How can efficiently iterate, from Python, over long FASTQ records, and write them to file if some condition matches? E.g. if I want to go through the file, check the read ID for some property, and if it matches, serialize that entire entry to a new FASTQ file.

BioPython is very very slow on my system. I'd like to do this from Python but would not mind installing a C/C++ library with a Py interface.

To clarify the BioPython issue:

I open my FASTQ file (which is several GB in size), iterate through the records:

fastq_parser = SeqIO.parse(fastq_filename, "fastq") 
for fastq_rec in fastq_parser:     
  # if some condition is met then write the FASTQ record to file
  # ...
  SeqIO.write(fastq_rec, output_file, "fastq")

Perhaps it'd be faster if I write all the records at the end? But then I'd have to accumulate them in memory... in any case, this is very slow.

EDIT: after profiling the culprit was SeqIO.write(). I believe SeqIO.parse() is slow too... but it's unbelievable how slow it all is, given the simplicity of FASTQ. It's too bad because I realy like the BIoPython interface. But the solution was to roll my own... thanks to all who posted.

Any recommendations?

thanks.

python fasta fastq biopython next-gen sequencing • 26k views
ADD COMMENT
1
Entering edit mode

Heng Li is right to recommend learning about Python's generator functions - a slowdown is you calling SeqIO.write(...) many times instead of once (I see you say you profiled this). Heng Li is wrong about Biopython being slow because it copes with multi-line FASTQ, that may make a slight difference but is dwarfed by the object creation penalty of the unified SeqRecord based cross-file-format API (Bio.SeqIO). Biopython's low level string parser is pretty competitive.

ADD REPLY
0
Entering edit mode

Biopython is slow because it parses multi-line fastq and supposedly does error checking. It does more than you need. If you prefer the biopython interface, learn how to implement it (generator function). This will help you in the long run.

ADD REPLY
0
Entering edit mode

Biopython is slow because it parses multi-line fastq and supposedly checks input errors. It does more than what you need. If you prefer the biopython interface, learn how to implement it (generator function). This will help you in the long run.

ADD REPLY
0
Entering edit mode

Heng Li is right to recommend learning about Python's generator functions - the main slowdown is you calling SeqIO.write(...) many times instead of once. Heng Li is wrong about Biopython being slow because it copes with multi-line FASTQ, that may make a slight difference but is dwarfed by the object creation penalty of the unified SeqRecord based cross-file-format API (Bio.SeqIO). Biopython's low level string parser is pretty competitive.

ADD REPLY
0
Entering edit mode

yeah I tend to avoid biopython parsing for simple file types not just because it can be slow. I find myself a lot of times having to help other analyze stuff on their computers that doesn't have biopython installed. And I really think it's useful for beginners to learn simple parsing methods.

ADD REPLY
0
Entering edit mode

@Peter: parsing four-line fastq is over twice as fast. "Twice" is not slight difference.

ADD REPLY
4
Entering edit mode
13.1 years ago
Peter 6.0k

Heng Li is right to recommend learning about generators - the main problem in your code is calling SeqIO.write(...) thousands of times instead of once.

If you want to use SeqRecord objects via SeqIO, then use a generator expression like this:

fastq_parser = SeqIO.parse(fastq_filename, "fastq") 
wanted = (rec for rec in fastq_parser if ...)
SeqIO.write(wanted, output_file, "fastq")

Or equivalently a generator function,

fastq_parser = SeqIO.parse(fastq_filename, "fastq") 
def my_filter(records):
    for rec in records:
        if ...:
            yield rec
SeqIO.write(my_filter(fastq_parser), output_file, "fastq")

There are examples of this kind of thing in the Biopython Tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html - e.g. "Filtering a sequence file" in the "Cookbook" chapter.

However, if you don't need the SeqRecord objects and the unified cross-file format API, you're paying a speed penalty in object creation. See http://news.open-bio.org/news/2009/09/biopython-fast-fastq/ for that.

ADD COMMENT
0
Entering edit mode

@Peter: you may read my answer to the following question: http://biostar.stackexchange.com/questions/10376/how-to-efficiently-parse-a-huge-fastq-file. For python and so on, parsing multi-line fastq is over twice as slow as parsing four-line fastq. Biopython is nearly 20 times as slow. In my opinion, it is necessary to provide an alternative parser that is purely optimized for parsing four-line fastq. From what I have heard, the inefficiency of the biopython parser has pushed many away.

ADD REPLY
0
Entering edit mode

@Peter: you may read my answer to the following question: http://biostar.stackexchange.com/questions/10376/how-to-efficiently-parse-a-huge-fastq-file. For python and so on, parsing multi-line fastq is over twice as slow as parsing four-line fastq. Biopython is nearly 20 times as slow. In my opinion, it is necessary to provide an alternative parser that is purely optimized for parsing four-line fastq. From what I have heard, the inefficiency of the biopython parser has pushed many away.

ADD REPLY
0
Entering edit mode

Is it faster to write a fastq file record by record or to make a list of records and then write the list of records?

ADD REPLY
2
Entering edit mode
13.1 years ago
lh3 33k

If you have four-line fastq, write your own parser. It is very easy and much faster than biopython. If you have multi-line fastq, use my readfq.py (with brentp's improvements).

Actually your question is a duplicate of this question, listed as the first "related question" in the sidebar. You should read my answer to that question.

Generally, biopython/bioperl are likely to trade efficiency for modularity and robustness. Frequently they are not the fastest implementations.

ADD COMMENT
2
Entering edit mode

It would be much faster than the high level Biopython parser, but have you compared that with the equivalent in Biopython, function FastqGeneralIterator which also returns tuples of three strings. See http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

ADD REPLY
1
Entering edit mode

Comparison added. FastqGeneralIterator is as fast as my readfq, but it parses fastq only (readfq seamlessly parses fasta). This means the biopython parser is still over twice as slow as a simplistic four-line fastq parser, and 6X as slow as a multi-line fasta/q C parser.

ADD REPLY
0
Entering edit mode

Comparison added. FastqGeneralIterator is as fast as my readfq, but it parses fastq only (readfq seamlessly parses fasta). This means the biopython parser is still over twice as slow as a simplistic four-line fastq parser, and 6X as slow as a C parser.

ADD REPLY
1
Entering edit mode
13.1 years ago

I have already experienced the SeqIO.parse() low speed...

The only thing I can advice is to avoid this function and to use the dict type. For instance, to read a (one line per seq) fasta file, I did like this... This goes really faster.

#### reading the fasta file
handle = open(fastafile)

seqs={}
while True:
    try:
        seq_id = handle.next().strip("\n")
        seq = handle.next().strip("\n")
        seqs[seq_id]=seq
    except StopIteration:
        break

handle.close()

This should be easy to adapt to a fastq.

ADD COMMENT
3
Entering edit mode

If you want FASTQ files parsed as strings in Biopython, use FastqGeneralIterator - see http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

ADD REPLY
1
Entering edit mode
7.6 years ago

In this recent bioinformatics SE answer, I tried the new (as of June 2017) python API for GATB. It contains a fasta/fastq parser module called "Bank" that seems quite fast.

For instance, to only print reads longer than 60 nucleotides:

from gatb import Bank
fastq_parser = Bank(fastq_filename)
for seq in fastq_parser:
    if len(seq) >= 60:
        print("@{name}\n{seq}\n+\n{qual}".format(
            name=seq.comment.decode("utf-8"),
            seq=seq.sequence.decode("utf-8"),
            qual=seq.quality.decode("utf-8")))

See https://github.com/GATB/pyGATB/wiki/Bank-API for examples of how to access sequence features.

ADD COMMENT
0
Entering edit mode
6.5 years ago
hurfdurf ▴ 490

Yet another option, but python 3.5+ only:

https://github.com/lgautier/fastq-and-furious

ADD COMMENT

Login before adding your answer.

Traffic: 3071 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6