How To Efficiently Parse A Huge Fastq File?
14
56
Entering edit mode
13.3 years ago
Panos ★ 1.8k

I have a 19GB fastq file (~70 million reads) and I want to find ~10,000 reads (by looking for their name) and pull out their sequence and quality. Looping through the file 10,000 times is obviously not a solution as it would easily take quite a few days!

By searching the net, I found cdbfasta that couldn't, however, index such a big file because the index file got really big and the program crashed. Other than this, there's also a BioPerl module (Bio::Index::fastq) which I didn't try because I read somewhere that you can run into the same problem when dealing with huge fastq files.

How do you usually deal with such huge fastq files?

fastq next-gen-sequencing • 63k views
ADD COMMENT
2
Entering edit mode

same as before... python :) my 2 cents;) btw: check how many times perl implementation take than wc -l, as python implementation takes ~2x longer, c++ ~8x longer, python using list 140x longer. so for 70mln reads the differences might be in hours:P

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

Thanks all of you guys for the answers and for your time! I think I'll go for the Perl-based solution (since I don't know Python or C++). Generally speaking, if instead of having just 10K reads to look for, I had 100K or millions of them, what would be the best solution?

ADD REPLY
0
Entering edit mode

I have a visual app (C: Efficiently process (view, analize, clip ends, convert, demultiplex, dereplicate) that shows all reads in a FastQ file. It takes about 100 seconds to open a 0.5GB file (3.3 mil reads) on an old Toshiba laptop (but when you open the file next time, it takes below 1 sec to open it). I would like to test it on a 19GB file, but I don't have such file.

So, is this fast or slow? I couldn't find any real numbers on this forum to compare mine.

ADD REPLY
0
Entering edit mode

waiting 100 seconds in a graphical user interface just to open a file is too long, a common mistake that people make is that they think that one would need to process the entire file to characterize it

read in 100k lines and report on that, then in the background continue the loading and update the results

ADD REPLY
0
Entering edit mode

You need 100 seconds when you open the file for the first time. Then is less than 1 sec. Plus, that no optimizations have been applied yet. I hope I can cut the time in half.

Anyway, your idea is great. I will do that.

ADD REPLY
0
Entering edit mode

The time is now, after some optimizations, 23.6 seconds (for the same file). Can I post this app here (in a new thread) so I can get feedback? There are any BioStars rules against this?

ADD REPLY
0
Entering edit mode

Sure just make it a Forum post

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

You can always concatenate files downloaded from SRA or generate one with random sequences/qualities to obtain a file of any size.

ADD REPLY
54
Entering edit mode
13.2 years ago
lh3 33k

ADD COMMENT
3
Entering edit mode

The current Biopython example is using the SeqRecord format method which is not recommended for speed. The documentation and examples try to be clear about this. Instead, use Bio.SeqIO.write() as in which should be much faster.

Also, in an example like this where you don't need all the objects created, a low level solution as in would be about as good as I would expect using the Biopython parser. See also http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

If you assume 4 lines per sequence things are different.

ADD REPLY
2
Entering edit mode

lh3, can you please add this ruby implementation to the benchmark list?: https://gist.github.com/1154539

ADD REPLY
2
Entering edit mode

Added. Thanks a lot.

ADD REPLY
0
Entering edit mode

I have implemented a go version. Could you please add it to your benchmarks and update when you have a chance? Thanks!

ADD REPLY
2
Entering edit mode

Your Biopython example suffers from you using the sequence's format method. The docs do recommend you use SeqIO.write() instead. Something like this should be much faster:

from Bio import SeqIO
import sys

ids = set(x[:-1] for x in open(sys.argv[1]))
wanted = (rec for rec in SeqIO.parse(sys.stdin, "fastq") if rec.id in ids)
SeqIO.write(wanted, sys.stdout, "fastq")

Of course, if that's all you're doing there is no point making all those objects. See http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

ADD REPLY
0
Entering edit mode

very informative post. thx!

ADD REPLY
0
Entering edit mode

Here is a lua version: https://gist.github.com/1159195 If you add it to the list could you please mention what version of lua you use? Ruby's performance is horrible. Perhaps you used MRI1.8 ?

ADD REPLY
0
Entering edit mode

I used the latest "official" ruby compiled from source code.

ADD REPLY
0
Entering edit mode

I used the latest "official" ruby compiled from source code. I would not be surprised that ruby is slower. Perl is designed for text processing. My impression is also that for text processing, Perl is more efficient than other language implementations.

ADD REPLY
0
Entering edit mode

I tried to improve the speed of the c version by using mmap(2). I failed on my first attempt (https://gist.github.com/1168330). I think the reason is double. There is extra overhead introduced in the logic from the fact you just get a pointer to memory instead a file descriptor. In addition, reading from stdin is faster than reading directly from a file.

ADD REPLY
0
Entering edit mode

All hail the king of benchmarks!

ADD REPLY
0
Entering edit mode

Why the Java solution use so much more memory? Is it just because of JVM?

ADD REPLY
0
Entering edit mode

that's a good question! But that should be a comment of the related answer.

ADD REPLY
15
Entering edit mode
13.3 years ago
Leszek 4.2k

I would use python (no dependencies):
1. read you read names into list1 and change list to set (it's hashable, so checking for present of element is much faster than in list)
2. parse you fastq file and check if each name is present in set of read names of interest

#!/usr/bin/python
import sys

#get fname from parameter
idfile=sys.argv[1]

#load ids
ids = set( x.strip() for x in open(idfile) )

#read stdid 
handle=sys.stdin  

while ids:
  #parse fastq
  idline=handle.readline()
  seq   =handle.readline()
  spacer=handle.readline()
  quals =handle.readline()
  #check
  id=idline[:-1]
  if id in ids:
    #print fastq
    sys.stdout.write( '%s%s%s%s' % ( idline, seq, spacer, quals ) )
    #update ids
    ids.remove( id )

My python implementation is over 3x faster than c++ by Pierre. It's due to implementation of set (hash-able) instead of usual list.

#c++
time cat PL429_q10_filtered_1.fastq | ./parse PL429_q10_filtered_1.fastq.lids > PL429_q10_filtered_1.fastq.lids.fqcpp

real    4m21.137s
user    4m15.950s
sys 0m10.510s

#python (3.4x faster than c++)
time cat PL429_q10_filtered_1.fastq | python fastq_id_parser.py PL429_q10_filtered_1.fastq.lids > PL429_q10_filtered_1.fastq.fqSet

real    1m11.155s
user    1m5.270s
sys 0m10.920s

#python list instead of set (56x slower than python using set!)
time cat PL429_q10_filtered_1.fastq | python fastq_id_parser.noBioPython.noSet.py PL429_q10_filtered_1.fastq.lids > PL429_q10_filtered_1.fastq.lids.fqNoSet

real    70m54.534s
user    70m18.650s
sys 0m16.740s

#parsing file
time cat PL429_q10_filtered_1.fastq | wc -l
115959528

real    0m30.373s
user    0m1.990s
sys 0m8.250s

My test set was 10k: first and last 5k headers of the fq. Fq was 4.9GB, 115,959,528 lines, seq between 31-75bases. I used 15k SAS drive - read speed is ~200MB/s, access time ~5ms.

ADD COMMENT
8
Entering edit mode

@lh3: I was bored today so I figured out what's causing the C++ program to be slow, it seems to be an issue piping the output through cin. If I change the input to use an ifstream and use a file Pierre's C++ program is about 2x faster than the python program. If I replace the std::set with a boost::unordered_set the program ends up being about 4x faster than the python one.

ADD REPLY
3
Entering edit mode

3X ! I am terribly offended! :-)

ADD REPLY
0
Entering edit mode

Sorry Pierre! I was really surprised, as c++ is considered the fastest language. Do you know some hashed list implementation in c++? :)

ADD REPLY
0
Entering edit mode

Do you know the performance for a hashed list implementation in Perl? Best regards.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

David, provide me with a full perl script, so I can run it.

ADD REPLY
0
Entering edit mode

@ih3 I agree you . But I thought the STL would be enough for Biostar :-) !

ADD REPLY
0
Entering edit mode

@Leszek, python sets have a remove method so you dont need to convert to a list (then back to set). also, you can create the ids like: ids = set(x.strip() for x in open(idfile)

ADD REPLY
0
Entering edit mode

set.remove() is cool, brentp. I was not aware of that. it gains another 2-3 seconds;)

ADD REPLY
0
Entering edit mode

@Pierre. The problem is hash_set is sadly not part of STL and thus not portable even between gcc versions. IMHO, the major fault of STL is not to include hash_set/map. The SGI implementation is actually quite good. BTW, personally I do not quite buy boost and the new C++ spec.

ADD REPLY
0
Entering edit mode

@Pierre: I second that we should not make simple things too complicated. I just want to understand why C++ is much slower.

ADD REPLY
0
Entering edit mode

@Pierre: I second that we should not make simple things too complicated. I just want to understand why C++ is much slower. BTW, I am not a fan of boost and the latest C++ spec.

ADD REPLY
0
Entering edit mode

@GWW: Thanks. The stdin issue is an interesting one. You may well be correct. When I tried C++'s getline() last time, I indeed fed input via stdin.

ADD REPLY
0
Entering edit mode

@lh3: nice work! if you are still boring... what about changing the loop condition to "foreach([?])" so to skip the "names.erase(r)" in every cicle of the loop and clear the set before the "return 0" ?

ADD REPLY
0
Entering edit mode

@lh3: nice work! if you are still boring... what about changing the loop condition so to skip the "names.erase(r)" in every cicle of the loop and clear the set only before the "return 0" ? It would be interesting to see if we gain more performance from skipping an update-index operation or from faster data-retrieve due to decreasing size of index tree.

ADD REPLY
0
Entering edit mode

@Pierre and @lh3 the STL is enough. std::tr1::unordered_map std::tr1::unordered_set

ADD REPLY
0
Entering edit mode

@ffcccc: GWW did the hard work. I was just guessing. Credit to him.

ADD REPLY
0
Entering edit mode

@lh give me please updated c++ code, so I can update the results. btw: I'm a bit sceptic about c++ being 4x faster than python, as simple cat is a bit over 2x faster;)

ADD REPLY
0
Entering edit mode

@Leszek: please see my answer below. @desaila: thanks for suggesting tr1. I almost forgot. More recent g++ has [?] in its standard library.

ADD REPLY
0
Entering edit mode

How do I run your python program? I have a file test.fastq.

ADD REPLY
0
Entering edit mode

What are differences between PL429_q10_filtered_1.fastq.lids and PL429_q10_filtered_1.fastq?

ADD REPLY
0
Entering edit mode

cat file.fastq | python fastq_id_parser.py file_with_ids.txt > file_with_selected_ids.fastq

ADD REPLY
12
Entering edit mode
13.3 years ago
brentp 24k

if you have only 10,000 reads you want to pull out, you can just iterate over the file once. For each record you check if the current read header is in the list (or hash) of 10,000. Something like this:

use Bio::SeqIO;
use Bio::Seq::Quality;

# read in 10K headers into a hash with value of 1 or whatever.
# you'll have to add your own code to do this.
my %headers = ();

$seqio  = Bio::SeqIO->new('-format'=>'fastq' , '-file'=>'some.fasq');

my $out_fastq = Bio::SeqIO->new( -format => 'fastq', '-file'=> 'subset.fastq');

while((my $rec = $seqio->next_seq())) {
    # if the current record doesn't exist in the 10,000, skip it.
    if(! $headers{$rec->display_id} ) { next; }
    # otherwise, write it to the output file.
    $out_fastq->write_seq($rec);
}
ADD COMMENT
1
Entering edit mode

wouldn't it be better to say if (! exists $headers{})?

ADD REPLY
1
Entering edit mode

you mean (! exists $headers{$rec->display_id}) ? That is more explicit, but it won't make a difference.

ADD REPLY
0
Entering edit mode

I took your advice brentp and started writing this Perl script. Weirdly, though, I discovered that BioPerl is VERY slow so I decided to do the file parsing "manually" (assuming there are only 4 lines per sequence) and it turned out to be A LOT faster (more than 60 times faster)!

ADD REPLY
9
Entering edit mode
13.3 years ago

Here is my C++ solution:

#include <iostream>
#include <set>
#include <string>
#include <fstream>
#include <cstdlib>
using namespace std;


int main(int argc,char **argv)
    {
    string line;
    set<string> names;
    /* reads your names */
    ifstream in(argv[1],ios::in);
    if(!in.is_open()) return EXIT_FAILURE;
    while(getline(in,line,'\n'))
        {
        names.insert(line);
        }
    in.close();
    /* reads the fastq */
    string name;
    string seq;
    string name2;
    string qual;
    set<string>::iterator r;
    /* assuming 4 lines / read */
    while(!names.empty())
        {
        if(!getline(cin,name,'\n')) break;
        if(!getline(cin,seq,'\n')) break;
        if(!getline(cin,name2,'\n')) break;
        if(!getline(cin,qual,'\n')) break;
        r=names.find(name);
        if(r==names.end()) continue;
        names.erase(r);//names are unique we don't need this one anymore
        cout << name << "\n" << seq << "\n" << name2 << "\n" << qual << endl;
        }
    return 0;
    }

Compile:

g++ -Wall -O3 -o parse parser.cpp

Example:

$ cat names.txt
@EAS54_6_R1_2_1_540_792
@EAS54_6_R1_2_1_443_348

$ gunzip -c my.fastq.gz 
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333

$ gunzip -c  my.fastq.gz | ./parse names.txt
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
ADD COMMENT
0
Entering edit mode

This is terrific!!

ADD REPLY
5
Entering edit mode
13.3 years ago
Adrian ▴ 700

If all you need to do is extract the sequences, you could also just use grep, which will likely be about as fast as writing your own parser.

You can tell it to print out the line after the read name matched (-A 1) and can specify the list of patterns to search for in a file.

ADD COMMENT
2
Entering edit mode

I have used this method successfully many times. Put the IDs, one per line, in a text file called 'ids.txt' with the '@' before them. Assume your sequences are in 'seq.fq' (assume reads are on one line!)

% grep -f ids.txt -F --mmap -A 3 seq.fq | grep -v '^--' > 10000.fq

Done!

ADD REPLY
0
Entering edit mode

Dear Torst,

When I am using this syntax grep -f tr1.txt -F --mmap -A 3 che.fastq | grep -v '^--' > 10000.fq, it is dumping entire file as it without extracting desire id's sequence, My ids look like

@SRR681003.7 SN603:5:1101:70.90:105.60 length=100
@SRR681003.1 SN603:5:1101:44.60:109.30 length=100
@SRR681003.8 SN603:5:1101:63.80:106.10 length=100

but when I am doing grep -e "@SRR681003.7 SN603:5:1101:70.90:105.60 length=100" -A 3 che.fastq > output.fastq it do perfectly good. How to do -e "@SRR681003.7 SN603:5:1101:70.90:105.60 length=100 this part for id.txt please help me out.

ADD REPLY
0
Entering edit mode

This works pretty well although I will say use this:

grep -f ids.txt -F -A 3 seq.fq | grep -v '^--$' > 10000.fq

This is because sometimes qaulity line can start with 2 -- and that would screw the results.

ADD REPLY
3
Entering edit mode
13.3 years ago
Marvin ▴ 900

This doesn't really answer your question, but anyway: WTF would anyone store 70M reads in a text file and then try to access it as if it was a database?

I would convert the humongous FastQ to BAM. The conversion is lossless, saves space, and BAM/SAM are in fact easier to parse than FastQ. I don't know of a tool that would index BAM to retrieve reads by name, but if people start asking for that functionality, maybe someone will write such a tool.

ADD COMMENT
2
Entering edit mode
12.4 years ago

Hi, I have written a fast parser for FastQ files. It's in Ruby but with a C extension underneath, so it's considerably faster then pure Ruby/Python/Perl code https://github.com/fstrozzi/bioruby-faster .

Look at the Wiki for simple examples on how to use it https://github.com/fstrozzi/bioruby-faster/wiki . It's easy and you get back an array for each entry in the FastQ file, with the complete sequence header (ID + comment), the sequence and quality values.

You can then parse the file once and just filter the sequence ID of each record by searching it against a Set of IDs using for example Ruby methods like Set#include?

my_ids = Set.new ["X,"Y","Z"]
fastq = Bio::Faster.new("sequences.fastq")
fastq.each_record do |sequence_header, sequence, quality|
     seq_id = sequence_header.split("\s").first
     puts sequence_header, sequence, quality if my_ids.include? seq_id
end
ADD COMMENT
2
Entering edit mode
12.4 years ago
lh3 33k

Sorry for a separate answer. It shows the table in my previous answer, which is malformated during the stackexchange transition. I cannot edit my old answer either as it exceeds the 5000 character limit. I will remove my this answer when I can reformat my old answer properly.

=========================================================================================
 Program        Lang       User  Sys   Mem/MB    Comments
-----------------------------------------------------------------------------------------
 cat            C           0.4  10.1     0.4    Unix cat
 wc -l          C           4.3  10.4     0.4    Unix wc -l
 fqextract.c    C          17.3  10.8     2.0    Hardcoded maximum line length: 4096
 Pierre-r2.cc   C++        19.5  12.2     3.8    Use tr1::unordered_set
 seqtk          C          19.9   8.9     4.1    Multi-line fastq; subregion; gzip input
 Pierre-r1.cc   C++        22.8  12.6     3.9    Replace cin with reading /dev/stdin
 wc             C          32.0  10.6     0.4    Unix wc (no command-line options)
 fqextract.py   Python     54.0   7.9     8.0    Same algorithm as fqextract.c
 fqextract.java Java       55.8   8.9   562.7    Same algorithm as fqextract.c
 fqextract.pl   Perl       71.5  10.5     5.5    Same algorithm as fqextract.c
 Leszek.py      Python     78.0  13.3     8.0    Infinite loop given improper input
 fqextract.lua  LuaJIT    109.6   8.6     6.1    Same algorithm as fqextract.c
 readfq.py      Python    122.9  11.3     8.1    Multi-line fastq
 bp-fqiter      Python    124.5   9.9    15.2    Multi-line fastq; biopython
 drio.rb        Ruby      130.1   6.9    10.0    Same algorithm as fqextract.c
 readfq.lua     Lua       189.5  11.5     6.0    Multi-line fastq
 readfq.pl      Perl      203.1  69.6     5.7    Multi-line fastq
 Pierre.cc      C++       323.6  13.3     3.9    Same algorithm as Leszek.py
 biopython.py   Python    928.8  34.2    15.6    Multi-line fastq
 bioperl.pl     Perl    >2400.0         >15.0    Multi-line fastq; killed
 grep -f -A     C       >2400.0        >414.0    Killed; possible false positives
=========================================================================================
ADD COMMENT
1
Entering edit mode
12.8 years ago
User 7691 ▴ 10

Did anyone try do it out with:

ADD COMMENT
0
Entering edit mode

Well these days there is seqkit written in Go, which is pretty fast (the author made benchmarks against seqtk and seqmagick, where seqkit mostly matches seqtk's speed while using somewhat more memory).

ADD REPLY
0
Entering edit mode
13.3 years ago
Sequencegeek ▴ 740

You could also use a binary search. I have a library in Python that will only require you to write a small function to adapt to your file type...

ADD COMMENT
0
Entering edit mode
12.7 years ago
Kevin ▴ 640

Just a thought. I think Map/reduce in Hadoop might be super fast in doing this. Conceptually

ADD COMMENT
0
Entering edit mode
12.0 years ago
Pals ★ 1.3k

Very simple solution using biopython:

def batch_iterator(iterator, batch_size:)
    entry = True 
    while entry :
        batch = []
        while len(batch) < batch_size :
            try :
                entry = iterator.next()
            except StopIteration :
                entry = None
            if entry is None :
                break
            batch.append(entry)
        if batch :
            yield batch

from Bio import SeqIO
record_iter = SeqIO.parse(open("filename.fastq"), "fastq")
for i, batch in enumerate(batch_iterator(record_iter, no.of records)) :
    filename = "group_%i.fastq" % (i+1)
    handle = open(filename, "w")
    count = SeqIO.write(batch, handle, "fastq")
    handle.close()
    print "Wrote %i records to %s" % (count, filename)

source: BiopythonWiki

ADD COMMENT
0
Entering edit mode
10.4 years ago
BioApps ▴ 800

> Other than this, there's also a BioPerl module (Bio::Index::fastq) which I didn't try because I read somewhere that you can run into the same problem when dealing with huge fastq files.

The idea is that you should never try to load that huge file in RAM. (you could do it if the file would be only half of installed amount of RAM). Just enumerate each record and see if it meets your criteria (the name you want). When you have found one record don't keep it in RAM. Write it to disk in a new file. An app built like this (C: Efficiently process (view, analize, clip ends, convert, demultiplex, dereplicate) would require way under 10MB or RAM (with GUI included in those 10MB).

If you still need this, I can build it for you.

ADD COMMENT
0
Entering edit mode

RAM is not the issue with cdbfasta or BioPerl's Bio::Index* modules. cdbfasta is hardcoded to only construct indices up to 4 GB, so it won't work with a lot of sequences. Using the Bio::Index modules with tens of millions of sequences is theoretically possible, but it just doesn't scale well, so your script would seemingly take forever to finish. Both methods construct indices on-disk, and to be fair, they were written more than 10 years ago and were very relevant/useful at the time. These are not good approaches for modern Illumina data sets though.

ADD REPLY
0
Entering edit mode
10.2 years ago
bstrs • 0

Another really fast Python implementation which can deal also with very huge input lists of reads ids (e.g. 200 million read ids) which do not fit at once in the memory. It is using also frozenset for speed/memory reasons.

https://code.google.com/p/fusioncatcher/source/browse/extract_short_reads.py

ADD COMMENT

Login before adding your answer.

Traffic: 2350 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6