Ngs - Huge (Fastq) File Parsing - Which Language For Good Efficiency ?
9
10
Entering edit mode
13.8 years ago
toni ★ 2.2k

Dear Biostar members,

I am involved in analyzing data from NGS technologies. As you may know, raw data files are in fastq format. And the fact is that these files are huge (several gigabytes each).

As a first step, we naturally want to make some quality controls of the sequencing step. My team and I are aware of some tools like Picard or bamtools that give some information and statistics, but they do not allow to extract all the information we want. So, at some point, I am afraid we will have to parse fastq files ourselves.

Which programming language would you choose to do the job (ie : read and process a 10GB file) ? Obviously, the main goal is to make the processing as fast as possible.

  1. awk/sed
  2. C
  3. Perl
  4. Java
  5. Python

Do you think that we will be I/O bound anyway, so using a language or another makes no big difference ?

Would you read the file line by line or load chunks of hundreds of lines (for instance) ?

Many thanks. T.

next-gen sequencing parsing fastq programming • 28k views
ADD COMMENT
4
Entering edit mode
  1. Haskell!

Here's converting a 10G fastq file to fasta:

ghc -e 'Bio.Sequence.readIllumina "100903_s_8_1_seq_GLW-4.txt" >>= Bio.Sequence.writeFasta "/dev/null"'

On my system, a 10G file took 2m20s using 67% CPU, so parsing isn't terribly CPU intensive, and using a lower level language will make it even less of a problem.

ADD REPLY
1
Entering edit mode

Hi. You forgot to mention something important. Your CPU. Without this info you cannot tell if 2 minutes is a slow or fast! Also, since your program takes 67% I conclude the program is multi-threaded. But we don't know how many cores you CPU have.

ADD REPLY
1
Entering edit mode

If the program takes 67% of one cpu core then it's more reasonable to assume that it's IO bound, and if any assumption is to be made about number of threads from this we would say it's a single thread that is waiting 33% of the time.

ADD REPLY
2
Entering edit mode

Have you checked out FastQC to make sure you aren't reinventing the wheel? http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

ADD REPLY
1
Entering edit mode

Make sure that you do everything with compression (gzip, etc.) when possible. This reduces both network bandwidth and disk IO. If you are IO bound, doing so can actually make the task faster in some cases.

ADD REPLY
1
Entering edit mode

I think it's time for another programming challenge

ADD REPLY
0
Entering edit mode
  1. Haskell:

Heres converting a 10G fastq file to Fasta:

% time ghc -e 'Bio.Sequence.readIllumina "100903_s_8_1_seq_GLW-4.txt" >>= Bio.Sequence.writeFasta "/dev/null"'
ghc -e   84.68s user 9.93s system 67% cpu 2:20.56 total

Note that disk IO is the bottleneck here, not CPU.

ADD REPLY
0
Entering edit mode

Sorry about the idiot formatting, btw - I don't know how to get paragraphs in comments.

ADD REPLY
9
Entering edit mode
13.8 years ago

The short answer is: it depends a lot on what exactly it is you need to do with the contents of the files.

Regarding being I/O bound, I think it is quite unlikely unless you have the data on very slow disks. Even a single 5400rpm disk can read in the order of 50MB/s, which means that reading a 10GB file will not take much over 3min. Since you are working with large NGS datasets, I would assume that you have a RAID system, in which case the combined speed of multiple disks will almost certainly exceed how fast you can process the data.

When it comes to speed, nothing beats a good C or C++ implementation. However, there is a huge price to pay in terms of development time. If what you do is pretty basic string operations, both Perl and Python should do fine, and the savings in development time are most likely well worth the slightly longer run time compared to a C or C++ implementation.

In my experience, Perl has a more efficient implementation of regular expressions. Conversely, if you need to load all the data into memory, the Python data structures appear to be more memory efficient.

Awk is the only of the languages that you list that I would advise you against. It is very convenient for writing short command-line hacks, but speed is not one of its virtues. I don't have any experience with using Java for bioinformatics tasks.

ADD COMMENT
4
Entering edit mode

+1 for pointing out that I/O is frequently not the bottleneck on local disks. Also a comment on regex. For short regex without "|", Python2 is >10X slower than Perl. On other things, python is faster but never like 10X. If you rely on regex a lot like me, perl is preferred (I do not know how Python3 performs).

ADD REPLY
1
Entering edit mode

Thanks a lot Lars. And you were definitely true about being I/O bound...that was not the case at all. Finally I had a try with Perl and when I read the file (7Go) line by line without doing any ops on strings it took roughly 2minutes. Manipulating both sequence and quality strings in a loop while parsing brought the runtime up to 1h25 min. (I process each string character by character). I guess that the culprit is Perl. As an interpreted language it is too slow. I would save lot of execution time if I write this parser in a compiled language. Don't you think ?

ADD REPLY
0
Entering edit mode

If what you do is to loop over the strings character by character, I have no doubt that you can win a lot by using a compiled language. Would you mind elaborating a bit on what exactly it is you need to do to the sequences and quality strings?

ADD REPLY
0
Entering edit mode

If you're using python and regular expressions are a bottleneck, you can try re2: http://pypi.python.org/pypi/re2/

ADD REPLY
4
Entering edit mode
13.8 years ago
Alex ★ 1.5k

A cheap and fast solution is Amazon EC2.

For 2$ per hour per instance you can rent one ore more High-Memory Quadruple Extra Large Instance:

68.4 GB of memory
26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: m2.4xlarge

I usually use this NGS analyse workflow:

  1. Upload data in tar.gz to EC2 instance (usually this is a bottleneck) or upload data from NCBI.
  2. The main script (Python, but it can be any other language) loads data into memory and creates independent parallel tasks.
  3. The main script launches workers (usually in C if task is compute-intensive other way in Python, or is can be some tool for NGS analysis e.g. bowtie). With one instance you get ~25 parallel workers. With two instances ~50 and so on.
  4. Workers return computed data to the main script or save it to files.
  5. Then the main script or other python script joins/reduces data computed by workers and packs result to tar.gz.
  6. Download data from Amazon (next bootleneck =)

As example, an enhanced suffix array computation for >10 Gbases fastq takes ~3 hours and 7$ where 1-2h for data upload/download. And it can be improved with more sophisticated workers.

I use Python for the main script because I like/know it. Although languages like Google's Go or Erlang will be better here. Or you can use classical map/reduce engine with Hadoop.

Links to EC2 getting started guides I've posted here

ADD COMMENT
0
Entering edit mode

@Alex, after the up and download times, does this save you time overall when compared to running it on your local machine--no matter how modest it may be?

ADD REPLY
0
Entering edit mode

@brentp, same task on Phenom II 965 x4 with 16Gb memory takes >18h (to fit into memory dataset was split to small parts). EC2 solves memory limitations without buying expensive hardware.

ADD REPLY
4
Entering edit mode
10.9 years ago
Samuel Lampa ★ 1.3k

I recently made a naive benchmark of a few popular languages, for doing really simple bioinformatics processing of fasta files (I chose to count the GC frequency, as a simple enough problem that I could figure out how to implement that in a number of different languages).

After publishing the initial benchmarks (between Python and the D language), I got a lot of feedback and suggestions for improvements, from experts in the different languages, and so the comparison grew quite a bit, and I finally got together a comparison of no less than the following languages, sorted approximately from slowest to fastest (when parsing line by line):

The final results are available in this blog post:

My personal take-home lesson from this comparison is that by using modern compiled and garbage collected language like Go and D, you can get rather close to C++/C like performance, while not having to fear introducing too many catastrophic memory leaks (due to the garbage collector) and foremost, retaining much easier to read code. As an example, have a look at these code examples:

Personally, I have to say I was highly impressed by the D version, due to it's high readability even in the face of the serious optimizations. On the other hand, although the Go version is a bit harder to read, the Go language is an increasingly popular language with a decently sized community and has a very easy to use features for writing threaded programs (although I could not get any speedups using threading for this simple example).

In terms of ready-to-use libraries, there is an up to date bioinformatics package in Go, called BioGo, that seems promising and high-quality. (For D, there is some very rudimentary support for working with some bioinformatics related data in the DScience library, although it does not seem to be very updated).

In the end, I would put my bet on Go for the time being due to it's much bigger community, and the better library support (BioGo), although I like D a lot too.

If you really don't want to get the benefits from learning a compiled language, trying to get your python code with PyPy might be an option too of course (although I heard about some incompatibilities with the NumPy/SciPy packages etc, which sounds worrying to me).

ADD COMMENT
3
Entering edit mode

Wow! FreePascal is only on 4th position? I expected to have a higher position.

I looked a bit over your Pascal code and it uses ReadLn instead of the modern TFileStream. And then you are using:

(c = 'A') or (c = 'G') or (c = 'C') or (c = 'T')

which are 4 operations instead of one, like in:

c in ['A', 'C', 'G', 'T']

or

case c of ...

I recently wrote a program in a Pascal-related language that reads FastQ files (and SFF). After some optimizations like the ones explained above I decreased the parsing time from 1 minute (see version 1.5) to 10 seconds (see v1.7).

If you are interested I could optimize your FPC code. I have FPC installed. I only need your FastQ test file.

ADD REPLY
3
Entering edit mode

Any news about this optimization? It should bring a TREMENDOUS increase in speed.

ADD REPLY
0
Entering edit mode

Many thanks for the addition, and sorry about the silence, have had very busy times! I'm planning to do a new revision of the benchmarks, as have gotten a lot of feedback on a bunch of the other languages as well, and even new ones such as julia and elixir, so stay tuned!

The main problem is I have to rerun all the benchmarks, since I don't have access to the same computer I used for the first ones. Should need to create some better organization for the benchmarks as well, so it can more easily be cloned and run at any time.

ADD REPLY
2
Entering edit mode

Perfect! I think we will see major improvements after changing that IF to CASE. At ASM level, CASE executes in only one micro instruction.

PS: Don't forget to turn on the 'compiler optimization" {$OPTIMIZATION ON} and to turn off the 'Assertions', 'Range checking' {$R-} and 'Buffer overflow checking'.


Can you run a Delphi test also? I could send the code and the exe also, but you will need Windows or Linux with WinE.

ADD REPLY
1
Entering edit mode

Free Pascal and Lazarus have even more optimization settings, see http://www.freepascal.org/docs-html/3.0.0/prog/progse49.html and http://wiki.freepascal.org/Whole_Program_Optimization for details.

ADD REPLY
1
Entering edit mode

I should add that I'm not particularly fond of Java as a bioinformatics language, due to the 100+ ms startup latency for the JVM, the inflexible memory handling and the sometimes complicated installation procedures. Also, the big amount of boilerplate code needed makes it more or less depend on using a full-blown IDE, which can be really impractical in cluster environments that only support command line based text editors, or where big resource hogs like Eclipse are really impractical to run.

ADD REPLY
1
Entering edit mode

No Perl? The workhorse of most bioinformaticians around the world. Please see comments on Perl being >10X faster than Python in regex parsing.

ADD REPLY
0
Entering edit mode

+1 Nice summary. I wanted to mention that there is a BioD project, which seems to be more active than the other D library you mention. I cannot comment on the quality or completeness of either because I have not used them.

ADD REPLY
0
Entering edit mode

That NimRod language is pretty interesting - I will benchmark their hash tables and arrays - if these can handle tens of millions of elements with half decent performance (like say Python) I think I' might start using it.

http://nimrod-lang.org/documentation.html

ADD REPLY
0
Entering edit mode

nimrod would be a lot more interesting if it had a large community.

ADD REPLY
0
Entering edit mode

one weird thing I noted it does not seem to have output formatting for numbers in the language (printf equivalent) - one needs to install an unofficial package to do that - that made me drop my investigation. IMHO that should be part of the core of any language that one wants to do science with. Made me wonder what else is missing that I can only find out later.

ADD REPLY
0
Entering edit mode

So many interesting languages. D, Rust and Go are all good. I quite like C#, too, but Mono is not as good as MS runtime. Apple's new Swift also looks interesting. Life is too short to learn exciting programing languages...

ADD REPLY
0
Entering edit mode

I think Julia will be the one I look into next. You could say that life is too short not to learn interesting programming languages.

ADD REPLY
0
Entering edit mode

Unfortunately, the link to your final results is no longer functional. Would you mind posting an updated URL?

ADD REPLY
1
Entering edit mode

@Johannes: Hi, sorry, we lost the .org domain, but got the .com one, where I have the site archived. Updated the link now, which is: https://saml.rilspace.com/moar-languagez-gc-content-in-python-d-fpc-c-and-c

I have also started working on an updated benchmark though, including some new languages. Results and code available here: https://github.com/samuell/gccontent-benchmark#readme

ADD REPLY
2
Entering edit mode
13.8 years ago
Science_Robot ★ 1.1k

You don't need to store the entire FASTQ file in memory. So you're limited mostly by disk I/O. In that case, it doesn't really matter which programming language you use.

All that matters is which language you're most comfortable with.

ADD COMMENT
2
Entering edit mode

While disks have very high latency, their throughput is perfectly reasonable.

ADD REPLY
2
Entering edit mode
13.8 years ago

Long Answer

Assuming that you are parsing to gather some statistics/information which will end up being not very big (i.e. the results will be close to constant size - independent of what the input FASTQ file size)

I would use Python, Perl, or Ruby. They are powerful at string processing. The performance will be fine as long as you don't try to pull the entire file into memory. Make a simple a while loop that processes the file line by line. I don't think there is a need to grab chunks of lines because:

  1. Your parsing operations will probably not have much significant overhead.
  2. The interpreter and/or the OS will buffer the input automatically so the processing will be done while the OS is waiting for I/O.

I usually prefer to use the pipes (e.g. cat file.fastq | ./myprog) but with the more common way (to read the file name from the arguments and to open the file inside your program ./myprog f1.fastq) the underling mechanisms should be exactly the same so I would not expect any performance differences.

Sed and Awk are great but I found that once the parsing gets a little sophisticated then I always need to switch to a general purpose programming.

Short Answer

No matter what language you choose, it's all pretty optimized automatically for you.

ADD COMMENT
0
Entering edit mode

Bash would not work - it often does things like fork other processes

ADD REPLY
2
Entering edit mode
13.8 years ago
Phis ★ 1.1k

For speed/efficiency, I'd definitely go with C or C++ (or Delphi if you're working on Windows). Also, usually the preferred way to deal with large files is to use memory mapped files that are supported by all major OSes (as far as I know). You'd usually only have part of the file (a 'view') in memory at any one time.

If it is Windows, the main API calls you want to look at are CreateFileMapping, MapViewOfFile and UnmapViewofFile.

ADD COMMENT
0
Entering edit mode

If you are not on Windows Free Pascal (with or without the Lazarus IDE) may be a good alternative to Delphi.

ADD REPLY
2
Entering edit mode
13.8 years ago

I always prefer high level languages as my own time is more valuable than machine time -> it would need to save a week of compute time for me to invest a week of development.

Moreover depending on the operations that you wish to do you may not even be able to substantially improve on the execution speed of perl/python etc. Basic I/O, splitting, tokenizing the inputs may even be faster when performed in these languages. For more complex operations time them and extrapolate from them, for example:

$ python -m timeit "'a a a a a a'.split()[0]"
1000000 loops, best of 3: 0.528 usec per loop

this clocks in at 0.5 microseconds(!) - you'd be hard pressed to write a better line splitter that returns a list from which you extract the last element

ADD COMMENT
1
Entering edit mode
13.8 years ago
Ian 6.1k

This may not be quite the answer you are looking for. But there is a useful QC tool that works with SOLiD and Illumina (not sure about others) FASTQ files. It is called FASTQC http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/.

It is a JAVA tool (which can be run by command line) that provides a variety of QC reports. The most useful I think is a box plot of quality values per base. Other reports include detection of contaminant DNA etc.

ADD COMMENT
0
Entering edit mode

@Ian. yep. Thank you Ian. We already tried fastqc, it's typically the kind of report we would like to produce but we planned to insert many more things. Thanks for the advise !

ADD REPLY

Login before adding your answer.

Traffic: 2565 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6