Why is Hadoop not used a lot in bio-informatics?
10.1 years ago
William ★ 5.3k

Why is Hadoop not used a lot in bio-informatics? At least in my experience, I don't see Hadoop being used by local research groups or by the well-known and well-funded research groups in the UK and USA, even though Hadoop offers fully distributed I/O and CPU power, which should be very attractive for large bioinformatics data analyses.

Is it that the types of files are not suitable for Hadoop? For example, 1000 large binary BAM files of 100 GB each? Can Hadoop work with binary files of that size?

Or is it that common tools like BWA, Picard, HTSJDK and GATK can't be run natively on Hadoop?

Mapping 1000 FASTQ files to SAM files is something that can be done in parallel for every record in the FASTQ files, and is, I think, well suited for Hadoop.

But is mapping entire BAM files to gVCF files (mapping as in the functional MapReduce paradigm) something that can be done on Hadoop?

And is reducing the gVCF files to VCF files (reducing as in the functional MapReduce paradigm) something that can be done on Hadoop?
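
To make that map/reduce framing concrete, here is a minimal sketch using PySpark's map/reduce primitives (the same programming model Hadoop MapReduce exposes). The align_record() and merge_gvcf_blocks() functions and the file paths are made-up placeholders, not real BWA or GATK calls, and a real FASTQ record of course spans four lines rather than one:

    from pyspark import SparkContext

    def align_record(fastq_record):
        # hypothetical placeholder: align one read, return a (region, sam_line) pair
        return ("chr1", fastq_record.upper())

    def merge_gvcf_blocks(block_a, block_b):
        # hypothetical placeholder: combine two per-region gVCF blocks into one
        return block_a + "\n" + block_b

    sc = SparkContext(appName="fastq-map-reduce-sketch")

    # simplified: treats each line as one FASTQ record
    reads = sc.textFile("hdfs:///data/sample.fastq")
    aligned = reads.map(align_record)                  # "map": independent per record
    merged = aligned.reduceByKey(merge_gvcf_blocks)    # "reduce": merge per region key
    merged.saveAsTextFile("hdfs:///data/sample_vcf")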

As you might have gathered, I only have limited knowledge of Hadoop and the Hadoop file system (HDFS), and I am wondering whether they are suitable for common genomics data formats and common analysis steps such as mapping reads with, for example, BWA and calling variants with, for example, the GATK HaplotypeCaller.

edit:

I found Hadoop-BAM and SeqPig, but I am wondering whether these are just papers / technical proofs of concept or whether they also see any real-world use.


You might be interested in this thread if you haven't already seen it: Distributed Computing In Bioinformatics

10.1 years ago
lh3 33k

Most of the applications you mentioned can be, and have already been, implemented on top of Hadoop. A good example is the ADAM format, a Hadoop-friendly replacement for BAM, and its associated tools. They are under active development by professional programmers. Nonetheless, I see a few obstacles to wider adoption:

  1. It is harder to find a local Hadoop cluster. My impression is that Hadoop really shines in large-scale cloud computing, where there is a huge (virtual) pool of resources that can respond to users on demand. In a multi-user environment with limited resources, I don't know whether a local Hadoop cluster is as good as LSF/SGE at fairly balancing resources across users.
  2. We can use AWS, Google Cloud, etc., but we have to pay. Some research labs may find this unfamiliar, and those who have free access to institution-wide resources would be even more reluctant.
  3. Some pipelines are able to call variants from 1 billion raw reads in 24 hours using multiple CPU cores. This is already good enough in comparison to the time and cost spent on sequencing, so there is not a huge need for better technologies. In addition, although Hadoop frequently saves wall-clock time thanks to its scalability, it sometimes wastes CPU cycles on its extra layer. In a production setting, the total CPU time across many jobs matters more than the wall-clock time of a single job. Some argue that Hadoop's compute-close-to-data model is better, but for many analyses we only read through the data once, so the amount of data moved over the network is the same whether we dispatch data to compute nodes or use the Hadoop model.
  4. Improvements to algorithms frequently have a much bigger impact on data processing than switching to a better technology. For example, there is a Hadoop version of MarkDuplicates that takes much less wall-clock time (though more CPU time) than Picard. However, recent streaming algorithms, such as SamBlaster and the new Picard, can do this faster in terms of both CPU and wall-clock time. For another example, there was a concern about the technical difficulty of multi-sample variant calling, so someone developed a Hadoop-based caller; by the time it came out, GATK had moved to gVCF, which solves the problem in a much better way, at least for deep sequencing. Personally, I would rather improve algorithms than adapt my working tools to Hadoop.

For some large on-demand services, Hadoop running on massive cloud-computing providers is hugely advantageous over the traditional computing model. Hadoop may also do a better job for certain bioinformatics tasks (gVCF merging and de novo assembly come to mind). However, for the majority of analyses, Hadoop only adds complexity and may even hurt performance.

10.1 years ago

You are correct in noting that most Hadoop-for-bioinformatics papers are proofs of concept and that real-world use of Hadoop in bioinformatics is quite low.

Hadoop combines two awesome bottlenecks to bring bioinformatics software to its knees: using the network to disperse data and then relying on disk I/O to access it (often from the same networked drive).

There are some bioinformatics applications that may benefit from MapReduce, but those tend to closely resemble the type of e-commerce problems Hadoop was designed to solve. In most use cases I suspect threaded programs designed for big-ass servers would perform better than their Hadoop counterparts.

I am interested to see how the Spark/Avro/Parquet stack performs, as it relies much more on RAM, and hence on those same big-ass-server boxes.

10.1 years ago

Here are some thoughts on this issue from Attila Csordas.

And here are some from me.

10.1 years ago
dw314159 ▴ 40

I used Hadoop for a bioinformatics analysis of mRNA complexity. The analysis and results are described at http://badassdatascience.com/2014/05/16/mrna-complexity-by-region/. Source code is provided.

10.1 years ago
User 59 13k

This blog post from Abhishek Tiwari is a couple of years old, but clearly shows there are a number of applications out there using this kind of methodology:

http://abhishek-tiwari.com/post/mapreduce-and-hadoop-algorithms-in-bioinformatics-papers

I imagine plenty more have appeared in the last couple of years.


Thanks. It would be nice to know whether these papers are just papers or whether the described tools also see any real-world use (except GATK, which uses a MapReduce engine but does not run on Hadoop). It's the (lack of) real-world use that I am interested in, not just the technical proofs of concept.

10.1 years ago
William ★ 5.3k

I did some more reading. The most promising development for distributed computing in genomics indeed looks to be ADAM (as lh3 mentioned) and the related formats and toolkits:

A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.

Current genomic file formats are not designed for distributed processing. ADAM addresses this by explicitly defining data formats as Apache Avro objects and storing them in Parquet files. Apache Spark is used as the cluster execution system.

Once you convert your BAM file to ADAM, it can be directly accessed by Hadoop Map-Reduce, Spark, Shark, Impala, Pig, Hive, whatever. Using ADAM will unlock your genomic data and make it available to a broader range of systems.

At the moment, we are working on three projects:

  • ADAM: A scalable API & CLI for genome processing
  • bdg-formats: Schemas for genomic data
  • avocado: A Variant Caller, Distributed

http://bdgenomics.org/
https://github.com/bigdatagenomics/adam

I don't (yet) know whether they support the full feature set of BWA-Picard-GATK at production quality, but it sure looks interesting.
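
For a feel of what "directly accessed by Spark" could look like in practice, here is a hedged sketch using PySpark's DataFrame API. The file path and the contigName column are assumptions for illustration and may not match the exact bdg-formats schema:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("adam-parquet-sketch").getOrCreate()

    # hypothetical path; ADAM-style data sits in a directory of Parquet files
    alignments = spark.read.parquet("hdfs:///data/sample.alignments.adam")
    alignments.printSchema()

    # e.g. count reads per contig without needing a BAM index
    # ("contigName" is an assumed column name; check the actual schema)
    alignments.groupBy("contigName").count().show()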


Wikipedia tells me Spark is up to 100x faster than Hadoop MapReduce, which raises the question: what was holding Hadoop up so much?


I guess Hadoop is slow mainly because it uses disks too much. Spark is largely an improved, RAM-oriented implementation of Hadoop concepts. For the 100x speed-up, the wiki links to a paper about Shark, which is a Spark-based SQL engine. For database queries, in-memory access is of course far faster than disk access; for other applications, the speedup may be marginal.


You're right that Spark is faster in some use cases because it keeps results in RAM, but it's important to add that this mainly pays off for iterative algorithms where the data is going to be used again and again. In those cases Spark saves time by cutting out repeated round trips to the hard disk. Hadoop is still fast (faster?) when the data only needs to be written back to disk once. Spark is expected to become more important than Hadoop over time.
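
As a toy illustration of that caching effect, here is a small PySpark sketch; the input file and the iterative loop are made up purely for illustration:

    from pyspark import SparkContext

    sc = SparkContext(appName="cache-sketch")

    # hypothetical input: one coverage value per line
    coverage = sc.textFile("hdfs:///data/coverage.txt").map(float)
    coverage.cache()                  # keep the RDD in RAM after the first pass

    threshold = 0.0
    for _ in range(10):               # an iterative loop that re-reads the same data
        threshold = coverage.filter(lambda x: x > threshold).mean()

    print(threshold)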


With Hadoop 2.0, which uses YARN instead of MapReduce as the resource manager (and is suitable for streaming applications), bioinformatics should be just the field for it.

9.2 years ago
u1058969 • 0

Hi,

I'm currently using an R and Hadoop environment for bioinformatics research.

For me, this is feasible if you have knowledge of Linux, Java, R, Hadoop and biology, and it doesn't cost anything because they are all open-source software.

You could even develop your own packages to optimise the framework for R or Hadoop.
