Question

Distributed Computing In Bioinformatics

8

11.7 years ago

Ngsnewbie ▴ 380

We now have some Hadoop-based packages (Crossbow, CloudBurst, etc.) for NGS data analysis, yet people still seem to prefer tools like Bowtie, TopHat, and SOAP in their work. I am a biologist, but I would like to know whether it is possible to convert serial tools into map-reduce form so that they can exploit scalable distributed computing with Hadoop and speed up research. Also, what are the challenges in adapting mapping and assembly algorithms to a Hadoop system?

I am also curious about other bioinformatics tasks that could be done with Hadoop-based projects such as Hive, Pig, and HBase, which deal with big data such as FASTQ files, SAM files, count data, or other forms of biological data.

ngs • 8.7k views

ADD COMMENT • link updated 11.7 years ago by Jeremy Leipzig 22k • written 11.7 years ago by Ngsnewbie ▴ 380

0

Please explain why you specifically want to use Hadoop. You can always parallelize your analysis without a map/reduce process, a cloud, etc.

ADD REPLY • link 11.7 years ago by Pierre Lindenbaum 164k

0

Actually, I am just exploring Hadoop technology, so I am asking about the challenges and impact of Hadoop technologies in NGS / bioinformatics data analysis. I don't specifically want to use Hadoop, but if I try it, will it be fruitful, and what hurdles would there be?

ADD REPLY • link 11.7 years ago by Ngsnewbie ▴ 380

score 6 · Answer 1 · 2013-04-05

Well, if you want to explore it, looking at the current bio*-hadoop ecosystem and related fora is a good place to start:

http://hadoopbioinfoapps.quora.com/

There you can find tools like Seal and Hadoop-BAM which target the last part of your question more specifically.

Furthermore, albeit a bit old, the following video & slides still hold as a general view on Hadoop and bioinformatics:

http://vimeo.com/7351342

http://www.slideshare.net/mndoci/hadoop-for-bioinformatics

Last but not least, a couple of my favourite blogs about Hadoop and big data in the biosciences (although not limited to them) are Follow the Data and mypopescu:

http://followthedata.wordpress.com/

http://nosql.mypopescu.com/

Hope that helps!

Ram · Answer 2 · 2013-04-05

Except for de novo assembly, the bottleneck of NGS analyses is frequently read mapping and SNP calling. For these analyses, you can trivially split your read files for mapping, and split by chromosomal region for calling, then run the jobs separately on different computing nodes. Hadoop adds little benefit in this case, while requiring a special setup which might (I am not sure) interfere with other non-Hadoop jobs. I also see the fact that few researchers understand how Hadoop works as a big obstacle.
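The splitting described above can be sketched in a few lines of Python. This is only an illustration, not anyone's actual pipeline: the function names `split_fastq` and `split_regions` are made up here, and the code assumes the common four-lines-per-record FASTQ layout with no wrapped sequences.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple


def split_fastq(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of FASTQ lines, each holding at most reads_per_chunk reads.

    Assumption: every record is exactly 4 lines (header, sequence, '+', qualities).
    """
    it = iter(lines)
    while True:
        chunk = list(islice(it, 4 * reads_per_chunk))
        if not chunk:
            return
        yield chunk


def split_regions(chrom_lengths: Dict[str, int], window: int) -> List[Tuple[str, int, int]]:
    """Cut each chromosome into 1-based, inclusive windows for per-region calling."""
    regions = []
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, window):
            regions.append((chrom, start, min(start + window - 1, length)))
    return regions
```

Each chunk of reads can then be handed to a mapper as its own job, and each region to a variant caller as its own job, using whatever scheduler the cluster already has. For paired-end data, both mate files would need to be split with the same `reads_per_chunk` so that mates stay in matching chunks.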

On the other hand, these concerns with Hadoop are relatively minor technically. If you can move the most widely used bwa-picard-gatk pipeline to Hadoop, there will be some potential users, especially those who rely on Amazon. Crossbow and CloudBurst are not so popular partly because they do not implement the best pipeline. Scientists usually choose accuracy over speed or convenience unless the difference in accuracy is negligible while the difference in speed is over a couple of orders of magnitude.

score 5 · Answer 3 · 2013-04-05

5

11.7 years ago

Jeremy Leipzig 22k

One of the more compelling uses of Hadoop would be querying variants from thousands of individuals, as illustrated with Seqware here:

http://openi.nlm.nih.gov/detailedresult.php?img=3040528_1471-2105-11-S12-S2-2&req=4

Two caveats stand out:

Equivalent BASS hardware (like this 32TB monster from Oracle) will still outperform distributed setups.
In the example above, couldn't they have simply divided individuals or variants between 6 machines running BerkeleyDB without being overly clever?
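To make the second caveat concrete, here is a minimal sketch of dividing keys (individual IDs or variant IDs) between six machines with a stable hash. It is not how Seqware works; `node_for` and `partition` are hypothetical names, and the storage layer on each machine (BerkeleyDB in the caveat above) is left out entirely.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List

NUM_NODES = 6  # the six machines mentioned in the caveat above


def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Map a key to a node index using a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(keys: Iterable[str], num_nodes: int = NUM_NODES) -> Dict[int, List[str]]:
    """Group keys by the node that would store them."""
    shards = defaultdict(list)
    for key in keys:
        shards[node_for(key, num_nodes)].append(key)
    return dict(shards)
```

With this layout, a lookup for one individual goes to a single machine, while a query across all individuals has to be sent to every machine and the partial results merged by the client.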

ADD COMMENT • link 11.7 years ago by Jeremy Leipzig 22k

4

I remember this paper. I think it falls into a typical trap for technical people: trying to put everything, such as SAM, VCF, BED and WIG, in a generic database and adding hardware when that does not work. This approach rarely gives satisfactory results in the NGS era. For huge amounts of data, we need specialized treatments and occasionally advances in methodology. Such approaches can be orders of magnitude more efficient than a generic database. We had some interaction with a few top Google engineers. When we chatted about storing many BAMs/VCFs, their reaction was to first design a specialized binary representation, rather than to put each record in BigQuery or a similar existing system. That is the right direction.
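As a toy illustration of what a specialized binary representation might look like (this is an invented layout, not the format anyone in this discussion designed), a single-nucleotide variant can be packed into six bytes with Python's `struct` module. The field choices and the names `pack_snv` and `unpack_snv` are assumptions for the example.

```python
import struct
from typing import Tuple

# Hypothetical fixed-width layout, little-endian, no padding:
# chromosome index (uint8), 1-based position (uint32), ref/alt bases in one byte.
RECORD = struct.Struct("<BIB")
BASES = "ACGT"


def pack_snv(chrom_idx: int, pos: int, ref: str, alt: str) -> bytes:
    """Encode one SNV; ref goes in bits 2-3 and alt in bits 0-1 of the last byte."""
    alleles = (BASES.index(ref) << 2) | BASES.index(alt)
    return RECORD.pack(chrom_idx, pos, alleles)


def unpack_snv(blob: bytes) -> Tuple[int, int, str, str]:
    """Decode a record produced by pack_snv."""
    chrom_idx, pos, alleles = RECORD.unpack(blob)
    return chrom_idx, pos, BASES[alleles >> 2], BASES[alleles & 0b11]
```

Because every record has the same width, the n-th record sits at byte offset `n * RECORD.size`, so a sorted file can be binary-searched by position without parsing text. Real formats also have to handle indels, genotypes and annotations, which this sketch ignores.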

ADD REPLY • link 11.7 years ago by lh3 33k