Distributed Computing In Bioinformatics
3
8
11.7 years ago
Ngsnewbie ▴ 380

As of now we have some Hadoop-based packages (Crossbow, CloudBurst, etc.) for NGS data analysis, yet I find that people still prefer tools like Bowtie, TopHat, SOAP, etc. in their work. I am a biologist, but I would like to know whether it is possible to convert serial tools into map-reduce form, so that scalable distributed computing with Hadoop can be used to expedite research. Also, what are the challenges in adapting mapping and assembly algorithms to a Hadoop system?

I am also curious about other bioinformatics tasks that can be done using Hadoop-based projects like Hive, Pig and HBase, which deal with big data such as FASTQ files, SAM, count data or other forms of biological data.

ngs • 8.7k views
0

Please explain why you specifically want to use Hadoop. You can always parallelize your analysis without a map/reduce process, cloud, etc.

0

Actually I am just exploring Hadoop technology, so I am asking about the challenges and impact of Hadoop technologies in NGS / bioinformatics data analysis. I don't specifically want to use Hadoop, but if I try it, will it be fruitful, and what hurdles would there be?

6
11.7 years ago

Well, if you want to explore it, looking at the current bio*-hadoop ecosystem and related fora is a good place to start:

http://hadoopbioinfoapps.quora.com/

There you can find tools like Seal and Hadoop-BAM which target the last part of your question more specifically.

Furthermore, albeit a bit old, the following video & slides still hold as a general view on Hadoop and bioinformatics:

http://vimeo.com/7351342

http://www.slideshare.net/mndoci/hadoop-for-bioinformatics

Last but not least, a couple of my favourite blogs about Hadoop and big data in biosciences (although not limited to them) are Follow the Data and mypopescu:

http://followthedata.wordpress.com/

http://nosql.mypopescu.com/

Hope that helps!

6
11.7 years ago
lh3 33k

Except for de novo assembly, the bottleneck of NGS analyses is frequently read mapping and SNP calling. For these analyses, you can trivially split your read files for mapping and chromosomal regions for calling, and run the jobs separately on different computing nodes. Hadoop adds little benefit in this case while requiring a special setup, which might (I am not sure) interfere with other non-Hadoop jobs. I also see the fact that few researchers understand how Hadoop works as a big obstacle.

On the other hand, these concerns with Hadoop are relatively minor technically. If you can move the most widely used BWA-Picard-GATK pipeline to Hadoop, there will be some potential users, especially when they rely on Amazon. Crossbow and CloudBurst are not so popular partly because they do not implement the best pipeline. Scientists usually choose accuracy over speed/convenience unless the difference in accuracy is negligible while the difference in speed is over a couple of orders of magnitude.
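
The "trivial split" described in this answer (read files for mapping, chromosomal regions for calling) might look roughly like the sketch below. It is only an illustration, not code from the answer: the function names, the round-robin policy and the window size are assumptions, and the input is assumed to be strict 4-line FASTQ records. Each chunk or region would then be handed to an independent job on a separate node, and the per-chunk results merged afterwards.

```python
from itertools import islice


def read_fastq(lines):
    """Yield FASTQ records as 4-line tuples (assumes strict 4-line records)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_fastq(lines, n_chunks):
    """Distribute records round-robin into n_chunks lists, one per mapping job."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(read_fastq(lines)):
        chunks[i % n_chunks].append(record)
    return chunks


def split_regions(chrom_lengths, window):
    """Cut each chromosome into (chrom, start, end) windows, 1-based inclusive,
    so that each window can be given to a separate variant-calling job."""
    regions = []
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, window):
            regions.append((chrom, start, min(start + window - 1, length)))
    return regions
```

For paired-end data, both mate files would need to be split with the same policy so that mates end up in the chunk with the same index.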

5
11.7 years ago

One of the more compelling uses of Hadoop would be querying variants from thousands of individuals, as illustrated with Seqware here:


http://openi.nlm.nih.gov/detailedresult.php?img=3040528_1471-2105-11-S12-S2-2&req=4

Two caveats stand out:

  1. Equivalent BASS hardware (like this 32TB monster from Oracle) will still outperform distributed setups.

  2. In the example above, couldn't they have simply divided individuals or variants between 6 machines running BerkeleyDB without being overly clever?
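
To make the second caveat concrete, here is a minimal sketch of dividing individuals between 6 shards by a stable hash and answering a query by scanning every shard. It is an assumption-laden illustration: plain in-memory lists stand in for the per-machine stores (it does not use BerkeleyDB), and the record layout `(individual, chrom, pos, ref, alt)` and all function names are hypothetical.

```python
import hashlib

N_SHARDS = 6  # mirrors the "6 machines" in the caveat above


def shard_for(individual_id, n_shards=N_SHARDS):
    """Map an individual to a shard with a stable hash.
    (Python's built-in hash() of str is randomized per process, so md5 is used.)"""
    digest = hashlib.md5(individual_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def partition(records, n_shards=N_SHARDS):
    """Place each (individual, chrom, pos, ref, alt) record on its individual's shard."""
    shards = [[] for _ in range(n_shards)]
    for rec in records:
        shards[shard_for(rec[0], n_shards)].append(rec)
    return shards


def query(shards, predicate):
    """Scatter the predicate to every shard and gather the matching records."""
    hits = []
    for shard in shards:
        hits.extend(r for r in shard if predicate(r))
    return hits
```

In a real deployment each shard would live on its own machine and the loop in `query` would run in parallel; the point is only that the partitioning logic itself is a few lines.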

4

I remember this paper. I think it falls into a typical trap for technical people: trying to put everything, such as SAM, VCF, BED and WIG, in a generic database and adding hardware when that does not work. This approach rarely gives satisfactory results in the NGS era. For huge amounts of data, we need specialized treatments and occasionally advances in methodology. Such approaches can be orders of magnitude more efficient than a generic database. We had some interaction with a few top Google engineers. When we chatted about storing many BAMs/VCFs, their reaction was to first design a specialized binary representation rather than to put each record in their BigQuery or similar existing systems. That is the right direction.
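
As a toy illustration of what a "specialized binary representation" can mean, the sketch below packs a single-base variant into 8 bytes with Python's `struct` module. The layout (chromosome index, position, reference and alternate base codes) is invented for this example; it is not the format the engineers mentioned above had in mind, nor any existing file format.

```python
import struct

# Hypothetical fixed-width layout, little-endian, no padding:
# uint16 chromosome index, uint32 1-based position, uint8 ref code, uint8 alt code
RECORD = struct.Struct("<HIBB")
BASES = "ACGT"


def pack_snp(chrom_idx, pos, ref, alt):
    """Encode one SNP as 8 bytes."""
    return RECORD.pack(chrom_idx, pos, BASES.index(ref), BASES.index(alt))


def unpack_snp(buf):
    """Decode 8 bytes back into (chrom_idx, pos, ref, alt)."""
    chrom_idx, pos, ref, alt = RECORD.unpack(buf)
    return chrom_idx, pos, BASES[ref], BASES[alt]
```

Fixed-width records like this can be located by offset arithmetic instead of text parsing, which is one reason a purpose-built encoding can be far more efficient than storing each record as a row in a generic database.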

