Why is Hadoop not used much in bioinformatics? At least in my experience, I don't see Hadoop being used by local research groups or by the well-known and well-funded research groups in the UK and USA, even though Hadoop offers fully distributed I/O and CPU power, which should be very attractive for large bioinformatics data analyses.
Is it that the file types are not suitable for Hadoop? For example, 1000 large binary BAM files of 100 GB each? Can Hadoop work with binary files of that size?
Or is it that common tools like BWA, Picard, HTSJDK and GATK can't be run natively on Hadoop?
Mapping 1000 FASTQ files to SAM files can be done in parallel for every record in the FASTQ files, which I think makes it well suited for Hadoop.
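To make the per-record idea concrete, here is a minimal Hadoop MapReduce sketch in Java. It does not do the alignment itself (BWA is native C code); it only shows the shape of a job in which every read is handled by an independent map call. It assumes the FASTQ has first been flattened so that each line holds one read as id<TAB>sequence<TAB>qualities, and the GC-content statistic it computes is purely illustrative; both are assumptions on my part, not an existing tool.

```java
// Minimal sketch, not production code: counts reads per GC-content bin,
// assuming flattened FASTQ input where each line is "id<TAB>sequence<TAB>qualities"
// (a hypothetical preprocessing step).
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReadGcHistogram {

    // Mapper: one call per read; this is the "every record in parallel" part.
    public static class GcMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final IntWritable bucket = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 2) return;
            String seq = fields[1];
            int gc = 0;
            for (int i = 0; i < seq.length(); i++) {
                char c = seq.charAt(i);
                if (c == 'G' || c == 'C' || c == 'g' || c == 'c') gc++;
            }
            // Bin GC content into steps of 10 percentage points (bins 0..10).
            bucket.set(seq.isEmpty() ? 0 : (gc * 10) / seq.length());
            context.write(bucket, ONE);
        }
    }

    // Reducer (also used as combiner): sums the counts per GC bin.
    public static class SumReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "fastq-gc-histogram");
        job.setJarByClass(ReadGcHistogram.class);
        job.setMapperClass(GcMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Something like `hadoop jar reads.jar ReadGcHistogram /input/flattened_fastq /output/gc_hist` would run it, with HDFS splitting the input across the cluster.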
But is mapping entire BAM files to gVCF files (mapping as in the functional MapReduce paradigm) something that can be done on Hadoop?
And is reducing the gVCF files to a final VCF (reducing as in the functional MapReduce paradigm) something that can be done on Hadoop?
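Here is how I picture that split in MapReduce terms, as a plain-Java sketch (no Hadoop and no real GATK calls): per-sample, per-region variant calling is the embarrassingly parallel "map" step, and combining all samples' gVCF blocks for a region into joint calls is the "reduce" step. The VariantRecord type and the callVariants/jointGenotype methods are hypothetical placeholders for what HaplotypeCaller and GenotypeGVCFs actually do; the sample names and regions are just example values.

```java
// Conceptual sketch only: models the BAM -> gVCF -> VCF workflow as a
// MapReduce-style shuffle-and-reduce over genomic regions.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GvcfReduceSketch {

    // Hypothetical record: one gVCF block for one sample in one region.
    record VariantRecord(String region, String sample, String genotypeBlock) {}

    // "Map" phase: per (sample, region) shard; embarrassingly parallel.
    // Placeholder for running HaplotypeCaller on one sample's BAM over one region.
    static List<VariantRecord> callVariants(String sample, String region) {
        return List.of(new VariantRecord(region, sample, sample + "@" + region));
    }

    // "Reduce" phase: all samples' gVCF blocks for one region are combined
    // into joint calls, analogous to GenotypeGVCFs.
    static String jointGenotype(String region, List<VariantRecord> perSample) {
        return region + ": joint call over " + perSample.size() + " samples";
    }

    public static void main(String[] args) {
        List<String> samples = List.of("NA12878", "NA12891", "NA12892");
        List<String> regions = List.of("chr1:1-10000000", "chr1:10000001-20000000");

        // Map: emit (region -> record) pairs; on Hadoop each call would be a task.
        Map<String, List<VariantRecord>> shuffled = new HashMap<>();
        for (String sample : samples) {
            for (String region : regions) {
                for (VariantRecord r : callVariants(sample, region)) {
                    shuffled.computeIfAbsent(r.region(), k -> new ArrayList<>()).add(r);
                }
            }
        }

        // Reduce: one joint-genotyping call per region key.
        shuffled.forEach((region, records) ->
                System.out.println(jointGenotype(region, records)));
    }
}
```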
As you might have gathered, I have only limited knowledge of Hadoop and the Hadoop Distributed File System (HDFS), and I am wondering whether they are suitable for common genomics data formats and common analysis steps, such as mapping reads with, for example, BWA and calling variants with, for example, the GATK HaplotypeCaller.
edit:
I found Hadoop-BAM and SeqPig, but I am wondering whether these are just papers / technical proofs of concept or whether they also see any real-world use?
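For what it's worth, here is roughly what a job built on Hadoop-BAM seems to look like, based on my reading of its documentation: the input format splits BAM files across HDFS blocks and hands each mapper htsjdk SAMRecord objects, here just to count reads per contig. I have not run this, and the class names (org.seqdoop.hadoop_bam.AnySAMInputFormat, SAMRecordWritable) are my assumption about the current API, so treat it as a sketch rather than working code.

```java
// Sketch (untested) of a read-counting job on top of Hadoop-BAM.
// Assumption: AnySAMInputFormat yields (LongWritable, SAMRecordWritable) pairs.
import java.io.IOException;

import htsjdk.samtools.SAMRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.seqdoop.hadoop_bam.AnySAMInputFormat;
import org.seqdoop.hadoop_bam.SAMRecordWritable;

public class BamReadsPerContig {

    public static class ContigMapper
            extends Mapper<LongWritable, SAMRecordWritable, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text contig = new Text();

        @Override
        protected void map(LongWritable key, SAMRecordWritable value, Context context)
                throws IOException, InterruptedException {
            SAMRecord read = value.get();          // one htsjdk record per map call
            contig.set(read.getReferenceName());   // "*" for unmapped reads
            context.write(contig, ONE);
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) sum += v.get();
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "bam-reads-per-contig");
        job.setJarByClass(BamReadsPerContig.class);
        job.setInputFormatClass(AnySAMInputFormat.class);  // splits BAMs across HDFS blocks
        job.setMapperClass(ContigMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

If this is how it works, the appeal would be that HDFS handles splitting the 100 GB BAMs across nodes, which is exactly the file-size question I asked above, but I would still like to hear whether anyone uses it in production.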
You might be interested in this thread if you haven't already seen it: Distributed Computing In Bioinformatics