Here is a solution using Cascalog, which provides a high-level query interface on top of Hadoop, fully integrated with the Clojure programming language. By abstracting away the map/reduce details, it makes the code easier to write and understand. You can grab the full code example from GitHub.
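All of the snippets below live in a single namespace; here is a minimal sketch of the setup, where the exact requires are an assumption rather than the project's actual file (the full code on GitHub has the real version):

;; Assumed namespace setup; the GitHub project defines its own.
(ns bed-hadoop.core
  (:use [cascalog.api]                    ; ?<-, <-, deffilterop, defmapop, hfs-textline, stdout
        [clojure.string :only [split]])   ; split used below in parse-bed-line
  (:gen-class))                           ; AOT entry point for running under `hadoop jar`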
First, define the query. This consists of two functions: a filter that tests whether one start/end coordinate pair overlaps a second, and the Cascalog query that uses it to return only those chromosome start/end pairs in the first file that overlap the second:
(deffilterop overlaps [s1 e1 s2 e2]
  "Pass filter if s1,e1 start/end pair overlaps s2,e2 pair."
  (or (and (>= s1 s2) (< s1 e2))    ; start of the first pair falls inside the second
      (and (>= e1 s2) (< e1 e2))    ; end of the first pair falls inside the second
      (and (<= s1 s2) (>= e1 e2)))) ; first pair fully contains the second

(defn run-bed-overlap
  "Cascalog query to pull data from two BED files and intersect."
  [bed-one bed-two]
  (?<- (stdout) [?chr ?s1 ?e1]
       (bed-one ?chr ?s1 ?e1)
       (bed-two ?chr ?s2 ?e2) ; matches chromosome with bed-one
       (overlaps ?s1 ?e1 ?s2 ?e2)))
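To sanity check the query without Hadoop or HDFS, Cascalog also accepts plain in-memory sequences of tuples as generators; a small sketch with made-up data:

;; Made-up tuples for a quick REPL check; not from the original test files.
(def one-regions [["chr1" 20 30] ["chr3" 100 200]])
(def two-regions [["chr1" 25 45] ["chr2" 40 50]])
(run-bed-overlap one-regions two-regions)
;; the (stdout) sink prints a results block containing: chr1 20 30
;; since 20-30 overlaps 25-45 on chr1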
The other part is the plumbing that parses BED files and feeds them into the run-bed-overlap function:
(defmapop parse-bed-line [line]
  (take 3 (split line #"\t")))

(defn to-int [x] (Integer/parseInt x))

(defn bed-from-hfs
  "With a BED file from HDFS, produce chromosome, start, end."
  [dir]
  (let [source (hfs-textline dir)]
    (<- [?chr ?s ?e]
        (source ?line)
        (parse-bed-line ?line :> ?chr ?s-str ?e-str)
        (to-int ?s-str :> ?s)
        (to-int ?e-str :> ?e))))

(defn bed-hfs-overlap [dir1 dir2]
  (run-bed-overlap (bed-from-hfs dir1) (bed-from-hfs dir2)))
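The hadoop jar invocation below calls into bed_hadoop.core, so the namespace also needs a -main entry point that hands the two HDFS directories to bed-hfs-overlap; a minimal sketch, where the argument handling is an assumption:

;; Minimal entry point for `hadoop jar ... bed_hadoop.core dir1 dir2`;
;; relies on (:gen-class) in the ns declaration above.
(defn -main [bed-dir-one bed-dir-two]
  (bed-hfs-overlap bed-dir-one bed-dir-two))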
Push your BED files to HDFS and run on Hadoop with:
% cd /path/to/bed-hadoop
% lein deps
% lein uberjar
% hadoop fs -mkdir /tmp/bed-hadoop/bed-1
% hadoop fs -mkdir /tmp/bed-hadoop/bed-2
% hadoop fs -put test/one.bed /tmp/bed-hadoop/bed-1
% hadoop fs -put test/two.bed /tmp/bed-hadoop/bed-2
% hadoop jar bed-hadoop-0.0.1-SNAPSHOT-standalone.jar \
    bed_hadoop.core /tmp/bed-hadoop/bed-1 /tmp/bed-hadoop/bed-2
RESULTS
----------
chr1 20 30
chr2 40 50
----------
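One last piece: the lein deps and lein uberjar steps assume a project.clj roughly along these lines, with the dependency versions and AOT settings being assumptions rather than the project's actual file:

;; Sketch of a project.clj; versions are placeholders known to exist,
;; not necessarily what the GitHub project pins.
(defproject bed-hadoop "0.0.1-SNAPSHOT"
  :description "Overlap regions between two BED files using Cascalog on Hadoop."
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [cascalog "1.10.2"]]
  ;; AOT-compile so `hadoop jar ... bed_hadoop.core` can find the compiled class
  :aot [bed-hadoop.core])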