You could use BEDOPS bedmap
with its overlap options:
Overlap Options (At most, one may be selected. By default, --bp-ovr 1 is used):
--------
--bp-ovr <int> Require <int> bp overlap between elements of input files.
--exact First 3 fields from <map-file> must be identical to <ref-file>'s.
--fraction-both <val> Both --fraction-ref <val> and --fraction-map <val> must be true to
qualify as overlapping. Expect 0 < val <= 1.
--fraction-either <val> Either --fraction-ref <val> or --fraction-map <val> must be true to
qualify as overlapping. Expect 0 < val <= 1.
--fraction-map <val> The fraction of the element's size from <map-file> that must overlap
the element in <ref-file>. Expect 0 < val <= 1.
--fraction-ref <val> The fraction of the element's size from <ref-file> that must overlap
an element in <map-file>. Expect 0 < val <= 1.
--range <int> Grab <map-file> elements within <int> bp of <ref-file>'s element,
where 0 <= int. --range 0 is an alias for --bp-ovr 1.
You can use bedops --chop
to make 20nt disjoint windows:
$ echo -e "2\t0\t80" | bedops --chop 20 - > windows.bed
$ cat windows.bed
2 0 20
2 20 40
2 40 60
2 60 80
Given a list of TFs:
$ cat tfs.bed
2 2 8 T1 +
2 10 15 T2 +
2 17 20 T1 +
2 21 25 T2 +
2 27 30 T1 +
2 31 35 T1 +
2 40 50 T1 +
2 52 56 T1 +
2 58 65 T1 +
2 63 68 T2 +
You could map them to windows via bedmap --echo-map-id
and one of the overlap options described above, if you want to constrain the result.
For example, here is a sample of output when a TF interval must fall entirely within a window (--fraction-map 1
):
$ bedmap --fraction-map 1 --echo --echo-map-id --echo-ref-row-id --delim '\t' windows.bed tfs.bed > answer.fm1.bed
$ cat answer.fm1.bed
2 0 20 T1;T2;T1 id-1
2 20 40 T2;T1;T1 id-2
2 40 60 T1;T1 id-3
2 60 80 T2 id-4
If you want to relax this overlap criteria, you can get all TFs that associate with a window, if half of a TF overlaps a window (--fraction-map 0.5
):
$ bedmap --fraction-map 0.5 --echo --echo-map-id --echo-ref-row-id --delim '\t' windows.bed tfs.bed > answer.fm0p5.bed
$ cat answer.fm0p5.bed
2 0 20 T1;T2;T1 id-1
2 20 40 T2;T1;T1 id-2
2 40 60 T1;T1 id-3
2 60 80 T1;T2 id-4
If you want to get associations where clusters overlap, then perhaps use staggered windows. Here's an example of using bedops --chop
with --stagger
and -x
to get 20nt windows which overlap by 10nt. The -x
option removes any non-20nt sized window at the end:
$ echo -e "2\t0\t80" | bedops --chop 20 --stagger 10 -x - > s.windows.bed
$ cat s.windows.bed
2 0 20
2 10 30
2 20 40
2 30 50
2 40 60
2 50 70
2 60 80
Then repeat the bedmap
operations, but use the s.windows.bed
file in place of windows.bed
:
$ bedmap --fraction-map 1 --echo --echo-map-id --echo-ref-row-id --delim '\t' s.windows.bed tfs.bed
2 0 20 T1;T2;T1 id-1
2 10 30 T2;T1;T2;T1 id-2
2 20 40 T2;T1;T1 id-3
2 30 50 T1;T1 id-4
2 40 60 T1;T1 id-5
2 50 70 T1;T1;T2 id-6
2 60 80 T2 id-7
If you want to preserve strand and other information, use --echo-map
in place of --echo-map-id
, and use awk
or Python scripts to post-process results to get the exact columns you want.
It's not totally clear to me what your window size definitions are, but hopefully the above gives you some ideas of what you could do with existing command-line set overlap toolkits, which are written to work quickly and correctly for whole-genome scale analyses.
I don't have much experience with it, but bedr is an R package that interfaces with BEDOPS and other set overlap kits. You might use that for bedmap
operations, if you have to use R.
Thank you! I will try it