Question

Clustering coordinates based on window size

1

Entering edit mode

6.6 years ago

user_g ▴ 20

Hello,

I am writing a script that can detect clusters of transcription factors found within a certain window size. The code can read bed files and the user has the option to choose how many transcription factor binding sites he wants within this window. The aim is also to allow (or prevent) certain overlap within binding sites or between clusters themselves.

Here is an input example where the user can choose to have 2 sites from TF1 and 1 site from TF2 within a window size of 20bp.

Chrom Start End Name Strand
    2   2   8   T1      +
    2   10  15  T2      +
    2   17  20  T1      +
    2   21  25  T2      +
    2   27  30  T1      +
    2   31  35  T1      +
    2   40  50  T1      +
    2   52  56  T1      +
    2   58  65  T1      +
    2   63  68  T2      +

In case no overlap is allowed between clusters, these are the results :

 Chrom Start  End      Name       Strand    Cluster
     2  2     20    T1,T1,T2        +       Cluster_1
     2  21    35    T1,T1,T2        +       Cluster_2
     2  52    68    T1,T1,T2        +       Cluster_3

In case overlap between cluster is allowed, then :

Chrom Start  End      Name       Strand    Cluster
   2    2     20    T1,T1,T2        +       Cluster_1
   2    17    30    T1,T1,T2        +       Cluster_2
   2    21    35    T1,T1,T2        +       Cluster_3
   2    52    68    T1,T1,T2        +       Cluster_4

I have written the code for this however it is becoming slow when dealing with large data.

Is there any built in method(s) which can be used to cluster genomic intervals efficiently (i.e fast) by window size taking into account distance/overlap between sites?

Any help or hint is appreciated,

Many thanks in advance,

R GenomicRanges IRanges GRanges • 1.5k views

ADD COMMENT • link updated 6.6 years ago by Alex Reynolds 36k • written 6.6 years ago by user_g ▴ 20

score 0 · Answer 1 · 2018-05-09

You could use BEDOPS bedmap with its overlap options:

Overlap Options (At most, one may be selected.  By default, --bp-ovr 1 is used):                
 --------                                                                                       
  --bp-ovr <int>           Require <int> bp overlap between elements of input files.            
  --exact                  First 3 fields from <map-file> must be identical to <ref-file>'s.    
  --fraction-both <val>    Both --fraction-ref <val> and --fraction-map <val> must be true to   
                             qualify as overlapping.  Expect 0 < val <= 1.                      
  --fraction-either <val>  Either --fraction-ref <val> or --fraction-map <val> must be true to  
                             qualify as overlapping.  Expect 0 < val <= 1.                      
  --fraction-map <val>     The fraction of the element's size from <map-file> that must overlap 
                             the element in <ref-file>.  Expect 0 < val <= 1.                   
  --fraction-ref <val>     The fraction of the element's size from <ref-file> that must overlap 
                             an element in <map-file>.  Expect 0 < val <= 1.                    
  --range <int>            Grab <map-file> elements within <int> bp of <ref-file>'s element,    
                             where 0 <= int.  --range 0 is an alias for --bp-ovr 1.

You can use bedops --chop to make 20nt disjoint windows:

$ echo -e "2\t0\t80" | bedops --chop 20 - > windows.bed
$ cat windows.bed
2   0   20
2   20  40
2   40  60
2   60  80

Given a list of TFs:

$ cat tfs.bed
2   2   8   T1      +
2   10  15  T2      +
2   17  20  T1      +
2   21  25  T2      +
2   27  30  T1      +
2   31  35  T1      +
2   40  50  T1      +
2   52  56  T1      +
2   58  65  T1      +
2   63  68  T2      +

You could map them to windows via bedmap --echo-map-id and one of the overlap options described above, if you want to constrain the result.

For example, here is a sample of output when a TF interval must fall entirely within a window (--fraction-map 1):

$ bedmap --fraction-map 1 --echo --echo-map-id --echo-ref-row-id --delim '\t' windows.bed tfs.bed > answer.fm1.bed
$ cat answer.fm1.bed
2   0   20  T1;T2;T1    id-1
2   20  40  T2;T1;T1    id-2
2   40  60  T1;T1   id-3
2   60  80  T2  id-4

If you want to relax this overlap criteria, you can get all TFs that associate with a window, if half of a TF overlaps a window (--fraction-map 0.5):

$ bedmap --fraction-map 0.5 --echo --echo-map-id --echo-ref-row-id --delim '\t' windows.bed tfs.bed > answer.fm0p5.bed
$ cat answer.fm0p5.bed
2   0   20  T1;T2;T1    id-1
2   20  40  T2;T1;T1    id-2
2   40  60  T1;T1   id-3
2   60  80  T1;T2   id-4

If you want to get associations where clusters overlap, then perhaps use staggered windows. Here's an example of using bedops --chop with --stagger and -x to get 20nt windows which overlap by 10nt. The -x option removes any non-20nt sized window at the end:

$ echo -e "2\t0\t80" | bedops --chop 20 --stagger 10 -x - > s.windows.bed
$ cat s.windows.bed
2   0   20
2   10  30
2   20  40
2   30  50
2   40  60
2   50  70
2   60  80

Then repeat the bedmap operations, but use the s.windows.bed file in place of windows.bed:

$ bedmap --fraction-map 1 --echo --echo-map-id --echo-ref-row-id --delim '\t' s.windows.bed tfs.bed
2   0   20  T1;T2;T1    id-1
2   10  30  T2;T1;T2;T1 id-2
2   20  40  T2;T1;T1    id-3
2   30  50  T1;T1   id-4
2   40  60  T1;T1   id-5
2   50  70  T1;T1;T2    id-6
2   60  80  T2  id-7

If you want to preserve strand and other information, use --echo-map in place of --echo-map-id, and use awk or Python scripts to post-process results to get the exact columns you want.

It's not totally clear to me what your window size definitions are, but hopefully the above gives you some ideas of what you could do with existing command-line set overlap toolkits, which are written to work quickly and correctly for whole-genome scale analyses.

I don't have much experience with it, but bedr is an R package that interfaces with BEDOPS and other set overlap kits. You might use that for bedmap operations, if you have to use R.