Question

Clustering and Bedtools

8.1 years ago

user_g ▴ 20

Hello,

I have BED files for two TFs, and I want to use their start and end coordinates to cluster the binding sites within a given base-pair window. I found the cluster function in bedtools, but I want a cluster to consist of two TF1 sites and one TF2 site, and I couldn't work out how to do this with bedtools cluster.

Any recommendations or any other tool?

bed bedtools • 3.0k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 8.1 years ago by user_g ▴ 20

I want my clusters to be made of 2 parts of TF1 and 1 part of the TF2

cluster doesn't work with "parts": a TF binding site is either included in a cluster or it is not. Also note that the configuration you ask for (two TF1 and one TF2 binding sites in a cluster) cannot be enforced by clustering itself, because the only relevant parameter you can adjust is the distance window. Whether you find that configuration will ultimately depend on your data.

Any recommendations or any other tool?

I guess you could use the cluster function and then parse the results to look for your configuration of interest, but this is probably suboptimal. A bit more information on your goals could help us provide better answers.

ADD REPLY • link 8.1 years ago by Carlo Yague 9.0k

Oh, thank you Carlo. What I intend to do is detect clusters made from a combination of TF binding sites in a sequence. Let's say I want to detect a cluster made of one binding site for TF A and two binding sites for TF B; I want to search the genome to find where this cluster of binding sites occurs. According to what you're saying, the cluster method won't be optimal for this case.

ADD REPLY • link 8.1 years ago by user_g ▴ 20

score 0 · Answer 1 · 2017-07-13

To do what you want with cluster:

1) Input file: a BED file of the binding sites of both TFs, sorted by chromosome and start (bedtools cluster expects sorted input), with a field naming the TF. For example:

chr1    100 150  TF1
chr1    180 230  TF2
chr1    250 280  TF1
chr1    950 1000 TF2

2) Find clusters within a reasonable window (for instance 500 bp).

bedtools cluster -d 500 -i in.bed > out.bed
cat out.bed
chr1    100 150  TF1     1
chr1    180 230  TF2     1
chr1    250 280  TF1     1
chr1    950 1000 TF2     2
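
For intuition about what the -d window does, here is a minimal Python sketch of distance-based clustering on sorted intervals. It is not bedtools' implementation: the function name cluster_intervals and the in-memory tuples are assumptions made for illustration, the input is assumed to be sorted by chromosome and start, and the exact boundary handling in bedtools may differ.

```python
def cluster_intervals(intervals, max_dist=500):
    """Assign a cluster ID to each (chrom, start, end, name) tuple.

    Assumes intervals are sorted by chrom, then start. A feature joins the
    current cluster when it is on the same chromosome and starts no more
    than max_dist bp past the furthest end seen so far in that cluster.
    """
    clustered = []
    cluster_id = 0
    cur_chrom, cur_end = None, None
    for chrom, start, end, name in intervals:
        if chrom != cur_chrom or start - cur_end > max_dist:
            cluster_id += 1              # gap too large: start a new cluster
            cur_chrom, cur_end = chrom, end
        else:
            cur_end = max(cur_end, end)  # extend the current cluster
        clustered.append((chrom, start, end, name, cluster_id))
    return clustered


sites = [
    ("chr1", 100, 150, "TF1"),
    ("chr1", 180, 230, "TF2"),
    ("chr1", 250, 280, "TF1"),
    ("chr1", 950, 1000, "TF2"),
]
for row in cluster_intervals(sites):
    print(*row, sep="\t")
```

With the example sites, the last feature starts 670 bp after the previous end (950 - 280), so it falls outside the 500 bp window and opens a second cluster.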

3) Check the number of TF1 and TF2 sites in each cluster. You can do that in bash, Python, or even Excel. I'll do it in R in this case:

clusters=read.table("out.bed", colClasses =c("factor", "numeric", "numeric", "factor", "factor"), col.names = c("chrom", "start", "end", "TF", "clust"))
by(clusters$TF, clusters$clust, summary)

clusters$clust: 1
TF1 TF2 
  2   1 
---------------------------------------------------------------------------------- 
clusters$clust: 2
TF1 TF2 
  0   1
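
The same tally can be sketched in Python for anyone who prefers it to R. The snippet assumes the five-column layout of out.bed shown above (chrom, start, end, TF name, cluster ID) and uses an inline string in place of the file so that it runs on its own; count_by_cluster is an illustrative name, not part of any library.

```python
from collections import Counter, defaultdict

# Stand-in for the contents of out.bed (tab-separated).
OUT_BED = """\
chr1\t100\t150\tTF1\t1
chr1\t180\t230\tTF2\t1
chr1\t250\t280\tTF1\t1
chr1\t950\t1000\tTF2\t2
"""


def count_by_cluster(lines):
    """Map each cluster ID to a Counter of TF names found in that cluster."""
    counts = defaultdict(Counter)
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        tf, cluster_id = fields[3], fields[4]
        counts[cluster_id][tf] += 1
    return counts


counts = count_by_cluster(OUT_BED.splitlines())
for cluster_id, tally in counts.items():
    print(cluster_id, dict(tally))
```

To run it on the real file, pass an open file handle instead, for example count_by_cluster(open("out.bed")).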

4) Find the clusters that contain exactly two TF1 sites and one TF2 site. Once again I'll use R, but you can use something else.

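# TRUE for clusters with exactly two TF1 sites and one TF2 site (counts looked up by name, not position)
by(clusters$TF, clusters$clust, function(x) { n <- summary(x); n["TF1"] == 2 & n["TF2"] == 1 })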

clusters$clust: 1
[1] TRUE
---------------------------------------------------------------------------------- 
clusters$clust: 2
[1] FALSE
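
As a Python alternative to the R filter, the sketch below keeps only the clusters whose composition is exactly two TF1 sites and one TF2 site, and reports the genomic span of each, which gives the locations the question asks about. The function name matching_clusters and the wanted dictionary are assumptions made for illustration; the rows are written inline in the same layout as out.bed.

```python
from collections import Counter

# (chrom, start, end, TF name, cluster ID), as in out.bed above.
rows = [
    ("chr1", 100, 150, "TF1", 1),
    ("chr1", 180, 230, "TF2", 1),
    ("chr1", 250, 280, "TF1", 1),
    ("chr1", 950, 1000, "TF2", 2),
]


def matching_clusters(rows, wanted):
    """Return (chrom, start, end, cluster_id) for clusters whose TF counts equal `wanted`."""
    tallies, spans = {}, {}
    for chrom, start, end, tf, cid in rows:
        tallies.setdefault(cid, Counter())[tf] += 1
        if cid in spans:
            _, lo, hi = spans[cid]
            spans[cid] = (chrom, min(lo, start), max(hi, end))
        else:
            spans[cid] = (chrom, start, end)
    # Exact comparison: extra sites of any TF disqualify the cluster.
    return [
        (*spans[cid], cid)
        for cid, tally in tallies.items()
        if dict(tally) == wanted
    ]


hits = matching_clusters(rows, {"TF1": 2, "TF2": 1})
print(hits)
```

Exact matching means a larger cluster that merely contains the pattern (for example three TF1 sites and one TF2 site) is not reported; relax the comparison if such cases should count.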