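To do what you want with bedtools cluster:
1) Input file: a BED file of the binding sites of the two TFs, with a field referring to the TF. bedtools cluster expects the file to be sorted by chromosome, then by start position. Example: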
chr1 100 150 TF1
chr1 180 230 TF2
chr1 250 280 TF1
chr1 950 1000 TF2
2) Find clusters within a reasonable window (for instance 500 bp).
bedtools cluster -d 500 -i in.bed > out.bed
cat out.bed
chr1 100 150 TF1 1
chr1 180 230 TF2 1
chr1 250 280 TF1 1
chr1 950 1000 TF2 2
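For intuition, the grouping in step 2 can be approximated in a few lines of Python. This is only a sketch of distance-based clustering on sorted intervals, not bedtools' actual implementation, and edge cases (such as sites sitting exactly at the distance limit) may be handled differently. The function name cluster_sites is made up for illustration.

```python
def cluster_sites(sites, max_dist=500):
    """Assign a cluster ID to each (chrom, start, end, name) tuple.

    Assumes `sites` is sorted by chromosome, then by start. A site joins
    the current cluster when it is on the same chromosome and starts
    within `max_dist` bp of the furthest end seen so far in that cluster.
    """
    clustered = []
    cluster_id = 0
    current_chrom = None
    furthest_end = None
    for chrom, start, end, name in sites:
        if chrom != current_chrom or start - furthest_end > max_dist:
            # New chromosome or gap too large: open a new cluster.
            cluster_id += 1
            current_chrom = chrom
            furthest_end = end
        else:
            furthest_end = max(furthest_end, end)
        clustered.append((chrom, start, end, name, cluster_id))
    return clustered


# Same four sites as in the example input file.
sites = [
    ("chr1", 100, 150, "TF1"),
    ("chr1", 180, 230, "TF2"),
    ("chr1", 250, 280, "TF1"),
    ("chr1", 950, 1000, "TF2"),
]
for row in cluster_sites(sites):
    print(*row)
```

On the example data this gives the first three sites cluster 1 and the last site cluster 2, as in out.bed.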
3) Check the number of TF1 and TF2 sites in each cluster. You can do that in bash, Python, or even Excel. I'll do it in R in this case:
clusters=read.table("out.bed", colClasses =c("factor", "numeric", "numeric", "factor", "factor"), col.names = c("chrom", "start", "end", "TF", "clust"))
by(clusters$TF, clusters$clust, summary)
clusters$clust: 1
TF1 TF2
2 1
----------------------------------------------------------------------------------
clusters$clust: 2
TF1 TF2
0 1
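If you would rather do the counting in Python, a minimal sketch could look like this. It assumes the five-column layout of out.bed shown above (TF name in column 4, cluster ID in column 5); the function name count_tfs_per_cluster is made up, and the file content is inlined here so the snippet runs on its own. With a real file you would pass open("out.bed") instead.

```python
from collections import Counter, defaultdict


def count_tfs_per_cluster(lines):
    """Return {cluster_id: Counter({tf_name: n_sites})} from clustered BED lines."""
    counts = defaultdict(Counter)
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        tf, cluster_id = fields[3], fields[4]
        counts[cluster_id][tf] += 1
    return dict(counts)


# Content of out.bed from step 2, inlined for the example.
out_bed = """\
chr1 100 150 TF1 1
chr1 180 230 TF2 1
chr1 250 280 TF1 1
chr1 950 1000 TF2 2
"""

counts = count_tfs_per_cluster(out_bed.splitlines())
for cluster_id, tf_counts in counts.items():
    print(cluster_id, dict(tf_counts))
```

A Counter returns 0 for a TF that is absent from a cluster, so counts["2"]["TF1"] is 0 rather than an error.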
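4) Find the clusters that contain exactly two TF1 sites and one TF2 site. Once again, I'll use R, but you can use something else. Counting by TF name, rather than by position in the summary, avoids depending on the order of the factor levels.
by(clusters$TF, clusters$clust, function(x) sum(x == "TF1") == 2 && sum(x == "TF2") == 1)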
clusters$clust: 1
[1] TRUE
----------------------------------------------------------------------------------
clusters$clust: 2
[1] FALSE
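The same filter can be sketched in Python. As before, this assumes the five-column out.bed layout from step 2; clusters_matching and its wanted argument are hypothetical names, and the data is inlined so the example is self-contained.

```python
from collections import Counter, defaultdict


def clusters_matching(lines, wanted):
    """Return IDs of clusters whose per-TF site counts equal `wanted`.

    `wanted` maps a TF name to the required number of sites,
    e.g. {"TF1": 2, "TF2": 1}. Only the TFs listed in `wanted` are checked.
    """
    per_cluster = defaultdict(Counter)
    for line in lines:
        fields = line.split()
        if len(fields) >= 5:
            per_cluster[fields[4]][fields[3]] += 1
    return [
        cluster_id
        for cluster_id, tf_counts in per_cluster.items()
        if all(tf_counts[tf] == n for tf, n in wanted.items())
    ]


out_bed = """\
chr1 100 150 TF1 1
chr1 180 230 TF2 1
chr1 250 280 TF1 1
chr1 950 1000 TF2 2
"""

matching = clusters_matching(out_bed.splitlines(), {"TF1": 2, "TF2": 1})
print(matching)
```

Here only cluster 1 matches, in agreement with the R output above.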
cluster doesn't work with "parts": each of your TF binding sites is either included in a cluster or not. Also, note that the configuration you ask for (2 TF1 and 1 TF2 binding sites included in a cluster) cannot be enforced using clusters, because the only parameter you can play with is the distance window. Whether you find that configuration will ultimately depend on your data.

I guess you could try to use the cluster function, then parse the results to look for your configuration of interest. But this is probably suboptimal... Maybe a bit more information on your goals could help us provide better answers.

Oh, thank you Carlo. Well, what I intend to do is to detect clusters made from a combination of TF binding sites in a sequence. So let's say I want to detect a cluster made from 1 TF binding site A and 2 TF binding sites B; I want to search the genome sequence to detect where I can find this cluster of binding sites. According to what you're saying, the cluster method won't be optimal for this case.
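To make the "parts" limitation concrete: a 1 x A + 2 x B arrangement sitting inside a larger cluster is not reported by whole-cluster counts. One possible alternative, sketched below under my own assumptions, is to scan runs of consecutive sites and keep those that match the wanted counts within a maximum span. The names find_configurations and max_span, the TFA/TFB labels, and the coordinates are all made up for illustration, and "consecutive sites within a span" is only one of several reasonable definitions of such a configuration.

```python
from collections import Counter


def find_configurations(sites, wanted, max_span=500):
    """Report runs of consecutive sites whose TF counts equal `wanted`.

    `sites` must be (chrom, start, end, tf) tuples sorted by chrom, then start.
    A run is reported when it lies on a single chromosome and covers at most
    `max_span` bp from its first start to its furthest end.
    """
    size = sum(wanted.values())
    target = Counter(wanted)
    hits = []
    for i in range(len(sites) - size + 1):
        run = sites[i:i + size]
        if len({site[0] for site in run}) != 1:
            continue  # run crosses a chromosome boundary
        span_start = run[0][1]
        span_end = max(site[2] for site in run)
        if span_end - span_start > max_span:
            continue  # sites are too spread out
        if Counter(site[3] for site in run) == target:
            hits.append((run[0][0], span_start, span_end))
    return hits


# Made-up sites: the first four lie within 500 bp of each other and hold
# 3 TFB + 1 TFA, so a whole-cluster count would not equal "1 TFA + 2 TFB".
sites = [
    ("chr1", 100, 150, "TFB"),
    ("chr1", 180, 230, "TFB"),
    ("chr1", 250, 280, "TFA"),
    ("chr1", 300, 340, "TFB"),
    ("chr1", 950, 1000, "TFA"),
]

hits = find_configurations(sites, {"TFA": 1, "TFB": 2})
for hit in hits:
    print(*hit)
```

With these made-up sites the scan reports two overlapping runs (100-280 and 180-340) even though the enclosing group of four sites has a different composition.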