Question

count the number of transcription factor binding sites

1

10.8 years ago

Jessica ▴ 70

Hi all,

Given ChIP-Seq data of a transcription factor, what tools are used to count the number of binding sites of the transcription factor in the whole genome?

Thanks,
Jessica

sequencing • 4.8k views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 10.8 years ago by Jessica ▴ 70

0

What format is your data in? BAM? BED?

ADD REPLY • link 10.8 years ago by Ming Tommy Tang ★ 4.6k

0

It is in the bed format.

ADD REPLY • link 10.8 years ago by Jessica ▴ 70

0

So it is already a peak file. In that case, each line is a putative binding site. I do not quite understand your question; please state it more clearly.
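If the BED file really is peak-caller output, counting putative sites amounts to counting feature lines. A minimal sketch, assuming one peak per line and that any `track`, `browser`, or `#` lines are headers rather than peaks (the function name and the example records are made up for illustration):

```python
def count_bed_records(lines):
    """Count BED feature lines, skipping blanks, comments, and track/browser headers."""
    count = 0
    for line in lines:
        stripped = line.strip()
        # Header and comment lines do not describe a peak.
        if not stripped or stripped.startswith(("#", "track", "browser")):
            continue
        count += 1
    return count


# Toy records standing in for a real peak file.
example = [
    "track name=peaks",
    "chr1\t100\t200\tpeak1",
    "chr1\t500\t650\tpeak2",
    "",
    "chr2\t50\t120\tpeak3",
]
n_peaks = count_bed_records(example)
print(n_peaks)
```

For a real file, the same function can be called on an open file handle, since iterating a handle yields its lines. Note that this only counts called peaks; it says nothing about whether each peak is a genuine binding site.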

ADD REPLY • link 10.8 years ago by Ming Tommy Tang ★ 4.6k

Ram · Answer 1 · 2014-07-14

I don't think your question has a simple, definitive answer. In a perfect world, you run your ChIP-seq data (as aligned reads, BAM or BED) through a peak caller, e.g. MACS, and each region identified is a binding site, as mentioned by tangming2005.

However, the situation is typically far from perfect for a number of reasons.

The ChIP enrichment is often quite nonspecific and noisy, depending on the quality of the antibody. It is not unusual for more than 90% of the reads to fall in the background, i.e. outside peaks.
Some genomic regions tend to be enriched with whatever antibody you use (an artifact that might be due to the way the reference genome is assembled, especially with respect to repetitive regions).
Different peak callers or algorithms can report different numbers of peaks, and the difference can be orders of magnitude. The same goes for different parameters within the same peak caller.
Typically, the more deeply you sequence, the more peaks you identify, because small bumps become significant.
Even if the ChIP works perfectly and the peak callers are ideal, there might be opportunistic sites where the transcription factor binds without having much biological relevance (as an aside, possibly related: some ChIP-seq experiments generate many more peaks than there are genes in the whole genome).
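The point about most reads falling in the background can be checked directly by computing the fraction of reads in peaks. A rough sketch, assuming reads and peaks are available as `(chrom, start, end)` tuples in BED-style half-open coordinates, that peaks do not overlap each other, and that a read is assigned by its midpoint (all of these are simplifying assumptions, and the function name is made up):

```python
from bisect import bisect_right
from collections import defaultdict


def fraction_of_reads_in_peaks(reads, peaks):
    """Return the fraction of reads whose midpoint lies inside a peak."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    # Sorted peak starts per chromosome, used for binary search.
    starts = {chrom: [s for s, _ in iv] for chrom, iv in by_chrom.items()}

    total = 0
    in_peaks = 0
    for chrom, start, end in reads:
        total += 1
        if chrom not in starts:
            continue
        mid = (start + end) // 2
        # Index of the last peak starting at or before the midpoint.
        i = bisect_right(starts[chrom], mid) - 1
        if i >= 0 and mid < by_chrom[chrom][i][1]:
            in_peaks += 1
    return in_peaks / total if total else 0.0


# Toy data: two of the four reads land in the single peak.
toy_peaks = [("chr1", 100, 200)]
toy_reads = [
    ("chr1", 90, 130),
    ("chr1", 300, 336),
    ("chr2", 100, 136),
    ("chr1", 180, 216),
]
frip = fraction_of_reads_in_peaks(toy_reads, toy_peaks)
print(frip)
```

On real data this would be done on the BAM file with dedicated tools; the sketch only shows what the number means.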

In practice, you could consider as "true" binding sites the peaks that are identified in different replicates and/or that overlap a known sequence motif recognized by your transcription factor (see also the irreproducible discovery rate, IDR).
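The replicate criterion can be sketched as a plain interval overlap between two peak sets. This is a simple illustration, not IDR: it assumes peaks as `(chrom, start, end)` tuples in half-open coordinates and treats any overlap of at least 1 bp as agreement (the function name and the threshold are my assumptions):

```python
from collections import defaultdict


def reproducible_peaks(rep_a, rep_b):
    """Return the peaks of rep_a that overlap at least one peak of rep_b."""
    b_by_chrom = defaultdict(list)
    for chrom, start, end in rep_b:
        b_by_chrom[chrom].append((start, end))

    kept = []
    for chrom, start, end in rep_a:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(b_start < end and start < b_end
               for b_start, b_end in b_by_chrom.get(chrom, [])):
            kept.append((chrom, start, end))
    return kept


# Toy replicates: only the first peak of replicate A is supported by replicate B.
rep_a = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 10, 50)]
rep_b = [("chr1", 150, 250), ("chr1", 600, 700), ("chr3", 10, 50)]
shared = reproducible_peaks(rep_a, rep_b)
print(len(shared), shared)
```

The all-against-all comparison is quadratic per chromosome, which is fine for a sketch; for genome-scale peak sets, interval tools such as bedtools or BEDOPS are the usual choice.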

In my opinion, asking "Where are the binding sites?" is not fruitful because of the problems above. It is better to ask which binding sites differ between conditions (treatments, stages, tissues, whatever). This way the quirks associated with ChIP, peak callers, etc. are averaged out across replicates and conditions.

Ram · Answer 2 · 2015-07-03

5

9.8 years ago

Kamil ★ 2.3k

You might be interested to read my tutorial on how to use CENTIPEDE to determine if a transcription factor is bound to a genomic site by making use of DNase-Seq data.

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 9.8 years ago by Kamil ★ 2.3k

0

Thanks for the tutorial!

Ming

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.8 years ago by Ming Tommy Tang ★ 4.6k

Ram · Answer 3 · 2015-10-23

0

9.5 years ago

Fidel ★ 2.0k

A practical way to decide whether your peak is a true peak and not nonspecific binding is to check whether a motif associated with your transcription factor occurs at the peak. This can be done using the MEME Suite. Of course, this solution assumes that your ChIP is for a protein that binds DNA directly.

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by Fidel ★ 2.0k

0

You could use FIMO in the MEME suite to scan for motif models (JASPAR, etc.) across your genome of interest. Take the search result and convert it to a BED file. Then do set operations with BEDOPS tools (like bedmap) to find putative TF binding sites that overlap your ChIP-seq peaks.
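The conversion and overlap steps could look roughly like this in Python, as a stand-in for the BEDOPS set operations rather than a replacement for them. The sketch assumes the tab-separated `fimo.tsv` layout written by recent MEME Suite releases (columns `motif_id`, `sequence_name`, `start`, `stop`, `strand`, and so on, with 1-based inclusive coordinates) and that sequence names are chromosome names; check the header of your own file. The motif ID and all values below are toy data:

```python
import csv
import io


def fimo_tsv_to_bed(handle):
    """Convert FIMO TSV rows to (chrom, start, end, motif_id, strand) in BED coordinates."""
    # Skip blank lines and the trailing '#' comment lines FIMO may append.
    data_lines = (line for line in handle if line.strip() and not line.startswith("#"))
    sites = []
    for row in csv.DictReader(data_lines, delimiter="\t"):
        start = int(row["start"]) - 1  # 1-based inclusive -> 0-based half-open
        end = int(row["stop"])
        sites.append((row["sequence_name"], start, end, row["motif_id"], row["strand"]))
    return sites


def sites_in_peaks(sites, peaks):
    """Keep motif sites that overlap at least one peak (chrom, start, end)."""
    return [
        site for site in sites
        if any(chrom == site[0] and p_start < site[2] and site[1] < p_end
               for chrom, p_start, p_end in peaks)
    ]


# Toy FIMO-style table with a hypothetical motif ID.
header = ["motif_id", "motif_alt_id", "sequence_name", "start", "stop",
          "strand", "score", "p-value", "q-value", "matched_sequence"]
rows = [
    ["MOTIF_X", "toy", "chr1", "151", "169", "+", "20.1", "1e-08", "0.01", "ACGTACGTACGTACGTACG"],
    ["MOTIF_X", "toy", "chr1", "900", "918", "-", "15.3", "1e-06", "0.05", "ACGTACGTACGTACGTACG"],
]
tsv_text = "\n".join("\t".join(fields) for fields in [header] + rows) + "\n\n# toy comment line\n"

motif_sites = fimo_tsv_to_bed(io.StringIO(tsv_text))
bound_sites = sites_in_peaks(motif_sites, [("chr1", 100, 200)])
print(len(motif_sites), len(bound_sites))
```

The count of motif sites inside peaks is usually much smaller than the genome-wide motif count, which is the reason for intersecting the two sets in the first place.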

ADD REPLY • link 9.5 years ago by Alex Reynolds 36k