Hi, I have two ChIP-seq bed files of same transcription factor. I want to compare one specific binding motif between these two ChIP-seq binding peaks, i.e. whether the motif is more enriched in the binding sites of one ChIP-seq vs. the other (under different culture condition)?
The plots I want to make is by X axis: "Distance from centre of ChIP region (bp)" and the Y axis: " No. motif / no. binding sites"
Thank you for your answers!
Xiaoyong
R will plot signal. You just need to specify what signal you want to plot.
Thanks, Alex. My question is actually is which software can do the motif enrichment analysis for my purpose. i.e. estimate how many specific motifs per binding site surrounding the center of peak region (-500bp ~ 500bp). Then plot it to visualize the difference between different ChIP-seq data sets.
You could use FIMO to call binding sites, running the BED output through BEDOPS bedmap to count sites over 1kbase windows of interest. Then use R to plot.
Thanks, Alex. Now I have two bed files, one is the FIMO output bed file defining the specific TF binding regions; other one is the ChIP-seq peak bed file containing the intervals (~300bp) that is used for calculation -- how many binding motifs (also location) within each interval. I want to plot a histogram showing the distribution of the numbers of motifs across a 1Kb window of these centered peak intervals. According to your suggestion, could you write me a script that runs in bedops and R. Many thanks! -Xiaoyong
I'll do the simplest possible example for counting:
This simply counts binding sites that overlap peaks. Note that overlapping peaks may result in double counts. (Peaks are not 1kb windows; I'm just showing the simplest use of the
--count
operator.)To make a simple histogram, I think this would suffice (untested):
You could read
?hist
in R for more details.If you want a 1kb window around midpoints, you would calculate the midpoints of the peaks (say, with
awk
) and pipe that result intobedmap
:Use
-
inbedmap
to represent the standard output fromawk
(the peak midpoints).Add
--range 500
tobedmap --count
to generate a 1kb window around peak midpoints (500 bases up- and downstream of the midpoints). Counts are generated from the 1kb windows.Then make the histogram in R with the new counts and
hist()
Note that overlapping peak windows may result in double counts.
To get the location of binding sites that overlap the midpoints, use
--echo-map
instead of--count
, change the map delimiter to a newline, and pipe tosort-bed
:You could bring this file into a UCSC Genome Browser instance, IGB, or other viewer.
Thank you for your prompt reply. Your commands work very well. However, I am sorry that I didn't express my question well. The plot I want to make is: X-axis is the distance to peak summit (i.e., -500 to 500bp), and the Y-axis is the "number of motifs per number of peaks". In that way, we can appreciate the change of motif density across the distance to the ChIP-seq summit in a 1kb window. Thank you! -Xiaoyong
Then you need to make a set of bins and count across them.
The following uses
bedops --chop 10
to make 10-base wide bins across a 1kb-wide window, which is centered on each peak midpoint. You can change how large or how small the bins are by adjusting the--chop
value.The bins are each fed to
bedmap --count
to count the number of binding sites that overlap the bin:Each 1kb window will contain 100 10-base bins, so the file
countsOfBindingSitesOverBins.bed
will look something like:You have the coordinates of the bins in the first three columns, the count of binding sites across the bin in the fourth column, and the bin ID in the fifth column.
Each 1kb window will contain 100 10-base bins, so this output will have IDs that look like
bin-0
throughbin-99
. A bin containing a peak midpoint should be labeledbin-50
, in this case.It's unlikely you care about the coordinates of the bins, so you can use
cut
to grab the fourth and fifth columns:From here, you can use
read.table()
in R to read in the text filecountsPerBin.txt
. You can make a bar plot withggplot2
or similar:The x-axis shows the bin ID value and the y-axis shows the mean count of binding sites that overlap that bin.
Thank you so much for this comprehensive instruction. I will try and let you know if I have more questions. Best, -Xiaoyong