I have ChIP-seq data in triplicate (3 treatment samples and 3 controls, each of the 6 with its own input control). I identified the peaks with macs14 for each sample against its input control. Then I performed differential binding analysis with DiffBind, which produced a set of merged peaks (the peak consensus). Now I would like to proceed with motif discovery using MEME-ChIP or RSAT.
What is the most methodologically sound way to run motif discovery on the merged peaks (the peak consensus)?
AFAIK MEME-ChIP and RSAT expect relatively narrow sequences around the summit, whereas DiffBind merges peaks and so produces longer peak sequences. Should I de-merge the consensus peaks? Or merge all treatment samples into a single sample and call the peaks from it? My uncertainty stems from the fact that motif discovery tools seem to expect a single sample, rather than a set of replicates each introducing some noise and variation in the peak location (and some lacking some of the peaks altogether).
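To make the "narrow summit sequence" idea concrete, here is a minimal sketch of one way to turn replicate summits into a single fixed-width window per consensus peak. It assumes you have a summit position for the peak in each replicate; the function name `consensus_window`, the choice of the median summit as centre, and the 50 bp half-width are illustrative assumptions, not something prescribed by DiffBind, MEME-ChIP or RSAT.

```python
from statistics import median


def consensus_window(summits, half_width=50):
    """Return a half-open (start, end) window centred on the median summit.

    summits    -- summit positions (bp) of one peak, one per replicate
    half_width -- bases kept on each side of the centre (assumed value)
    """
    centre = int(median(summits))
    start = max(0, centre - half_width)  # do not run off the chromosome start
    end = centre + half_width
    return start, end


# Hypothetical summit positions of one consensus peak in three replicates.
print(consensus_window([1040, 1055, 1062]))
```

The resulting windows all have the same length regardless of how wide the merged peak was, which is the property the motif tools seem to want.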
I think you may need to consider what you know or assume about the protein first. Do you know or assume that the protein you ChIP'd binds DNA directly, and if so, do you think it binds a specific motif, rather than, say, a more nebulous stretch of DNA enriched for some nucleotide composition (CpG, for example) that isn't a motif per se? If the protein binds DNA directly and binds a specific motif, it will be either bound or unbound in the various replicates. If it binds another factor, isn't constrained by a particular motif, or is mobile (it could bind to a nucleosome, for example), you could expect differences among conditions to show up both in bound vs unbound status and in where the protein is bound. I think that if you assume the protein binds a specific motif and is constrained by it, then it may not matter how you find the motif; your methodology may only alter how many motifs are discovered. The motif should exist in the data from the individual replicates as well as in the merged peaks, because it is the motif, after all, that directs binding. However, if the protein can be expected to shift position within a particular underlying window, then the intersection of the merged peaks may well represent the DNA between two bound locations that isn't itself actually bound, if that makes sense?
Have you checked the Irreproducible Discovery Rate (IDR)?
Thanks, I did - but this is not what I asked. I used DiffBind precisely in order to take care of the variation (i.e. in lieu of IDR). What I asked is: I have peaks that have been reproduced, i.e. they overlap in multiple samples. Yet they are not absolutely identical: consensus peaks are built from the overlapping peaks and are, essentially, the UNION of the overlapping peaks. So the question is: should I search for a motif in the consensus peaks (i.e. in the UNION of the overlapping peaks) or in their INTERSECTION? Or should I, perhaps, split the consensus peaks into their constituent peaks? Submitting the consensus or the intersection is the least hassle, but I wonder what the most methodologically sound solution would be.
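To be explicit about the two options, here is a minimal sketch of what I mean by UNION and INTERSECTION for one consensus peak. It assumes the replicate peaks are given as half-open (start, end) pairs on the same chromosome and that they already form a single overlapping cluster; the coordinates and the function name `union_and_intersection` are made up for illustration.

```python
def union_and_intersection(peaks):
    """Return (union, intersection) for one cluster of overlapping peaks.

    peaks -- list of half-open (start, end) intervals, assumed to form a
             single overlapping cluster on one chromosome.
    The intersection is None when no base is shared by all peaks.
    """
    starts = [start for start, _ in peaks]
    ends = [end for _, end in peaks]
    union = (min(starts), max(ends))
    inter_start, inter_end = max(starts), min(ends)
    intersection = (inter_start, inter_end) if inter_start < inter_end else None
    return union, intersection


# Three replicate peaks that all share a common core.
print(union_and_intersection([(1000, 1400), (1100, 1500), (1050, 1350)]))

# Three peaks that only overlap pairwise: the union is long, and there is
# no region common to all three.
print(union_and_intersection([(1000, 1200), (1150, 1400), (1350, 1600)]))
```

The second case shows why the choice matters: the union can be much wider than any single peak, while the intersection can be empty.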