Hi!
I am really confused and need some expert advice, with reasoning, from people who work in NGS data analysis.
Ok, the PROBLEM is:
I performed peak calling using MACS and got nice peaks (the antibody used detects an ETS factor). I then annotated each peak to its single nearest gene within a window of 25 kb. This was followed by de novo motif discovery, and I also scanned these regions for motifs of interest using PWMs. Now, I get a motif for the ETS factor through de novo discovery, and this motif also pops up when scanning the regions using the PWM (it is present in 85% of peaks). Lastly, if I look at the annotated genes and query for the genes having this motif, I get a list of 220 genes.
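To make the annotation step concrete, here is a minimal sketch of assigning a peak to its single nearest gene within 25 kb. It assumes distance is measured from the peak summit to the gene's TSS, which is only one possible convention; the function name, the tuple layout, and the gene names are hypothetical and not taken from any particular tool.

```python
def nearest_gene(peak_chrom, peak_summit, genes, window=25_000):
    """Return the name of the closest gene within `window` bp, or None.

    `genes` is an iterable of (name, chrom, tss) tuples. Measuring from
    the peak summit to the TSS is an assumption made for this sketch.
    """
    best_name, best_dist = None, None
    for name, chrom, tss in genes:
        if chrom != peak_chrom:
            continue  # only genes on the same chromosome are candidates
        dist = abs(tss - peak_summit)
        if dist > window:
            continue  # outside the annotation window
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    return best_name


# Hypothetical toy annotation, for illustration only.
toy_genes = [
    ("GENE_A", "chr1", 1_000),
    ("GENE_B", "chr1", 30_000),
    ("GENE_C", "chr2", 1_200),
]
print(nearest_gene("chr1", 5_000, toy_genes))
```

Peaks with no gene inside the window return None, so they drop out of the gene list rather than being forced onto a distant gene.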
My QUESTION:
Compared to other genes, are these 220 genes definitely being regulated (directly or indirectly) by these peaks, and might they be more interesting than the others? If yes, how would you back this up from a bioinformatic and a biological angle? What more would you do in this scenario?
Thank you
Background sets were used in both cases of motif discovery (de novo as well as scanning). Please give your take now.
As Ian wrote in a different answer, it would be helpful if you had access to relevant expression data (microarray or RNA-seq). There is a tool, Rcade (http://www.bioconductor.org/packages/2.11/bioc/html/Rcade.html), that attempts to connect ChIP peaks to gene expression using a probabilistic model. (Of course, it is also possible to do more straightforward versions of this analysis yourself.) You could use a tool like GREAT, mentioned by Ido, or the ChIP-seq significance tool (http://encodeqt.stanford.edu/hyper/) to check for correlations with ENCODE data. Perhaps you could also search for other motifs that co-occur with your ETS peaks; these could be binding sites for potential interacting TFs.
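One straightforward version of the expression analysis is to ask whether your motif-containing target genes are overrepresented among differentially expressed genes, for example with a one-sided hypergeometric test. The sketch below is only an illustration of that idea and is not what Rcade does; the function names are hypothetical, and it assumes you already have a gene universe (all genes assayed), a list of differentially expressed genes, and your 220 targets as plain lists of identifiers.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, n_de, n_targets, n_overlap):
    """P(overlap >= n_overlap) when n_targets genes are drawn without
    replacement from universe_size genes, n_de of which are
    differentially expressed (one-sided hypergeometric tail)."""
    total = comb(universe_size, n_targets)
    upper = min(n_de, n_targets)
    tail = sum(
        comb(n_de, k) * comb(universe_size - n_de, n_targets - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def target_enrichment(universe, de_genes, target_genes):
    """Return (overlap size, p-value) for target genes among DE genes.

    Genes absent from `universe` are ignored, so both lists are
    compared on the same background.
    """
    universe = set(universe)
    de = set(de_genes) & universe
    targets = set(target_genes) & universe
    overlap = de & targets
    p = hypergeom_enrichment_p(len(universe), len(de), len(targets), len(overlap))
    return len(overlap), p
```

A small p-value would suggest that the targets are enriched among genes that respond to the factor, which supports a regulatory link but does not prove it, and it cannot distinguish direct from indirect regulation. The choice of universe matters: restricting it to genes that were both assayed for expression and eligible for peak annotation avoids inflating the significance.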