Given a set of bound regions for a transcription factor identified by ChIP-chip or ChIP-seq, how do you find the regulated target genes?
AFAIK, the method of choice for answering this question seems to be the ad hoc approach of "find the nearest gene or TSS", as implemented in the ChIPpeakAnno Bioconductor package. But since cis-regulatory elements can skip over "bystander" genes and act on non-neighboring genes, clearly something more sophisticated is needed to generate better target gene assignments.
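For concreteness, the nearest-TSS heuristic amounts to something like the following. This is only a minimal plain-Python sketch for a single chromosome, not ChIPpeakAnno's implementation; the function name `nearest_tss` and the `(position, gene)` input layout are assumptions made for illustration.

```python
from bisect import bisect_left

def nearest_tss(peak_pos, tss_list):
    """Return (gene, distance) for the TSS closest to a peak position.

    tss_list is a hypothetical input: a non-empty list of (position, gene)
    tuples for one chromosome, sorted by position.
    """
    positions = [pos for pos, _ in tss_list]
    i = bisect_left(positions, peak_pos)
    # Only the TSS just left and just right of the peak can be the nearest.
    candidates = tss_list[max(0, i - 1):i + 1]
    pos, gene = min(candidates, key=lambda t: abs(t[0] - peak_pos))
    return gene, abs(pos - peak_pos)

tss = [(1000, "geneA"), (5000, "geneB"), (20000, "geneC")]
print(nearest_tss(5600, tss))
```

Note that a peak at 5600 is assigned to geneB regardless of whether geneB responds to the factor at all, which is exactly the "bystander" problem.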
I've seen one paper that integrates (i) distance to genes with (ii) expression data from knockout studies and (iii) prior data to prioritize target genes for bound regions, but there is no code available on their webpage. This group does provide a web application to store and browse target gene assignment, but I was hoping to find additional code that does this automatically. Additional papers outlining strategies that can solve this task would be welcome as well.
I am the person guilty of the method described on the Furlong Lab web site. I could dig up the scripts for you, but I think you would be better off reimplementing it. The scripts are written in Perl, and it clearly shows that I was playing around trying to come up with a method that would work well rather than knowing up front how to go about it. What is needed is thus really a complete rewrite (perhaps in R to make it more easily usable for the array community) and not just a code cleanup.
The idea behind the method is really quite simple. You calculate a separate score for each kind of evidence for each gene and multiply the scores together. The score for the ChIP data is calculated from the distance between the gene and the closest TF binding site identified, using a sigmoid (or something similar) to assign a perfect score of 1 to genes close to a binding site, gradually dropping off to a score of 0 for genes beyond some distance. For the expression data, q-values were similarly converted to scores between 0 and 1 (I think the formula was score=1-4*qvalue).
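As a rough illustration of the idea, here is a minimal Python sketch. It is not a port of the Perl scripts: the sigmoid midpoint and steepness (5 kb and 1 kb) are made-up values, the function names are hypothetical, and clamping the expression score at 0 for q-values above 0.25 is an assumption based on the half-remembered formula above.

```python
import math

def chip_score(distance_bp, midpoint_bp=5000.0, steepness_bp=1000.0):
    """Sigmoid distance score: close to 1 near a binding site, close to 0 far away.

    midpoint_bp and steepness_bp are illustrative assumptions, not the
    original parameters. The tanh form equals 1 / (1 + exp(x)) but does
    not overflow for very large distances.
    """
    x = (distance_bp - midpoint_bp) / steepness_bp
    return 0.5 * (1.0 - math.tanh(x / 2.0))

def expr_score(qvalue):
    """Linear q-value score, 1 - 4*q, clamped to [0, 1] (clamping is an assumption)."""
    return min(1.0, max(0.0, 1.0 - 4.0 * qvalue))

def combined_score(distance_bp, qvalue):
    """Multiply the per-evidence subscores for one gene."""
    return chip_score(distance_bp) * expr_score(qvalue)

# Hypothetical genes: (distance to closest binding site in bp, expression q-value)
genes = {"geneA": (500, 0.01), "geneB": (5000, 0.10), "geneC": (200, 0.40)}
for name, (dist, q) in genes.items():
    print(name, round(combined_score(dist, q), 3))
```

Because the subscores are multiplied, a gene right next to a binding site but with no expression change (geneC here) still ends up with a combined score of 0.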
EDIT:
The original Perl scripts can be found here:
score_chip.pl (script to calculate ChIP-chip subscores)
score_expr.pl (script to calculate expression subscores)
score_comb.pl (script to calculate combined scores from the above)
They are a bit too large and ugly to put as code blocks, so I deposited them on Box.net instead.
Here is a similar method recently published in NAR: they use cross-species synteny, GO similarity between the TF and the flanking gene, and the distance between the TF and the flanking gene in a protein-protein interaction network. They then train a random-forest classifier (whatever that is, looks like a decision tree) on a manual test set using these attributes and say that it performs better than the closest-gene approach.
Two options you may want to consider are association via eQTL studies and GRAIL analysis. There are a few eQTL (expression quantitative trait locus) datasets available which link SNPs to changes in gene expression. I'm sure you could match on genomic position or find SNPs covered by your peaks. Of course, cis-regulation is cell-type and stimulus specific, so you may be in trouble if the eQTL data come from the wrong cell type. GRAIL (Gene Relationships Across Implicated Loci) is another good alternative if you want to test for commonality between implicated, close-by genes.
Now, in my opinion, coming from a molecular biology upbringing, associations based upon eQTL studies and proximity still require experimental validation before you can be sure any particular transcription factor binding site is functional. Consider knocking down/out the transcription factor or overexpressing it and assaying for expression.
Thanks for the suggestions. I agree that GWAS/eQTL+ChIP-seq may be a powerful combination of approaches in the future, though I'm afraid that the GRAIL approach may not give direct enough links between TFs and their targets.
Thanks Lars. Posting the code would be useful for us, and perhaps others as well, so if you can dig out a copy that would be much appreciated.
Scripts added as requested. The software is provided "as is" without warranty of any kind, express or implied, including the warranties of merchantability, fitness for a particular purpose, noninfringement and sanity after trying to understand it ;-)
This is great! Everything is more or less clear, and you are right - a rewrite in R would make a great project.
Am I right in saying that this method doesn't define CRMs for groups of peaks? And if one only has ChIP-seq data and no expression values, will it then essentially just be associating peaks with the nearest TSS based on distance?
Yes, that is correct - the whole point of the method was to combine ChIP and expression data.