Question

How To Assign A Chip-Chip/Chip-Seq Peak To A Target Gene?

8

14.0 years ago

Casey Bergman 18k

Given a set of bound regions for a transcription factor identified by ChIP-chip or ChIP-seq, how do you find the regulated target gene?

AFAIK, the method of choice to answer this question seems to be the ad hoc approach of "find the nearest gene or TSS", as implemented in the ChIPpeakAnno Bioconductor package. But since cis-regulatory elements can skip over "bystander" genes and act on non-neighboring genes, clearly something more sophisticated is needed to generate better target gene assignments.
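
For reference, the nearest-TSS baseline can be sketched in a few lines of Python. This is only an illustration of the idea, not the ChIPpeakAnno implementation: the function name, the single-chromosome setup, and all coordinates and gene names are hypothetical, and strand and chromosome handling are left out.

```python
from bisect import bisect_left

def nearest_tss(peak_mid, tss_positions, tss_genes):
    """Return (gene, distance) for the TSS closest to peak_mid.

    Assumes tss_positions is a non-empty list sorted in ascending order
    and tss_genes is a parallel list of gene names.
    """
    i = bisect_left(tss_positions, peak_mid)
    # Only the TSS at the insertion point and the one just before it
    # can be the nearest one.
    candidates = []
    if i < len(tss_positions):
        candidates.append(i)
    if i > 0:
        candidates.append(i - 1)
    best = min(candidates, key=lambda j: abs(tss_positions[j] - peak_mid))
    return tss_genes[best], abs(tss_positions[best] - peak_mid)

# Hypothetical TSS coordinates on a single chromosome (illustrative only).
tss_positions = [1000, 5000, 20000]
tss_genes = ["geneA", "geneB", "geneC"]

# Hypothetical peaks as (start, end); each is assigned via its midpoint.
peaks = [(4200, 4400), (11000, 11200)]
assignments = [
    nearest_tss((start + end) // 2, tss_positions, tss_genes)
    for start, end in peaks
]
```

Note that the second peak is still assigned to geneB although it lies several kb away, which is exactly the weakness of a purely distance-based assignment.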

I've seen one paper that integrates (i) distance to genes with (ii) expression data from knockout studies and (iii) prior data to prioritize target genes for bound regions, but there is no code available on their webpage. This group does provide a web application to store and browse target gene assignment, but I was hoping to find additional code that does this automatically. Additional papers outlining strategies that can solve this task would be welcome as well.

chip-seq chip-seq target papers • 11k views

ADD COMMENT • link updated 14.0 years ago by 2184687-1231-83- ★ 5.1k • written 14.0 years ago by Casey Bergman 18k

score 10 · Answer 1 · 2011-01-10

10

14.0 years ago

Lars Juhl Jensen 11k

I am the person guilty of the method described on the Furlong Lab web site. I could dig up the scripts for you, but I think you would be better off reimplementing it. The scripts are written in Perl, and it clearly shows that I was playing around trying to come up with a method that would work well rather than knowing up front how to go about it. What is needed is thus really a complete rewrite (perhaps in R to make it more easily usable for the array community) and not just a code cleanup.

The idea behind the method is really quite simple. You calculate separate scores for each kind of evidence for each gene and multiply them up. The score for the ChIP data is calculated from the distance between the gene and the closest TF binding site identified, using a sigmoid (or something similar) to assign a perfect score of 1 for genes close to a binding site, gradually dropping off, and a score of 0 to genes beyond some distance. For the expression data, qvalues were similarly converted to scores between 0 and 1 (I think the formula was score=1-4*qvalue).
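
As a rough illustration of the scoring idea described above (not a port of the original Perl scripts), a minimal Python sketch could look like the following. The sigmoid midpoint, steepness and distance cutoff are made-up values, the function names are hypothetical, and the expression formula is the one recalled above from memory, clamped to the range 0 to 1.

```python
import math

def chip_score(distance, midpoint=5000.0, steepness=0.001, max_distance=50000.0):
    """Score a gene by its distance (bp) to the closest binding site.

    Close to 1 for nearby genes, dropping off gradually, and exactly 0
    beyond max_distance. All three parameters are illustrative guesses.
    """
    if distance >= max_distance:
        return 0.0
    return 1.0 / (1.0 + math.exp(steepness * (distance - midpoint)))

def expr_score(qvalue):
    """Convert an expression q-value to a score in [0, 1].

    Uses score = 1 - 4 * qvalue as recalled in the answer, clamped so that
    q-values of 0.25 or more give a score of 0.
    """
    return min(1.0, max(0.0, 1.0 - 4.0 * qvalue))

def combined_score(distance, qvalue):
    """Multiply the per-evidence scores into one score per gene."""
    return chip_score(distance) * expr_score(qvalue)

# Hypothetical genes: (distance to closest binding site in bp, q-value).
genes = {"geneA": (500, 0.01), "geneB": (5000, 0.0), "geneC": (80000, 0.001)}
scores = {name: combined_score(d, q) for name, (d, q) in genes.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because the scores are multiplied, a gene needs support from both kinds of evidence to rank highly: geneC has a very good q-value but scores 0 because it is beyond the distance cutoff.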

EDIT:

The original Perl scripts can be found here:

score_chip.pl (script to calculate ChIP-chip subscores)
score_expr.pl (script to calculate expression subscores)
score_comb.pl (script to calculate combined scores from the above)

They are a bit too large and ugly to put as code blocks, so I deposited them on Box.net instead.

ADD COMMENT • link 14.0 years ago by Lars Juhl Jensen 11k

1

Thanks Lars. Posting the code would be useful for us, and perhaps others as well, so if you can dig out a copy that would be much appreciated.

ADD REPLY • link 14.0 years ago by Casey Bergman 18k

0

Scripts added as requested. The software is provided "as is" without warranty of any kind, express or implied, including the warranties of merchantability, fitness for a particular purpose, noninfringement and sanity after trying to understand it ;-)

ADD REPLY • link 14.0 years ago by Lars Juhl Jensen 11k

0

This is great! Everything is more or less clear, and you are right - a rewrite in R would make a great project.

ADD REPLY • link 14.0 years ago by Casey Bergman 18k

0

Am I right in saying that this method doesn't define CRMs for groups of peaks? And if one only has ChIP-seq data and no expression values, will one then essentially just be associating peaks with the nearest TSS based on distance?

ADD REPLY • link 13.0 years ago by Ahdf-Lell-Kocks ★ 1.6k

0

Yes, that is correct - the whole point of the method was to combine ChIP and expression data.

ADD REPLY • link 13.0 years ago by Lars Juhl Jensen 11k

Ram · Answer 2 · 2012-01-11

6

13.0 years ago

2184687-1231-83- ★ 5.1k

Review paper on CRMs, not target gene assignment (thx Casey):

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1001020

and recently published:

http://bioinformatics.oxfordjournals.org/content/27/23/3221.abstract

ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 13.0 years ago by 2184687-1231-83- ★ 5.1k

1

The first paper is about CRM prediction evaluation, not target gene assignment, but the second one looks very relevant. Many thanks!

ADD REPLY • link 13.0 years ago by Casey Bergman 18k

Ram · Answer 3 · 2011-01-12

5

14.0 years ago

Maximilian Haeussler ★ 1.7k

Here is a similar method recently published in NAR: they use cross-species synteny, GO similarity between the TF and the flanking gene, and the distance between the TF and the flanking gene in a protein-protein interaction network. They then train a random-forest classifier (whatever that is; it looks like a decision tree) on a manual test set using these attributes, and say that it performs better than the closest-gene approach.

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 14.0 years ago by Maximilian Haeussler ★ 1.7k

0

I don't see a link to the code in the paper. Is it available somewhere, or on request from the authors?

ADD REPLY • link 13.0 years ago by Ahdf-Lell-Kocks ★ 1.6k

score 2 · Answer 4 · 2011-05-10

Two options you may want to consider are association via eQTL studies and GRAIL analysis. There are a few eQTL (expression quantitative trait locus) datasets available, which link SNPs to changes in gene expression. I'm sure you could use genomic position, or find SNPs covered by your peaks. Of course, cis-regulation is cell-type and stimulus specific, so you may be in trouble if the eQTL data come from the wrong cell type. GRAIL (Gene Relationships Across Implicated Loci) is another good alternative if you want to test for commonality between implicated, close-by genes.
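
To make the eQTL idea concrete, here is a minimal Python sketch that collects the genes linked to eQTL SNPs falling inside each peak. The tuple layouts, the function name and all coordinates are assumptions for illustration. Real eQTL tables come in different formats and would need parsing first, and a plain nested scan like this is only reasonable for small inputs.

```python
def genes_linked_by_eqtl(peaks, eqtls):
    """Map each peak to the sorted list of genes with an eQTL SNP inside it.

    peaks: list of (chrom, start, end), with inclusive coordinates.
    eqtls: list of (chrom, snp_position, gene) associations.
    """
    links = {}
    for chrom, start, end in peaks:
        genes = {
            gene
            for snp_chrom, pos, gene in eqtls
            if snp_chrom == chrom and start <= pos <= end
        }
        links[(chrom, start, end)] = sorted(genes)
    return links

# Hypothetical peaks and eQTL associations (illustrative only).
peaks = [("chr1", 100, 200), ("chr2", 500, 600)]
eqtls = [
    ("chr1", 150, "geneX"),
    ("chr1", 180, "geneX"),
    ("chr1", 250, "geneY"),  # outside the chr1 peak
    ("chr2", 500, "geneZ"),  # on the peak boundary, counted as inside
]
links = genes_linked_by_eqtl(peaks, eqtls)
```

A peak that ends up with an empty gene list simply has no eQTL support in the chosen dataset, which says nothing about whether it is functional in your cell type.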

Now, in my opinion from a molecular biology upbringing, associations based upon eQTL studies and proximity still require experimental validation before you can be sure any particular transcription factor binding site has function. Consider knocking down/out the transcription factor or overexpressing the transcription factor and assaying for expression.