Question

Annotating Chip Seq: How To Get Enrichment Over Random Background

8

Entering edit mode

12.8 years ago

Ying W ★ 4.3k

I was using HOMER and bedtools (annotateBED with tables from UCSC) to annotate chip-seq peaks and make diagrams where you have X peaks in promoter/intron/exon/upstream/dowstream/etc. However, I would like to create some 'random' reads in the genome to compare against to see if there is enrichment in my chip-seq experiment versus random reads in the genome.

I would like to know 3 things:

Is there a program out there that would spit out random regions of the human genome?
Have there been statistical tests developed specifically for this comparison?
What are some of the other annotation programs that people use?

For the 1st point, I know it is not hard to write something that would dump out 100-mers across the genome randomly (since i am using +/-50bp from chip-seq peak) but the better way to do this would be to take into account mappibility and gc bias right? I am using 36bp Illumina reads but I'm not really looking for a Illumina read simulator. I don't really want to reinvent the wheel since I'm sure somebody out there has already solved this and was hoping someone here would point me towards the right direction.

For the 2nd point, I was thinking of using a t-test for each one of the categories (upstream/downstream/exon/promoter/etc) and correcting for multiple testing. So I would compare # of 100bp windows overlapping promoter identified by chip-seq vs the random ones that I simulated, am I oversimplifying things?

For my last point, I know other programs such as CESA exist and I was wondering what people use / think of these different programs for annotating and what people do when a peak is near two genes (say promoter of one gene and downstream of another gene) do you double count the peak?

chip-seq annotation • 7.8k views

ADD COMMENT • link updated 12.8 years ago by Ian 6.1k • written 12.8 years ago by Ying W ★ 4.3k

score 4 · Answer 1 · 2012-03-22

These are really good questions!

Point 1:

In the past i have pulled out regions of a standard size from the genome and filtered them to only keep mappable regions (using UCSC CRG mappability tracks with an appropriate length, in my case 50bp to match 50bp reads from a SOLiD machine). This is not perfect though, as it does not take into account GC content and other sequence specific factors.

An alternative is to get regions that more specifically mirror the ones from your experiment. [I have just thought of this - so beware] [GimmeMotifs]2 has a script generate_background_sequences.py) that takes the coordinates of your regions and returns sequence that is "genome-matched", i.e. comes from a similar distance to a gene. However, it returns sequence, not coordinates, so a BLAT or something similar would be required to get back to coordinates.

Point 2:

I wouldn't use a T-test as it is a parametric test assuming a normal distribution of data and requires sufficient replicates. I usually use a Z-score, i.e. observed overlap of regions, compared to the expected (using large number of random sets of regions).

There is the Genome Hyperbrowser that can perform this type of calculation for you, but personally i find the selection of a null hypothesis a little confusing.

Point 3:

GREAT does an excellent job of association genes to binding regions (choice of three models, but does not output the associations; the results are used for gene function analyses, e.g. GO term enrichment.

In the past i have used GALAXY to fetch the closest RefSeq transcript TSS or edge to each binding region.

Recently our group has developed a script RnaChipIntegrator to associate genes to binding regions using a variety of different parameters, e.g. top 4 genes in proximity to a custom length promoter. (The RNAChIP portion is historical as the script can be used to associate ChIP-seq regions to genes, RNA-seq data to ChIP-seq, etc).

I hope some of this long ramble helps!