Question

Сalculating fold-enrichment of ChIP-seq peaks intersecting with promoters (vs. genome average)

18

Entering edit mode

10.3 years ago

chemcehn ▴ 210

Hello,

I need to calculate for my BED file with ChIP-seq peaks fold-enrichment of genomic regions containing or intersecting with promoters (versus genome average). How to do this?

I have tried GENOMATIX, which has this option, but it seems that it calculates fold-enrichment incorrectly (does not take into account different lengths of my genomic regions). I have also tried GREAT and CHIP-ENRICH, but what they both calculate is something like the distribution from each peak's midpoint to the nearest gene, which is not what I need.

Do you know any software to do this task?

PS. I do not need to calculate the read enrichment. I need to calculate the enrichment of peaks contained in my BED file, which intersect with promoters, versus the amount of peaks of this size which would intersect with promoters by chance based on genome-average probability to encounter a promoter.

Like this: My reads contain XX% of reads intersecting with promoters. It is expected that one would get YY% of reads of this length intersecting with promoters by chance, and therefore, the enrichment of promoters is equal to XX/YY-fold in comparison with what one would get by chance. So, I am looking for the solution of this problem. I am pretty sure it is realized already in some software, please advise where!

Thanks

sequencing next-gen ChIP-Seq genome gene • 19k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.3 years ago by chemcehn ▴ 210

Ram · Answer 1 · 2015-01-16

18

Entering edit mode

10.3 years ago

Ryan Dale 5.0k

There are several existing options:

bedtools jaccard (command line; bp-based)
bedtools fisher (command line; interval-based)
IntervalStats (command line)
permutation testing with pybedtools (Python)
Genometricorr (R package)
poverlap (Python/command line)
GAT (Python/command line)
regioneR (R)

Also check out The dilemma of choosing the ideal permutation strategy while estimating statistical significance of genome-wide enrichment which gives a lot of good insight and has a lot of references to read up on.

For what it's worth, I've tried many of these (haven't tried GAT or regioneR) and have settled on IntervalStats. Since it's an asymmetric metric, it has the potential to give more biological insight. And hopefully it avoids bias from permutation strategy choice.

(edit: added GAT, and clarified that I haven't tried it yet)

(edit 2: added bedtools fisher as an interval-based alternative to the bp-based bedtools jaccard, as per @brentp's comment on another answer)

(edit 3: added regioneR as per @bernatgel's answer)

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 10.3 years ago by Ryan Dale 5.0k

1

Entering edit mode

Nice suggestions, Ryan. @brentp and I have been discussion the IntervalStats approach from Chikina and Troyanskaya, as I have been interested in efficient ways to calculate the denominator of their P-value that is computed for each interval. Because I am lazy, how efficient is the existing implementation? Have you tried to implement this in pybedtools?

ADD REPLY • link 10.3 years ago by Aaronquinlan 12k

1

Entering edit mode

I haven't tried to implement in pybedtools yet. It's been a while since I've run it, but I remember that one pairwise comparison took about as long as a 1000 iteration permutation test in pybedtools. I haven't looked closely at the actual implementation though, so I have no idea if/how much it can be improved.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Ryan Dale 5.0k

1

Entering edit mode

Also, GAT has been suggested as another approach that uses simulation:

@aaronquinlan not mentioned but GAT does this http://t.co/Gz94iW9EOv using simulation
— Nicholas E Ilott (@IlottNick) January 16, 2015

ADD REPLY • link 10.3 years ago by Aaronquinlan 12k

0

Entering edit mode

Thanks, didn't know about GAT. I added it to the list in the answer, and I'll have to try it out.

ADD REPLY • link 10.3 years ago by Ryan Dale 5.0k

0

Entering edit mode

thanks for plugging poverlap. it needs some love, but it is pretty flexible and can handle a lot of different null models. I wrote it while we were working on the paper you linked to.

ADD REPLY • link 10.3 years ago by brentp 24k

0

Entering edit mode

@Ryan, Thank you very much for such an extensive list! I will try bedtools jaccard first.

@Aaron, Is there a way to call promoters for a given genome through bedtools, or do I need to create a file with promoters myself?

ADD REPLY • link 10.3 years ago by chemcehn ▴ 210

Ram · Answer 2 · 2015-01-16

EDIT: After the OP edited his/her question my answer is no longer relevant, anyway in case someone is interested here it is...:

I have a script localEnrichmentBed.py which might be useful for you. It calculates the enrichment (as log2 fold change) and significance (p value from Fisher test) in target regions, promoters in your case, relative to the regions flanking the targets. So it doesn't use the genomic background but a local background.

I don't know what you want to do downstream but using the whole genome as background (effectively normalizing for library size) is going to give bias results read coverage can be quite uneven.

This is from the program's help:

localEnrichmentBed.py -h
usage: localEnrichmentBed.py [-h] --target TARGET --bam BAM --genome GENOME
                             [--slop SLOP] [--blacklist BLACKLIST]
                             [--tmpdir TMPDIR] [--keeptmp] [--verbose]
                             [--version]

DESCRIPTION
    Compute the read enrichment in target intervals relative to local background.

    Typical use case: A ChIP-Seq experiment on a sample returns a number of regions
    of enrichment. We want to know how enriched these regions are in a *different*
    sample. Note that enrichment is quantified relative to the local background
    not relative to an input control.

    See also localEnrichmentScore.R to combine replicates and compare
    treatment vs control.

OUTPUT:
    bed file with header and columns:
1. chrom
2. start
3. end
4. targetID
5. flank_cnt
6. target_cnt
7. flank_len
8. target_len
9. log10_pval
10. log2fc

EXAMPLE
    localEnrichmentBed.py -b rhh047.bam -t rhh047.macs_peaks.bed -g genome.fa.fai -bl blacklist.bed > out.bed

    ## Using pipes:
    samtools view -u rhh047.bam chr18 \
    | localEnrichmentBed.py -b - -t rhh047.macs_peaks.bed -g genome.fa.fai > out.bed


Useful tip: Get genome file from bam file:

    samtools view -H rhh047.bam \
    | grep -P "@SQ\tSN:" \
    | sed 's/@SQ\tSN://' \
    | sed 's/\tLN:/\t/' > genome.txt

REQUIRES:
    - bedtools suite
    - numpy, scipy

NOTES:
For PE reads, the second read in pair is excluded (by samtools view -F 128)
since coverageBed double counts pairs.

optional arguments:
  -h, --help            show this help message and exit
  --target TARGET, -t TARGET
                        Target bed file where enrichment is to be computed.
                        Use - to read from stdin.

  --bam BAM, -b BAM     Bam file of the library for which enrichment is to be
                        computed. Use - to read from stdin.

  --genome GENOME, -g GENOME
                        A genome file giving the length of the chromosomes.
                        A tab separated file with columns <chrom> <chrom lenght>.
                        NB: It can be created from the header of the bam file (see tip above).

  --slop SLOP, -S SLOP  Option passed to slopBed to define the flanking region (aka background).
                        If `int` each target will be extended left and right this many bases.
                        If `float` each target is extended left and right this many times its size.
                        E.g. 5.0 (default) extends each target regions 5 times its length left and right.

  --blacklist BLACKLIST, -bl BLACKLIST
                        An optional bed file of regions to ignore to compute
                        the local background. These might be unmappable regions with 0-counts which would
                        inflate the target enrichment.

  --tmpdir TMPDIR       Temp dir to use for intermediate files. If not set
                        python will get one. A subdir will be created here.

  --keeptmp             If set, the tmp dir is not deleted at the end of the
                        job (useful for debugging).

  --verbose, -V         Print to stderr the commands that are executed.

  --version             show program's version number and exit

Ram · Answer 3 · 2015-06-08

4

Entering edit mode

9.9 years ago

bernatgel ★ 3.4k

Although this is an old thread, since it includes a thorough list of tools, I'll include another one we recently released.

If you are using R, you can use regioneR, that will perform the process sketched by @TriS in another answer. It uses a permutation test approach, so it's not the fastest way to do that, but does not need to rely in modeling and since it implements different randomization strategies and accepts arbitrary genome masks, the test can be tailored to your specific needs. In addition to testing for the overlap between regions, it can test for the distance, the % of overlap or actually any custom function you feed to it.

For example, if you have your peaks and promoters in data.frame's you can do something like:

library(regioneR)
pt <- overlapPermTest(A=peaks, B=promoters)
plot(pt)

or, if you want to evaluate the size of your peaks, you can define a function returning the mean height of the peaks overlapping the promoters and use it in your test:

meanHeight <- function(A, B, ...) {
  overlapping.A <- subsetByOverlaps(A, B)
  return(mean(overlapping.A$height))
}
pt <- permTest(A=peaks, B=promoters, evaluate.function=meanHeight, randomize.function=randomizeRegions, ntimes=100)
plot(pt)

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 9.9 years ago by bernatgel ★ 3.4k

0

Entering edit mode

bernatel,

Since you mention regioneR - is there a discussion forum where users can contact regioneR developers? I like the permutation strategy but it is extremely slow. I have been running randomizeRegions() function for more than 10min [not finished yet] with no result coming up. The GRange files has >70,000 regions. If just one iteration of region randomization takes so much time I don't think it's feasible to do even more than 10 permutations but that is not enough really to have any statistical power.... Or else is it possible that the function is simply stuck without me even realizing it? As a test I have run randomizeRegions() on GRange file of 10 regions and it yielded a result relatively quickly.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 9.8 years ago by AlexAbdulkaderKheirallah ▴ 120

0

Entering edit mode

@bio.abdul.kader

You can find the email address of the package maintainer (in this case, me) in the pakage page at http://www.bioconductor.org/packages/regioneR. You could also write a message in bioconductor's support forum (https://support.bioconductor.org/) with a "regioneR" tag on it. In any case, regarding your question, we had a performance problem with the randomization process and it could take quite long to randomize large region sets. With the latest version in Bioconductor devel (to be released on October 2015) we have improved the performance of the randomization process 10 to 100x depending on the situation. In any case, to speed up things you could try with circularRandomizeRegions, which is MUCH faster, or set non.overlapping=FALSE in randomize regions, which would also speed up the computation quite a bit. In any case if you have any other problem you can write an email and we'll try to help you in using regioneR.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by bernatgel ★ 3.4k

Ram · Answer 4 · 2015-01-16

A pretty precise way of doing it is to bootstrap random genomic regions and get the % that overlaps with TSS regions, (edit - it's pretty similar to the approach the Ryan wrote, if you have fun coding, go for it)...to do this you:

take the size of your set i.e. 100 regions with your xx% intersection
randomly sample from the genome groups of 100 regions of the same size (bedtools allows you to do so) (with replacement)
calculate the % of those that overlap and save value in array
repeat 1000 or 10.000 times
plot the distribution of this (expected) vs the xx% intersection that you have from your initial regions (this will be a histogram for the expected and 1 vertical line for the observed values)
calculate the empirical pvalue as: ((number elements in simulated data >= xx%) +1 / ((number sampling) + 1) (the +1 is to avoid dividing by zero)

to get your "fold enrichment" take the average of the simulated data vs xx%

or you could just use hypergeometric distribution, but you won't have the distribution of the simulated data

Ram · Answer 5 · 2015-01-16

I did not mean to calculate the read enrichment. I meant to calculate the enrichment of promoters in my reads. Like this:

My reads contain XX% of reads intersecting with promoters. It is expected that one would get YY% of reads of this length intersecting with promoters by chance, and therefore, the enrichment of promoters is equal to XX/YY-fold in comparison with what one would get by chance. I am looking for the solution of this problem.

Ram · Answer 6 · 2015-01-25

0

Entering edit mode

10.3 years ago

chemcehn ▴ 210

I have tried bedtools jaccard -- it is great, except that it does not do exactly what I need. It calculates the enrichment of basepairs intersecting in my BED file and in the BED file with promoters.

However, what I need the enrichment of regions in my BED file intersecting (having at least one bp overlap) with promoters. Could you please advise with respect to this? Thanks!

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 10.3 years ago by chemcehn ▴ 210

1

Entering edit mode

try bedtools fisher. as of the last release it is interval based.

ADD REPLY • link 10.3 years ago by brentp 24k