Question

Is there a reasonable way to rank peaks that are common between replicates?

3

4.3 years ago

Aspire ▴ 370

I intend to find common peaks in ChIP-Seq data, defined as peaks that exist in any pair of the 3 replicates I have. There are 3 possible pairs, so peaks that overlap by at least 1 base in any of those pairs would be considered valid.

If peak A and peak B overlap between two replicates, I will define the final "peak" as the union between the area that peak A spans and the area that peak B spans.
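
The definition above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the post: peaks are `(chrom, start, end)` tuples in half-open coordinates, and the replicate lists and the function names `overlaps` and `pairwise_union_peaks` are made up for the example.

```python
from itertools import combinations

def overlaps(a, b):
    """True if two (chrom, start, end) peaks share at least one base (half-open coordinates)."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def pairwise_union_peaks(replicates):
    """Return the union interval of every overlapping peak pair taken from any two replicates."""
    merged = []
    for rep_a, rep_b in combinations(replicates, 2):
        for a in rep_a:
            for b in rep_b:
                if overlaps(a, b):
                    merged.append((a[0], min(a[1], b[1]), max(a[2], b[2])))
    return sorted(set(merged))

# Hypothetical toy peaks for three replicates.
rep1 = [("chr1", 100, 200), ("chr1", 500, 600)]
rep2 = [("chr1", 150, 250)]
rep3 = [("chr1", 590, 700), ("chr2", 100, 200)]

final_peaks = pairwise_union_peaks([rep1, rep2, rep3])
print(final_peaks)
```

The nested loops are quadratic, so for real peak sets an interval tool such as bedtools would be the more practical route. A peak present in all three replicates yields several overlapping unions here, which would need one more merge step.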

The issue is that I would like to have some way to rank the final peaks, at least to decide which peaks are "the best peaks".

Each final peak is a combination of at least two peaks; while the original peaks have meaningful p-values and q-values, I don't think their combination would be meaningful for the combined peak.

Can you suggest a meaningful way to rank the combined peaks?

ChIP-Seq • 2.7k views

ADD COMMENT • link updated 5 months ago by ATpoint 87k • written 4.3 years ago by Aspire ▴ 370

1

I thought of something (perhaps it is very naive).

What I am interested in is a crude measure of rank, so that the (biologist) researcher can look at the "best" peaks and see whether there are interesting genes among the top-ranking peaks. Those best genes will be further validated (perhaps with ChIP-PCR on the original samples) to see whether the interesting results are true.

It seems to me that I do not need a well-defined statistical framework for high-ranking peaks. So, perhaps I will simply:

1) Sort the peaks in each sample by p-value.
2) Assign a rank to each peak based on its p-value.
3) Divide each rank by the total number of peaks in that replicate (so that ranks from different replicates are on the same scale).
4) When two peaks overlap, take the mean of their scaled ranks as the final "rank" of the combined peak.
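
A minimal sketch of this procedure, assuming the p-values of each replicate are available as plain lists and that the overlapping pairs have already been found. The p-values, the `overlapping` index pairs and the function name `normalized_ranks` are hypothetical, and ties are not handled.

```python
def normalized_ranks(pvalues):
    """Scaled rank per peak: rank / n_peaks, where rank 1 is the smallest p-value."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    ranks = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank / n
    return ranks

# Hypothetical p-values for the peaks of two replicates.
rep1_p = [1e-10, 1e-3, 1e-6, 0.01]
rep2_p = [1e-4, 1e-8]

r1 = normalized_ranks(rep1_p)
r2 = normalized_ranks(rep2_p)

# Hypothetical overlaps: (peak index in replicate 1, peak index in replicate 2).
overlapping = [(2, 0), (0, 1)]

# Mean scaled rank per combined peak; a smaller value means a better peak.
combined = sorted(
    ((i, j, (r1[i] + r2[j]) / 2) for i, j in overlapping),
    key=lambda item: item[2],
)
print(combined)
```

Because the ranks are divided by the number of peaks, a replicate with many weak peaks does not push its strong peaks down relative to a replicate with few peaks.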

ADD REPLY • link 4.3 years ago by Aspire ▴ 370

0

I added a point 4) inspired by this.

ADD REPLY • link 4.3 years ago by ATpoint 87k

0

Man, I've been facing this question so many times and I still don't have a satisfying answer.

ADD REPLY • link 4.3 years ago by dariober 15k

1

4.3 years ago

steve ★ 3.5k

Maybe I am missing something, but isn't this what DiffBind does?

https://www.bioconductor.org/packages/devel/bioc/vignettes/DiffBind/inst/doc/DiffBind.pdf

Been using it for years to detect differentially bound peaks across groups and replicates for ChIP-Seq.

ADD COMMENT • link 4.3 years ago by steve ★ 3.5k

0

My question had to do with looking at only one condition, and deciding what are the peaks common to all replicates at that condition. I think that DiffBind has to do with comparing peaks across conditions.

ADD REPLY • link 4.3 years ago by Aspire ▴ 370

score 9 · Accepted Answer · 2020-12-07

You would need a statistical framework. I can think of several options:

1) Use a dedicated replicate-aware peak caller:

PePr: easy to use, accepts replicates, and returns statistics per peak which (from what I understand) respect the variability between replicates. It needs an input sample though.
Genrich: actually developed for paired-end data, but can be tweaked to accept single-end reads if you know the average insert size. It does not strictly need an input sample and combines statistics per peak using Fisher's method. It works well in my hands for ATAC-seq; I did not test it for ChIP-seq. The method is currently unpublished.
CHIP-R: a very new method I stumbled upon on Twitter. It seems to be a pipeline that somehow assesses replicates in terms of reproducibility. I have had no time yet to look into it, so feel free to post feedback if you try it. Edit 2024: Seems to be abandonware, with no commits in years on GitHub and critical issues that never received any feedback.
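
To illustrate what combining per-peak statistics with Fisher's method means in general, here is a small sketch. It is not Genrich's code, and the function name is made up. It assumes independent p-values that are all greater than zero, and it uses the closed form of the chi-square survival function for an even number of degrees of freedom, so it needs only the standard library.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine k independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null, whose survival function is
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Hypothetical p-values of one peak region in three replicates.
print(fisher_combined_pvalue([1e-4, 1e-3, 0.02]))
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check of the formula.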

2) Naive intersections:

Intersect a) with b) and the output with c), then take the overlaps. That is on the one hand stringent, as peaks have to be present in all three sets, but on the other hand relaxed, as it does not consider the variability and only checks for binary presence. Once you have this, call peaks on the merged BAM files, intersect them with the peak list you just created, and then rank those called peaks that overlap the intersection set by their p-values (or any other metric from the narrowPeak file). This is probably the weakest of all the methods I describe here. It is rather thinking aloud, probably not very reliable, and quite tedious.
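
As a rough sketch of these steps, assuming peaks are `(chrom, start, end)` tuples in half-open coordinates and the peaks called on the merged BAM carry a p-value as a fourth field. All data and the function name `intersect` are hypothetical; in practice this would more likely be done with bedtools on the narrowPeak files.

```python
def intersect(set_a, set_b):
    """Return the overlapping segments between two lists of (chrom, start, end) peaks."""
    out = []
    for chrom_a, start_a, end_a in set_a:
        for chrom_b, start_b, end_b in set_b:
            start, end = max(start_a, start_b), min(end_a, end_b)
            if chrom_a == chrom_b and start < end:
                out.append((chrom_a, start, end))
    return out

# Hypothetical peaks of replicates a), b) and c).
rep_a = [("chr1", 100, 200), ("chr1", 500, 600)]
rep_b = [("chr1", 150, 250), ("chr1", 550, 650)]
rep_c = [("chr1", 180, 300), ("chr1", 580, 700)]

# Intersect a) with b), then the output with c).
shared = intersect(intersect(rep_a, rep_b), rep_c)

# Hypothetical peaks called on the merged BAM: (chrom, start, end, pvalue).
merged_calls = [
    ("chr1", 90, 310, 1e-20),
    ("chr1", 480, 710, 1e-30),
    ("chr2", 10, 50, 1e-40),
]

# Keep merged-BAM peaks that overlap the shared set, then rank by p-value.
ranked = sorted(
    (
        peak for peak in merged_calls
        if any(peak[0] == c and peak[1] < e and s < peak[2] for c, s, e in shared)
    ),
    key=lambda peak: peak[3],
)
print(ranked)
```

The `chr2` peak has the smallest p-value but is dropped, because it is not supported by all three replicates.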

3) Use IDR:

The Irreproducible Discovery Rate (IDR) framework (here a tutorial) was developed to check for consistency in ChIP-seq data, but it only accepts two replicates at a time (n=2). So you could take the two replicates with the best data quality and then use them to create a confidence peak set.

4) A rank-based meta-analysis.

https://cran.r-project.org/web/packages/RobustRankAggreg/index.html might be an option. Rank the peaks of each sample by a metric of choice, then use this package to prioritize them based on rank-consistency.
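
To give a feel for what rank-consistency can mean, here is a simplified Python sketch based on order statistics of scaled ranks. It is not a reimplementation of RobustRankAggreg (which is an R package and may differ in details such as multiple-testing correction). The function names and the example ranks are made up, and every peak is assumed to have a scaled rank in (0, 1] in each replicate.

```python
from math import comb

def order_statistic_pvalue(x, k, n):
    """P(the k-th smallest of n independent Uniform(0,1) values is <= x)."""
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(k, n + 1))

def rho_score(scaled_ranks):
    """Smallest order-statistic p-value over the sorted scaled ranks of one peak.

    A low score means the peak ranks near the top in more replicates
    than expected if its ranks were random.
    """
    ranks = sorted(scaled_ranks)
    n = len(ranks)
    return min(order_statistic_pvalue(ranks[k - 1], k, n) for k in range(1, n + 1))

# Hypothetical scaled ranks (rank / number of peaks) of two peaks in three replicates.
consistent_peak = [0.01, 0.02, 0.9]   # near the top in two of three replicates
mediocre_peak = [0.5, 0.6, 0.9]       # unremarkable everywhere

print(rho_score(consistent_peak), rho_score(mediocre_peak))
```

Unlike a plain mean of ranks, this score tolerates one replicate in which the peak ranks poorly, which fits the "present in any pair of replicates" definition from the question.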

As always it depends: What is the final analysis goal?