Question

Computational Validation Of Chip-Seq Peaks

4

11.2 years ago

daniel.soronellas ▴ 330

Hi all,

I have a ChIP-seq dataset for a TF (narrow peaks), without replicates. The peaks were called with good quality using input DNA as control. QUESTION: Is there a statistical way to validate the final list of peaks in the downstream analysis, so as to filter out peaks that are called as statistically significant but whose enrichment, when inspected by eye in a browser, looks like background?

Below is an example of what I'm describing, in a 76 kb region:

enter image description here

In the first pair of tracks, one can see the enrichment of the TF (red) versus the input (black), highlighted with blue boxes. But in the second pair of tracks, the peak caller marks just the small peak in the center of the blue box (also marked by another track in black, shown as a small profile just above the peak). The statistical parameters used to call the peaks are pvalue < 10e-5 and FDR < 1%.

chip-seq • 5.4k views

ADD COMMENT • link 11.2 years ago by daniel.soronellas ▴ 330

0

I have read in papers about validating peaks by PCR, or by finding motifs in the overall list of peaks, but my concern is to find a method that would re-filter the peaks that are too close to background and keep only peaks that are both visible and statistically significant.

ADD REPLY • link 11.2 years ago by daniel.soronellas ▴ 330

0

I don't think computational validation can be done. ChIP-seq is a technique to assay genome-wide protein occupancy, and the only way to validate it is to use another tool to generate a dataset that also gives information on protein occupancy (such as ChIP-qPCR). IMO the question you are asking boils down to 'I ran several peak callers but some of the regions they found are not believable by eye; is there any way to avoid them?' The typical answers would be to a) try another peak caller (if you want a very conservative one, you could try Sole-Search) or b) use different cutoffs. From your response, you did something similar to b), where you made a special filter rule, but the disadvantage of doing something like this is that it might seem arbitrary and be hard to justify. One thing you could try is to overlap the regions found by the different peak callers and call that the 'concordant set' or something.
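As a rough illustration of the 'concordant set' idea, here is a minimal Python sketch. It assumes each caller's peaks have already been loaded as `(chrom, start, end)` tuples with half-open coordinates; the function names and the example coordinates are made up for illustration, and the pairwise scan is only suitable for small lists (for genome-scale lists a dedicated interval tool would be the usual choice).

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def concordant_set(reference, *others):
    """Keep peaks from `reference` that overlap a peak in every other list."""
    return [
        peak
        for peak in reference
        if all(any(overlaps(peak, q) for q in other) for other in others)
    ]


# Made-up peak lists standing in for the output of three peak callers.
macs = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 150)]
homer = [("chr1", 250, 400), ("chr2", 60, 120)]
pyicos = [("chr1", 120, 280), ("chr1", 1100, 1300), ("chr2", 140, 200)]

concordant = concordant_set(macs, homer, pyicos)
print(concordant)
```

Which list is used as the reference changes the reported coordinates, so it may be worth stating that choice when describing the concordant set.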

ADD REPLY • link 11.2 years ago by Ying W ★ 4.3k

0

Daniel, just a question, unfortunately not about the original topic: how did you draw these signal profiles? I think they are wig signal profiles, but I don't know how to make such figures without using e.g. the UCSC browser export function. Best regards!

ADD REPLY • link 10.2 years ago by nx68 • 0

score 3 · Answer 1 · 2013-10-01

3

11.2 years ago

Sukhi Singh 11k

You already statistically validated them after calling peaks.

I had a chat with Dr. Liu (Shirley Liu), author of MACS14, about this. If you are using MACS14 (it looks like it from the pval and FDR cutoffs) with input as control, which is generally a symmetric distribution of captured noise, take the top 30K peaks if you have that many; otherwise, if you know how your protein should bind, take the corresponding number of peaks.

One of the files returned by MACS14 reports the fold enrichment; you can apply a threshold on it manually to see how many peaks pass.
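A minimal Python sketch of that thresholding step is given below. It assumes a tab-delimited peak table with a header row, `#` comment lines, and a column named `fold_enrichment`; check the column name against your own MACS output before relying on it. The function name and the three example rows are made up for illustration.

```python
import csv
import io


def filter_by_fold_enrichment(lines, min_fold, column="fold_enrichment"):
    """Return the rows whose fold enrichment is at least `min_fold`.

    `lines` is any iterable of text lines (an open file works);
    blank lines and lines starting with '#' are skipped.
    """
    rows = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return [row for row in reader if float(row[column]) >= min_fold]


# Made-up table standing in for a peak file; real files have more columns.
example = io.StringIO(
    "# made-up example\n"
    "\n"
    "chr\tstart\tend\tfold_enrichment\n"
    "chr1\t100\t300\t25.3\n"
    "chr1\t1000\t1200\t4.1\n"
    "chr2\t50\t150\t12.0\n"
)

# See how many peaks pass at a few candidate thresholds.
for cutoff in (5, 10, 20):
    example.seek(0)
    print(cutoff, len(filter_by_fold_enrichment(example, cutoff)))
```

Printing the number of surviving peaks at several cutoffs, as in the loop, shows how sensitive the final list is to the chosen threshold.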

Try using another input or a mock (using a mock will reduce the number of peaks) and look at the overlap.

Try another peak caller.

ADD COMMENT • link 11.2 years ago by Sukhi Singh 11k

0

Thanks for your answer!

So far what I did:

I tried 3 peak callers: MACS (1.4), HOMER (4.3) and Pyicos (1.2), playing a bit with the parameters and also running in default mode. Although 80% of the peaks are perfectly clear and match properly, I still find these tiny peaks.
I could filter them by fold enrichment, but I was more interested in a way to add statistics to my filtering rather than guessing from how the peak looks in the browser. And for the control I only have input DNA.
In my final list of peaks I also counted the TF and INPUT reads, then normalized by the total mapped reads and computed a fold change (FC) between them, but I was not sure whether this is conclusive as a final filtering step.
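The normalization described in the last point can be sketched in Python roughly as follows. The pseudocount is an assumption added here (not part of the description above) to avoid dividing by zero when the input has no reads in a peak; the function name and the example numbers are made up.

```python
def normalized_fold_change(chip_count, input_count, chip_total, input_total,
                           pseudocount=1.0):
    """Fold change of ChIP over input after scaling to reads per million mapped.

    `pseudocount` is an assumption of this sketch: it keeps the ratio
    defined when the input count in the peak is zero.
    """
    chip_rpm = (chip_count + pseudocount) * 1e6 / chip_total
    input_rpm = (input_count + pseudocount) * 1e6 / input_total
    return chip_rpm / input_rpm


# Made-up numbers: 99 TF reads and 9 input reads in one peak,
# with 20 million and 10 million mapped reads in total.
fc = normalized_fold_change(99, 9, 20_000_000, 10_000_000)
print(fc)
```

A ratio like this is still a descriptive filter rather than a test, so it carries the same caveat as thresholding on the caller's own fold enrichment.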

ADD REPLY • link 11.2 years ago by daniel.soronellas ▴ 330

score 0 · Answer 2 · 2013-10-01

I think I came up with a possible solution to this:

First find the peaks, however many you obtain with a stringent cut-off (e.g. pvalue < 10e-5 and FDR < 1%).
Get the peak summits and make a window of 100 bp around the center.
Then compute the complementary regions (background) using a script or any tool (like complementBed), filter out HDR (high duplication regions) and divide each interval into bins of 100 bp.
Using the background bins, count the reads of the TF sample and then obtain the median read count + standard deviation.
Use the resulting number as a cut-off, so that peaks whose "height/2" is less than the calculated cut-off may be filtered out of the list.
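The last two steps can be sketched in Python as below. This assumes the TF read counts per 100 bp background bin and a "height" per peak (taken here as the read count in the 100 bp summit window) have already been computed by some other tool; the use of the sample standard deviation, the function names, and all numbers are assumptions for illustration.

```python
import statistics


def background_cutoff(bin_counts):
    """Median of the background bin counts plus their sample standard deviation."""
    return statistics.median(bin_counts) + statistics.stdev(bin_counts)


def split_peaks(peaks, cutoff):
    """Split (name, height) pairs into kept and flagged names.

    A peak is flagged when height / 2 falls below the background cutoff.
    """
    kept, flagged = [], []
    for name, height in peaks:
        (flagged if height / 2 < cutoff else kept).append(name)
    return kept, flagged


# Made-up read counts for ten background bins of 100 bp.
background = [2, 3, 1, 4, 2, 3, 2, 5, 3, 2]
cutoff = background_cutoff(background)

# Made-up peak heights (reads in the 100 bp summit window).
peaks = [("peak_a", 40), ("peak_b", 6), ("peak_c", 9)]
kept, flagged = split_peaks(peaks, cutoff)
print(round(cutoff, 2), kept, flagged)
```

Flagged peaks are returned separately rather than dropped, so they can be loaded as their own browser track and checked by eye before being removed.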

The solution is based on the hypothesis that lower peaks may have their midpoint ("height/2") close to the general background of the sample. To illustrate this, I am uploading a snapshot of this filtering (blue boxes indicate peaks that were initially detected and then filtered out using this method):

enter image description here

The first track shows the initially detected peaks.
The second track marks the filtered peaks.
The third and fourth tracks show the sample profiles.

Because I'm not sure about the statistical relevance of this method, I would welcome feedback.

For TF datasets, however, it seems to work. I will try it with a few more samples and see the results.