Question

ChIP-Seq Analysis Using MACS At A P-value Of 1E-2 Then Intersecting To Call "True Peaks"

12.4 years ago

jhrf ▴ 10

I am currently analysing ChIP-seq data for 4 different proteins in order to build up a picture of the correlations between them across the C. elegans genome. Essentially, I want to see where each protein's binding sites overlap with those of the others.

So far I have called peaks on all of my data sets (which include biological and technical replicates). I am now browsing the data before I start comparing peak sets to find correlations (overlaps, intersections, etc.).

Some of my data is quite noisy, so to get the best out of it I ran MACS2 with a relatively lenient p-value threshold (5e-2) and then kept only the peaks that are confirmed across technical and biological replicates, hoping to filter out noise and wrongly called peaks at this step. Empirically it seems to have worked, and I am seeing sensible results. However, this is my first solo bioinformatics project, and I wanted to check whether this is a sensible method.

Is anyone able to recommend a better method? Is my MACS2 cutoff too lenient? Can anyone point me to papers that detail methods for this sort of thing? I bow to the greater knowledge and wisdom of this community. Many thanks.
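
For readers who want to see the replicate-confirmation step concretely, here is a minimal Python sketch. It is not the poster's actual pipeline: the peak coordinates are made up, and "confirmed" is assumed to mean "overlaps at least one peak by at least one base in every other replicate".

```python
from typing import List, Tuple

# (chromosome, start, end), half-open coordinates as in BED files
Peak = Tuple[str, int, int]


def overlaps(a: Peak, b: Peak) -> bool:
    """True if two peaks are on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def confirmed_peaks(reference: List[Peak], replicates: List[List[Peak]]) -> List[Peak]:
    """Keep reference peaks that overlap at least one peak in every other replicate."""
    return [
        peak
        for peak in reference
        if all(any(overlaps(peak, other) for other in rep) for rep in replicates)
    ]


# Hypothetical peak sets for three replicates of one protein
rep1 = [("chrI", 100, 300), ("chrI", 1000, 1200), ("chrII", 50, 150)]
rep2 = [("chrI", 250, 400), ("chrII", 60, 120)]
rep3 = [("chrI", 90, 110), ("chrII", 140, 200), ("chrI", 1100, 1150)]

print(confirmed_peaks(rep1, [rep2, rep3]))
# → [('chrI', 100, 300), ('chrII', 50, 150)]
```

The all-against-all comparison is quadratic and only suitable for illustration; on real peak files a dedicated interval tool such as bedtools intersect is the usual choice.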

macs peak-calling calling • 6.2k views

updated 12.4 years ago by alessandro.riccombeni ▴ 20 • written 12.4 years ago by jhrf ▴ 10

score 3 · Answer 1 · 2013-07-02

KCC ★ 4.1k

Instead of using a p-value of 0.05, why not use a q-value of 0.05? I think 0.05 is quite lenient for a p-value cutoff in MACS.
I would also suggest using IDR. It assesses the reliability of peaks based on their reproducibility across replicates: https://sites.google.com/site/anshulkundaje/projects/idr
Do you keep duplicate reads? MACS has a setting to keep just one read per position. You should use it; it can help reduce sensitivity to noise.
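
As a rough illustration of the two settings suggested above, the sketch below assembles a MACS2 `callpeak` command line from Python. The file names and the helper `macs2_callpeak_args` are hypothetical; the flags are the MACS2 options for a q-value cutoff (`-q`), duplicate handling (`--keep-dup`), and the C. elegans effective genome size shortcut (`-g ce`). The input format is assumed to be BAM.

```python
from typing import List


def macs2_callpeak_args(treatment: str, control: str, name: str,
                        qvalue: float = 0.05, keep_dup: str = "1") -> List[str]:
    """Build the argument list for `macs2 callpeak` (hypothetical helper)."""
    return [
        "macs2", "callpeak",
        "-t", treatment,         # ChIP sample
        "-c", control,           # input control
        "-f", "BAM",             # assumed input format
        "-g", "ce",              # C. elegans effective genome size shortcut
        "-n", name,              # prefix for output files
        "-q", str(qvalue),       # q-value cutoff instead of a raw p-value cutoff
        "--keep-dup", keep_dup,  # keep at most one read per position
    ]


# Hypothetical file names
args = macs2_callpeak_args("chip_rep1.bam", "input_rep1.bam", "proteinA_rep1")
print(" ".join(args))

# To actually run it (requires MACS2 on the PATH):
# import subprocess
# subprocess.run(args, check=True)
```

Building the argument list separately from running it makes it easy to log the exact command used for each replicate.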

Here are some papers:
Systematic evaluation of factors influencing ChIP-seq fidelity. Nat Methods 2012; 9(6):609-614.
Identifying ChIP-seq enrichment using MACS. Nat Protoc 2012; 7(9):1728-1740.
Measuring reproducibility of high-throughput experiments. Ann Appl Stat 2011; 5(3):1752-1779.

Thanks for your comment. I am making my way through the papers you recommend.

Are there any studies on the advantages of using a p-value over a q-value? I think my methods will come under a fair amount of scrutiny, and I'd love to have something solid to back them up.

12.4 years ago by jhrf ▴ 10

From what I recall, the author of MACS, Tao Liu, recommended the q-value over the p-value. You can join the MACS mailing list and ask him directly about this. My guess is that the q-value is more empirical, as it's based on the number of false positives in the input control, while the p-value is based on a model of the data that is probably too simple.
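
To make the p-value versus q-value distinction concrete, here is a generic Benjamini-Hochberg sketch. As far as I understand the MACS2 documentation, its q-values are derived from p-values by this kind of procedure, but the code below is only an illustration with made-up p-values, not MACS2's implementation.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each q-value
    # is never larger than the q-value of any less significant peak.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Made-up p-values for six candidate peaks
pvals = [0.001, 0.01, 0.03, 0.04, 0.05, 0.5]
qvals = benjamini_hochberg(pvals)

passing_p = sum(p <= 0.05 for p in pvals)
passing_q = sum(q <= 0.05 for q in qvals)
print(passing_p, passing_q)
# → 5 2
```

With the same nominal cutoff of 0.05, five of the six toy peaks pass on raw p-value but only two pass after correction, which is why a q-value cutoff is the stricter choice when many candidate regions are tested.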

12.4 years ago by KCC ★ 4.1k

score 1 · Answer 2 · 2013-07-02

Hi jhrf, if this is your first "solo" project I recommend starting by looking at what other people have done. The IDR test suggested by George is a great way to start, as it has been recommended by the ENCODE project itself. You should probably start by reading this: http://www.ncbi.nlm.nih.gov/pubmed/22955991

And then go through the method linked by George, and verify the results you get from your data. You might try to set q at 0.05 and 0.01 and compare the results.

Also, try to settle on (if you haven't already) a quantitative definition of "overlap" for your peaks. Peaks from different replicates located in the same promoter could have relatively distant summits, and the summit is where the binding site is likeliest to be, i.e. you could be merging different binding sites.
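
One possible quantitative definition, sketched below, is to match peaks between replicates only when their summits lie within a fixed distance of each other. The 100 bp threshold and the summit positions are arbitrary assumptions for illustration; MACS2 writes summit positions to a `NAME_summits.bed` file, which could supply the real input.

```python
from typing import List, Tuple

Summit = Tuple[str, int]  # (chromosome, summit position)


def matched_summits(rep_a: List[Summit], rep_b: List[Summit],
                    max_distance: int = 100) -> List[Tuple[str, int, int]]:
    """Pair each summit in rep_a with the nearest summit in rep_b on the
    same chromosome, if one lies within max_distance bases."""
    matches = []
    for chrom_a, pos_a in rep_a:
        candidates = [
            pos_b for chrom_b, pos_b in rep_b
            if chrom_b == chrom_a and abs(pos_b - pos_a) <= max_distance
        ]
        if candidates:
            nearest = min(candidates, key=lambda pos_b: abs(pos_b - pos_a))
            matches.append((chrom_a, pos_a, nearest))
    return matches


# Made-up summit positions for two replicates
rep_a = [("chrI", 1000), ("chrI", 5000), ("chrX", 200)]
rep_b = [("chrI", 1040), ("chrI", 5400), ("chrX", 950)]

print(matched_summits(rep_a, rep_b, max_distance=100))
# → [('chrI', 1000, 1040)]
```

Here the two chrI peaks near position 5000 might well overlap as intervals, yet their summits are 400 bp apart, so under this definition they are treated as distinct binding sites.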