Question

Peak Calling with MACS (treat and control samples)

1

Entering edit mode

9.6 years ago

ukpersia1 ▴ 20

I would like to use MACS for differential peak calling.

I have a set of data (GSM947524) (cancer data as treatment data) and was wondering if I can use (GSM947411) as control for it? or should I use (GSM947523) as control?

Whole data sets can be accessed by Query DataSets for GSE38685 (which includes 12 samples)

I have the following command line for peak calling:

macs14 -t treat.sam -c control.sam -g hs -n cancer -w -S

where "treat" is the cancer file GSM947524 and "control" is the one that I'm confused about!

ChIP-Seq peak-calling MACS • 14k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by ukpersia1 ▴ 20

0

Entering edit mode

Just a suggestion: Since the data you are looking at is H3k4me3, one would expect broad peaks. It is good to run macs whenever you are dealing with epigenome data, with --nomodel parameter and if you know the fragment size, with --shiftsize equal to half the fragment size (usually its ~200 for 40-50bp reads).

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by poisonAlien ★ 3.2k

1

Entering edit mode

I do not think this is a good advice, especially for beginner. First, H3K4me3 is not really the broadest histone mark there is, I think point peak calling should work. If it doesn't, the new version of MACS2 has a --broad parameter which should combine the nearby point peaks. Saying it is good to run macs [..] with --nomodel parameter just like that is quite irresponsible, in my opinion, especially to a beginner.

Namely, these parameters are automatically estimated when running MACS in the default mode. If you provide them manually you are basically saying "I believe I can do better than MACS at estimating them". It is a strong assumption. You need to be extremely confident about the inner workings of MACS to do this in my opinion. Particularly, one needs to know is the MACS2 model in the first place, what --shiftsize* is for, what is the fragment length, how to figure it out, and once it is known, how to set the appropriate parameters of the software accordingly. Most importantly you would need to justify your choice of parameters, verify that the assumptions you make are actually correct, and be aware how that influences the resulting p-values. I, personally, would not feel too confident about doing this.

* Actually, shiftsize is now removed from the latest MACS version, in favour of --extsize which should be set to twice the length of the old shift size (see the --help).

P.S. How does one do inline code blocks like the author of the comment? Backticks don't work :(

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Saulius Lukauskas ▴ 540

1

Entering edit mode

The MACS parameters Dr. Liu suggested for H3K4me3 ChIP-Seq as below. I have tried different parameters for our H3K4me3 ChIP-Seq data, and found that the parameters Dr. Liu suggested are best. However, which parameters are better heavily depends on our data quality and data structure. Actually, it's case by case. Please see the detail of this great paper and other related papers.

Feng JX, Liu T, Qin B, Zhang Y, Liu XS. 2012. Identifying ChIP-seq enrichment using MACS. Nature Protocols 7:1728-1740.

macs14 -t BROAD_GM12878_H3K4me3.bam -c BROAD_GM12878_H3K4me3_Control.bam -g hs -n BROAD_GM12878_H3K4me3 --nomodel --shiftsize 73 -B -S --call-subpeaks

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Gary ▴ 480

0

Entering edit mode

The 'case per case' bit is precisely what I meant -- sticking with defaults at first is always a good shout. I believe we might be starting to confuse the OP, but, since this thread is starting to become a complete reference on MACS peak calling I will add that Roadmap Epigenome also use --nomodel and manual --ext size for their data calling. They estimate the fragment length using strand cross correlation (link, section c).

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Saulius Lukauskas ▴ 540

0

Entering edit mode

I would like to add this reference as well from the same author: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3120977/

Take a look at the section "Critical Parametsrs and troubleshooting", where they explain suggested argumenst for epigenome data. Anyhow, along with --nomodel and --shiftsize, you can also increase the --mfold from default 10,30 to 10,100 (or even higher if you are looking for super enhancers using H3K27Ac) since most of the histone modifications are highly accumulated and generally cross the default threshold of 30, which would be missed by MACS for model building.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by poisonAlien ★ 3.2k

0

Entering edit mode

Hi poisonAlien, may I have a question here? Now, I use ROSE (https://bitbucket.org/young_computation/rose) to identify super-enhancers based on H3K27ac or Med1 ChIP-Seq. I don’t know MACS can identify super-enhancers, too. Could you tell me how to manipulate it? Many thanks.

ADD REPLY • link 9.6 years ago by Gary ▴ 480

2

Entering edit mode

MACS does not identify super enhancers. It just gives you peaks (or just "regular enhancers"). You need to stich those closely spaced peaks and rank them according to their intensity, which is what ROSE does. You first call peaks with MACS, then use the called peaks (e.g: H3K27Ac.bed) as input to rose.

ROSE_main.py -g hg19 -i H3K27Ac_peaks.bed -r H3K27Ac.bam -o outPut_dir -c Input.bam

You may want to add -t 2000 if you want to remove those peaks overlapping with tss. Again I am no expert, but this is what I follow.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by poisonAlien ★ 3.2k

0

Entering edit mode

Thanks. It is my misunderstanding.

ADD REPLY • link 9.6 years ago by Gary ▴ 480

0

Entering edit mode

Thanks very much for your input to this. I am so glad that this website exist :) at least for beginners like me is a massive help.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by ukpersia1 ▴ 20

Ram · Answer 1 · 2015-04-07

As I already described in your previous post with this issue (peak calling for cancer and healthy data?), I believe the best solution is to use the input sample (GSM947411) and do the peak calling twice. That's what I would do.

EDIT: Just to reiterate this quickly, you need to distinguish between biological control and signal (normal cell line v. cancer cells) from technical control and signal (background noise levels v. supposed signal levels).

I think you are slightly confusing what peak callers like MACS do. In the essence, peak callers are just tools to distinguish between signal and noise regions. The input they take is just a set of x versus y measurements and output x coordinates that have significantly large y values. They have no biological assumptions encoded between themselves, just the assumptions about how the background noise levels are distributed. Since this noise is not uniformly distributed across the x axis (i.e. genome), MACS needs the input sample (which they call 'control') to estimate the per-base-pair noise levels. Once these levels are estimated, one can proceed to look at y coordinates that are significant.

Now your question is: "do the peaks differ between normal and cancer cell lines?". Translated to the x-y analogy above, it would be 'do the x coordinates where y is significantly large for normal cell line and the x coordinates where y is significantly large for cancer cell line differ?'. In this language it is perhaps clearer why MACS need to be run twice, and that it does not answer that question for you*. Hope this helps.

* Well, the standard options don't. As I mention in your previous post (peak calling for cancer and healthy data?) there are bdgcmp and diffpeaks options in MACS that might be helpful.

Ram · Answer 2 · 2015-04-07

Both samples (i.e. GSM947411 & GSM947523) can be your control. Which one is better depends on your biological question and experimental design. In the beginning (i.e. in 2007), people can use only one sample without Input (control) to identify peaks. Today, more and more people use replicate samples with Input or IgG control to identify peaks. No one can tell you what method is absolute better, although what Saulius Lukauskas said is the mainstream strategy people use today. In fact, only people really understand the biological background, usually it's you, can suggest a better experimental design and a better control sample. In practice, you can try both controls to determine with one is better based on whether their results are consistent with your biological phenomena. I hope this might be helpful.

Ram · Answer 3 · 2015-04-07

0

Entering edit mode

9.6 years ago

GouthamAtla 12k

MACS can be run without control sample. From MACS manual:

-t/--treatment FILENAME
This is the only REQUIRED parameter for MACS.

Read the section 'Control sample' from the Encode paper to understand exactly what is control sample and its uses.

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by GouthamAtla 12k

0

Entering edit mode

Yes indeed. but if I want to use a control sample which of the above samples is suitable?

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by ukpersia1 ▴ 20