Question

ATAC-seq: differential peaks using macs2 bdgdiff yields strange results

1

Entering edit mode

6.8 years ago

stephenz ▴ 10

Hi all, I'm relatively new to ATAC-seq and bioinformatics in general. I am processing an ATAC-seq dataset (have treatment and control), in particular looking to find differential peaks and then seek for TF binding sites by motif finding and JASPAR.

Reads were trimmed, aligned to reference genome using bowtie2, and filtered
Peaks called using macs2. For instance:

macs2 callpeak --nomodel -t Sample_WT_dedup.bam -n Sample_WT --keep-dup all --gsize 11250000000 --shift -37 --extsize 73 --outdir macs2_output -B

The above was done for both the WT (control) and treatment data. Then I tried to use bdgdiff to call differential peaks, following the guidelines here


    macs2 bdgdiff --t1 Sample_LMN_output/macs2_output/Sample_LMN_treat_pileup.bdg
    --t2 Sample_WT_output/macs2_output/Sample_WT_treat_pileup.bdg 
    --c1 Sample_LMN_output/macs2_output/Sample_LMN_control_lambda.bdg 
    --c2 Sample_WT_output/macs2_output/Sample_WT_control_lambda.bdg 
    --d1 15430860 --d2 23095133 -g 60 -l 120 --outdir WT_LMN_diff --o-prefix WT_LMN

Here, I have taken --d1 and --d2 to be respectively the number of tags encountered by macs2 in the respective samples (extracted from .xls output). From what I understand, MACS2 uses these inputs for read-count normalization so the fact that I have more reads in WT than treatment samples should be corrected for.

However, the output is otherwise: only 223 peaks are enriched in the treatment (relative to WT), whilst ~10000 peaks are enriched in the WT (relative to treatment).

Furthermore, some of the log-likelihood ratios are crazy... see last column

chr2    19825317        19825545        WT_LMN_cond1_89 -632933.75000
chr2    19826135        19826362        WT_LMN_cond1_90 -647127.06250
chr2    19827419        19827635        WT_LMN_cond1_91 -683938.56250
chr2    19832164        19832295        WT_LMN_cond1_92 -2365761.25000
chr2    19832486        19832752        WT_LMN_cond1_93 -736282.37500
chr2    22930658        22930814        WT_LMN_cond1_94 -1535228.75000

Seems like there is still the effect of the amount of reads in each sample. I've looked around for information regarding this, but I haven't really found anything yet. To my understanding MACS2 bdgdiff should be able to deal with normalization itself (since it is taking the --d1 and --d2 params).

Am I correct that MACS2 does its own read count normalization? Or does this have to be a manual step ahead of time?

Would be grateful for any help :)

Stephen

ATAC-seq macs2 • 8.7k views

ADD COMMENT • link updated 6.4 years ago by Alternative ▴ 290 • written 6.8 years ago by stephenz ▴ 10

1

Entering edit mode

I recommend against using macs2 for differential peak analysis. The point with MACS is that it is designed for ChIP-seq, where typically the summit read count is the critical parameter to be used for the analysis. In ATAC-seq, the assay creates cutting events all over the open chromatin regions. In my understanding, the summit is not always very informative about changes in accessability, as these can happen on the edges of a region. Better call peaks on both your conditions, merge and create a consensus peak list. Then use featureCounts or a similar tool to count reads over these regions and feed the count matrix into DESeq2. This of course only makes sense if you have (enough) replicates. If you have no replicates, thresholding the log2-fold change at say 1.5x may give you a rough idea in which regions things are changing upon treatment.

ADD REPLY • link 6.8 years ago by ATpoint 85k

0

Entering edit mode

Hi ATpoint,

If you have no replicates, thresholding the log2-fold change at say 1.5x may give you a rough idea in which regions things are changing upon treatment.

I face the same situation, and so I'll be very grateful if you could please elaborate a bit on your suggestion above. That is, what is a workable strategy for getting a rough idea of the differential peak regions in ATACseq data between two samples without any replicates. My situation is that the biologist has some prelim data where there are no replicates. Soon after this prelim analysis and getting some rough ideas about the peaks, he is going to repeat the experiment with more replicates and paired-end sequencing. Right now, I have only single-end reads, two conditions, and have the macs results with me. Wish to get some rough differential peak regions between the two conditions.

Thanks a lot for any replies!

ADD REPLY • link 6.4 years ago by PR ▴ 50

1

Entering edit mode

What you can do is to count reads in the peak regions, e.g. using featureCounts (if you have single-end data, you should elongate the reads to a reasonable fragment size, e.g. 100bp, and then calculate the lfc (log2 fold change). Then set a threshold and call everything below -(threshold) or above +(threshold) as a candidate differentially accessable region. Depending on the treatment and the sample type (cell line or primary) you'll probably get a whole lot of regions, simply due to the technical variation (especially in regions with low counts), but maybe you have some positive control regions, in which you know that effects must/should happen, so that you can have a closer look at them.

ADD REPLY • link 6.4 years ago by ATpoint 85k

0

Entering edit mode

If the treatment isn't expected to affect open regions (but only transcription factor binding) then you'll be hard pressed to find differences with MACS2. The general strategy would be to use MACS2 to find open chromatin and then use footprinting (e.g., with wellington) to find TF binding sites within those open regions. My presumption is that the regions will be open regardless of the treatment, its the footprints that will differ (this may change if your TF affects the openness of a region).

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

score 1 · Answer 1 · 2018-03-08

1

Entering edit mode

6.7 years ago

firatuyulur ▴ 320

I was working with MACS2 a lot last year and had similar issues. I cannot give a straight answer to this question but while calling my peaks from bam files I remember using -B --SPMR which asks MACS2 to generate pileup signal file of 'fragment pileup per million reads' in bedGraph format. Then you can compare the normalised bdg files. Or there are tools made for differential binding calculation like DiffBind where the tool normalises internally. I hope this helps.

ADD COMMENT • link 6.7 years ago by firatuyulur ▴ 320

0

Entering edit mode

The suggestion of using --SPMR goes against the author's own recommendations: https://github.com/taoliu/MACS/wiki/Call-differential-binding-events

ADD REPLY • link 5.8 years ago by Ian 6.1k

score 0 · Answer 2 · 2018-07-04

Differential analysis depends on raw read counts. Macs2 generates very large and diffuse peaks for ATAC-seq and chromatin marks. As such, a differentially called peak may not be real since the difference is calculated between hundreds of reads for large peaks and when you check them visually, they will be scattered over the peak lengths and not really differentially bound/targetted. I had this problem when I used MACS2 peaks to call differential events on chromatin regulators.