Hi all, I'm relatively new to ATAC-seq and bioinformatics in general. I am processing an ATAC-seq dataset (have treatment and control), in particular looking to find differential peaks and then seek for TF binding sites by motif finding and JASPAR.
- Reads were trimmed, aligned to reference genome using
bowtie2
, and filtered - Peaks called using
macs2
. For instance:
macs2 callpeak --nomodel -t Sample_WT_dedup.bam -n Sample_WT --keep-dup all --gsize 11250000000 --shift -37 --extsize 73 --outdir macs2_output -B
- The above was done for both the WT (control) and treatment data. Then I tried to use
bdgdiff
to call differential peaks, following the guidelines here
macs2 bdgdiff --t1 Sample_LMN_output/macs2_output/Sample_LMN_treat_pileup.bdg
--t2 Sample_WT_output/macs2_output/Sample_WT_treat_pileup.bdg
--c1 Sample_LMN_output/macs2_output/Sample_LMN_control_lambda.bdg
--c2 Sample_WT_output/macs2_output/Sample_WT_control_lambda.bdg
--d1 15430860 --d2 23095133 -g 60 -l 120 --outdir WT_LMN_diff --o-prefix WT_LMN
Here, I have taken --d1 and --d2 to be respectively the number of tags encountered by macs2 in the respective samples (extracted from .xls
output). From what I understand, MACS2 uses these inputs for read-count normalization so the fact that I have more reads in WT than treatment samples should be corrected for.
However, the output is otherwise: only 223 peaks are enriched in the treatment (relative to WT), whilst ~10000 peaks are enriched in the WT (relative to treatment).
Furthermore, some of the log-likelihood ratios are crazy... see last column
chr2 19825317 19825545 WT_LMN_cond1_89 -632933.75000
chr2 19826135 19826362 WT_LMN_cond1_90 -647127.06250
chr2 19827419 19827635 WT_LMN_cond1_91 -683938.56250
chr2 19832164 19832295 WT_LMN_cond1_92 -2365761.25000
chr2 19832486 19832752 WT_LMN_cond1_93 -736282.37500
chr2 22930658 22930814 WT_LMN_cond1_94 -1535228.75000
Seems like there is still the effect of the amount of reads in each sample. I've looked around for information regarding this, but I haven't really found anything yet. To my understanding MACS2 bdgdiff should be able to deal with normalization itself (since it is taking the --d1 and --d2 params).
Am I correct that MACS2 does its own read count normalization? Or does this have to be a manual step ahead of time?
Would be grateful for any help :)
Stephen
I recommend against using macs2 for differential peak analysis. The point with MACS is that it is designed for ChIP-seq, where typically the summit read count is the critical parameter to be used for the analysis. In ATAC-seq, the assay creates cutting events all over the open chromatin regions. In my understanding, the summit is not always very informative about changes in accessability, as these can happen on the edges of a region. Better call peaks on both your conditions, merge and create a consensus peak list. Then use featureCounts or a similar tool to count reads over these regions and feed the count matrix into DESeq2. This of course only makes sense if you have (enough) replicates. If you have no replicates, thresholding the log2-fold change at say 1.5x may give you a rough idea in which regions things are changing upon treatment.
Hi ATpoint,
I face the same situation, and so I'll be very grateful if you could please elaborate a bit on your suggestion above. That is, what is a workable strategy for getting a rough idea of the differential peak regions in ATACseq data between two samples without any replicates. My situation is that the biologist has some prelim data where there are no replicates. Soon after this prelim analysis and getting some rough ideas about the peaks, he is going to repeat the experiment with more replicates and paired-end sequencing. Right now, I have only single-end reads, two conditions, and have the macs results with me. Wish to get some rough differential peak regions between the two conditions.
Thanks a lot for any replies!
What you can do is to count reads in the peak regions, e.g. using
featureCounts
(if you have single-end data, you should elongate the reads to a reasonable fragment size, e.g. 100bp, and then calculate the lfc (log2 fold change). Then set a threshold and call everything below -(threshold) or above +(threshold) as a candidate differentially accessable region. Depending on the treatment and the sample type (cell line or primary) you'll probably get a whole lot of regions, simply due to the technical variation (especially in regions with low counts), but maybe you have some positive control regions, in which you know that effects must/should happen, so that you can have a closer look at them.If the treatment isn't expected to affect open regions (but only transcription factor binding) then you'll be hard pressed to find differences with MACS2. The general strategy would be to use MACS2 to find open chromatin and then use footprinting (e.g., with wellington) to find TF binding sites within those open regions. My presumption is that the regions will be open regardless of the treatment, its the footprints that will differ (this may change if your TF affects the openness of a region).