Question

Best Practices for ATACseq for Performing Differential Accessibility Analysis

1

Entering edit mode

6 months ago

newbioinfograd2323 ▴ 10

Hello,

I have a question about best practices for performing differentially accessibility analysis for ATACseq data. I am a recent college graduate working in my first bioinformatics lab, so I apologize in advance if my terminology or wording is confusing/not fully correct.

I am analyzing bulk ATACseq data from treated and control samples isolated at 3 different timepoints. I have 3 replicates per condition for each of the 3 timepoints, so there are 18 samples total. We have previously performed bulk RNA-seq on these same samples, which showed fairly robust differential gene expression based on treatment. Given this, my initial hypothesis was that we would also find differentially accessible regions dependent on treatment corresponding to the changes in gene expression. However, after my first pass of the data, I found very low (<50) to no differentially accessible regions between treated and control samples (or even between timepoint).

As background on how I processed the samples:

I processed the data via the nf-core ATACseq pipeline (standard processing).
I then used MACS3 to call peaks for each of the samples, which identified anywhere from ~35k to ~75k significant peaks per sample. From those sample-specific peaks,
I then used bedtools to merge peaks from all 18 samples to generate a list of ~300k consensus peaks present in at least one of the 18 samples.
I then used the csaw package to generate a matrix of raw counts present in the regions defined by these 300k peaks.
With this raw counts matrix for all 18 samples, I performed differential accessibility testing between treated and control samples using DESeq2. However, like I mentioned above, there were very few to no consensus peaks for several of the contrasts between treated and control samples that I performed.

I am thinking of reasons as to why so few differentially accessible regions were identified (the robust differential gene expression between treatment/control leads me to believe that we should also observe at least some modest differentially accessibility). My first thought is that, because the consensus peak list is so large (300k peaks) for the 18 samples, the large number of tests being conducted means the FDR would make it "difficult" to identify statistically significant differentially accessible regions.

Secondly, when I merged peaks across the 18 samples to obtain the 300k consensus peak list, I used a 1bp overlap to identify a peak as shared between any two samples. So because I took the whole range of peaks within that overlap to define the region of the "shared peak", some of the shared peaks might be quite large. I am not sure if this is true, but I would imagine that there is a lot of "noise" or random signal the further away you are from the summit of the peak. So basically, if a merged peak had a 3k bp window, there could be a lot of random noise in that huge window that could affect the ability to call differential peaks between samples.

Given that reasoning, I am thinking about how to revise this analysis. One idea I had is to merge called peaks for just the specific samples I am interested in doing DA analysis for, and run DESeq2 on that smaller consensus peak list. The second idea would be to first identify peaks shared between all 3 replicates for a given timepoint and condition. And then taking this consensus peak list for each of the 6 groups of 3 replicates and merging them (which would also reduce the total # of peaks, I think).

I would appreciate if anybody has feedback on my reasoning and has suggestions/insights for how I might approach this data. Again, I apologize if I was not clear in describing the data and my thoughts. I can provide my code if that is helpful.

differential-accessibility csaw DESeq2 ATACseq • 1.5k views

ADD COMMENT • link updated 5 weeks ago by Picasa ▴ 690 • written 6 months ago by newbioinfograd2323 ▴ 10

1

Entering edit mode

Your reasoning makes sense to me. I do something similar. Usually I am stringent on per-sample peak calls by fording a minimum peak width and FDR cutoff, and then I only keep peaks shared between at least two/three/... (depends on group size) samples. That massively reduces the total number of peaks and helps with multiple testing burden. It also eliminates a lot of noisy/nonsense peaks.

ADD REPLY • link 6 months ago by ATpoint 88k

0

Entering edit mode

Hi, I have almost the same dataset then you. Is it possible to share the commands for the points 3,4,5 please ? Thanks.

ADD REPLY • link 5 weeks ago by Picasa ▴ 690

score 2 · Answer 1 · 2025-01-09

I have had great success using genrich as peak caller for ATAC-Seq upstream of csaw. I recommend you consider it.

With genrich, replicates are jointly analyzed, allow you to produce a single peak-set for each experimental condition (note: I believe that using genrich parameters -q .05 -a 0.0 are arguably sensible and find they produced empirically plausible results).

In your case, you would produce 6 peak-sets, one for each combination of time-point and condition. (drag&drop them into IGV to visualize!)

These peak-sets are in narrowPeak format, with each peak attributed with values used in its calling, along with a final pValue and qValue, allowing subsequent filtering, if so desired.

You can then merge your 6 (possibly filtered) condition-specific peak sets into a single reference peak-set. (note: I prefer the term reference since consensus connotes agreement, whereas you may well find peaks in one condition that are absent in another, lacking consensus, which will undoubtedly be quite interesting to your analysis).

You may implement the merge using bedtools, but I found it valuable (e.g. for visualization in IGV) to instead use the capabilities provided by R/BioConductor's GenomicRanges::reduce to merge proximal (e.g. within 100bp) peaks and carry forward, as narrowPeak attributes of each resultant reference peak, the narrowPeak attributes of the input merged peak having the smallest pValue. (again: drag&drop into IGV to visualize!)

If you do this, I bet your reference set has far fewer than your current 300K. Let us know!

I then use csaw to (a) test for differential accessibility every 50bp window overlapping any peak in the reference peak-set (specifically using csaw's glmQLFTest with adjusted p value=.05) and (b) combine the test results at the reference peak level (thus controlling the overall false discovery rate), using csaw::overlapResults.

The results seem quite plausible in terms of downstream interpretations (e.g. motif enrichment, GO analysis, etc) recapitulating know players.

Let us know how it goes for you!

FWIW: some un-discussed aspects:

read mapping - I use STAR
use of a "blacklist" bed file - I make one from INPUT ChIP-Seq loosely following https://www.nature.com/articles/s41598-019-45839-z - depending on your organism/genome version you may find one from another lab readily available
production of 50bp window coverage matrix - I used bedtools makewindows and featureCounts for tabulation followed by quantile normalization (because my sample preps seemed so disparate in dynamic range for historical reasons)
QA