Hello,
I have a question about best practices for performing differentially accessibility analysis for ATACseq data. I am a recent college graduate working in my first bioinformatics lab, so I apologize in advance if my terminology or wording is confusing/not fully correct.
I am analyzing bulk ATACseq data from treated and control samples isolated at 3 different timepoints. I have 3 replicates per condition for each of the 3 timepoints, so there are 18 samples total. We have previously performed bulk RNA-seq on these same samples, which showed fairly robust differential gene expression based on treatment. Given this, my initial hypothesis was that we would also find differentially accessible regions dependent on treatment corresponding to the changes in gene expression. However, after my first pass of the data, I found very low (<50) to no differentially accessible regions between treated and control samples (or even between timepoint).
As background on how I processed the samples:
- I processed the data via the nf-core ATACseq pipeline (standard processing).
- I then used MACS3 to call peaks for each of the samples, which identified anywhere from ~35k to ~75k significant peaks per sample. From those sample-specific peaks,
- I then used bedtools to merge peaks from all 18 samples to generate a list of ~300k consensus peaks present in at least one of the 18 samples.
- I then used the csaw package to generate a matrix of raw counts present in the regions defined by these 300k peaks.
- With this raw counts matrix for all 18 samples, I performed differential accessibility testing between treated and control samples using DESeq2. However, like I mentioned above, there were very few to no consensus peaks for several of the contrasts between treated and control samples that I performed.
I am thinking of reasons as to why so few differentially accessible regions were identified (the robust differential gene expression between treatment/control leads me to believe that we should also observe at least some modest differentially accessibility). My first thought is that, because the consensus peak list is so large (300k peaks) for the 18 samples, the large number of tests being conducted means the FDR would make it "difficult" to identify statistically significant differentially accessible regions.
Secondly, when I merged peaks across the 18 samples to obtain the 300k consensus peak list, I used a 1bp overlap to identify a peak as shared between any two samples. So because I took the whole range of peaks within that overlap to define the region of the "shared peak", some of the shared peaks might be quite large. I am not sure if this is true, but I would imagine that there is a lot of "noise" or random signal the further away you are from the summit of the peak. So basically, if a merged peak had a 3k bp window, there could be a lot of random noise in that huge window that could affect the ability to call differential peaks between samples.
Given that reasoning, I am thinking about how to revise this analysis. One idea I had is to merge called peaks for just the specific samples I am interested in doing DA analysis for, and run DESeq2 on that smaller consensus peak list. The second idea would be to first identify peaks shared between all 3 replicates for a given timepoint and condition. And then taking this consensus peak list for each of the 6 groups of 3 replicates and merging them (which would also reduce the total # of peaks, I think).
I would appreciate if anybody has feedback on my reasoning and has suggestions/insights for how I might approach this data. Again, I apologize if I was not clear in describing the data and my thoughts. I can provide my code if that is helpful.
Your reasoning makes sense to me. I do something similar. Usually I am stringent on per-sample peak calls by fording a minimum peak width and FDR cutoff, and then I only keep peaks shared between at least two/three/... (depends on group size) samples. That massively reduces the total number of peaks and helps with multiple testing burden. It also eliminates a lot of noisy/nonsense peaks.