Hi guys,
I've started dealing with ATAC-seq data and I have a couple of questions that hopefully some of you have pondered over, too.
My understanding is as follows:
- Tn5 will cut anywhere there is no nucleosome, so regions of generally open chromatin will yield smallish fragments of between 40 bp and ca. 100 bp -- I assume this is simply a function of the presence of certain motifs that the transposase prefers?
- fragments of 150-300 bp represent events where the transposase managed to cut on both sides of a single nucleosome (of course, this can go on, as the transposase may also cut out two nucleosomes, resulting in fragments of 2x150 bp + x bp of linker DNA)
All publications that I've seen so far show the characteristic distribution of fragment lengths, with high peaks for very short fragments (< 150 bp) and smaller bumps for fragment lengths that are representative of mono-, di-, and tri-nucleosomes. In fact, in the data I've looked at, we barely see any fragments greater than 300 bp, which does not surprise me given Illumina's inherent preference for short fragments.
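To make the size classes above concrete, here is a minimal sketch of how one might bin fragment lengths. The cutoffs (150, 300, 500 bp), the class names, and the toy lengths are my own assumptions for illustration, loosely based on the ranges mentioned above; they are not taken from any publication or tool.

```python
from collections import Counter

# Illustrative size classes as (exclusive upper bound in bp, label).
# ASSUMPTION: these cutoffs are rough guesses based on the ranges in this
# post, not established thresholds.
SIZE_CLASSES = [
    (150, "sub-nucleosomal"),
    (300, "mono-nucleosome"),
    (500, "di-nucleosome"),
]


def classify_fragment(length):
    """Return the size class label for one fragment length in bp."""
    for upper, label in SIZE_CLASSES:
        if length < upper:
            return label
    return "larger"


def size_class_fractions(lengths):
    """Return the fraction of fragments that falls into each size class."""
    counts = Counter(classify_fragment(n) for n in lengths)
    total = sum(counts.values())
    return {label: counts[label] / total for label in counts}


# Hypothetical fragment lengths, e.g. as taken from the insert sizes of
# properly paired alignments.
fragment_lengths = [45, 60, 80, 95, 120, 180, 210, 250, 340, 620]
fractions = size_class_fractions(fragment_lengths)
for label, fraction in fractions.items():
    print(f"{label}: {fraction:.2f}")
```

With real data, the lengths would come from the paired-end alignments; the fractions then give a quick numerical summary of the histogram described above.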
So, here are my questions:
1. What is the background signal in ATAC-seq?
Since closed chromatin will lead to larger fragments, these will never be seen in the same quantities as the short fragments even if they are there (and surely, the majority of the genome is, in fact, not nucleosome-free as the ATAC-seq histograms may suggest - or is it?). Have you seen ATAC-seq data where the entire genome is somewhat uniformly covered with strong enrichments around the promoters?
2. What is the peak calling meant to achieve?
I come from the ChIP-seq world where peak calling is your best shot at zooming into regions that are not just open chromatin, but actually binding sites of the transcription factor you were trying to precipitate. Peak callers like MACS try to understand what the background signal is (i.e., the majority of reads covering most of the genome) and then pinpoint regions that are at the extremes of that background model. For ATAC-seq, I'm not sure what the background signal is supposed to be (closed chromatin is definitely under-represented), so what is the peak calling really meant for? And is MACS actually an appropriate means to that end given that there's no real uniform coverage?
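To illustrate what I mean by "extremes of the background model", here is a deliberately simplified sketch: each window's read count is tested against a Poisson distribution with a single genome-wide background rate. This is only in the spirit of what a Poisson-based peak caller does, not MACS's actual implementation, and the window counts, the background rate, and the p-value cutoff are all made up.

```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X < k)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def enriched_windows(window_counts, background_rate, p_cutoff=1e-5):
    """Return indices of windows whose count is extreme under the background."""
    return [
        i for i, count in enumerate(window_counts)
        if poisson_upper_tail(count, background_rate) < p_cutoff
    ]


# Hypothetical read counts in five equally sized windows.
window_counts = [2, 3, 1, 40, 2]
# ASSUMPTION: an expected background of 2 reads per window.
background_rate = 2.0

print(enriched_windows(window_counts, background_rate))
```

The whole test hinges on `background_rate`. If closed chromatin contributes almost no reads, it is unclear to me what that rate should be estimated from, which is exactly the question.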
3. Is ATAC-seq more similar to RNA-seq than to ChIP-seq?
Following this line of thought, should one really think of ATAC-seq more in terms of RNA-seq analysis than of ChIP-seq analysis? After all, it seems to me as if ATAC-seq peaks may be equivalent to identifying "expressed genes" (because unexpressed genes are also usually missing from RNA-seq) and the analysis should really focus on the differential read counts between the same region in two samples. If that is the case, this opens a whole other can of worms (e.g. defining the regions, normalizing read counts, number of replicates etc.) that should probably be discussed in a different thread.
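As a rough sketch of what the RNA-seq-style view would look like: count reads per region in each sample, scale by library size, and compare. The counts-per-million scaling, the pseudocount, the function names, and all numbers below are assumptions for illustration; a real analysis would need replicates and a proper statistical model rather than a bare fold change.

```python
import math


def counts_per_million(counts, library_size):
    """Scale raw region counts to counts per million sequenced reads."""
    return [c * 1e6 / library_size for c in counts]


def log2_fold_changes(counts_a, counts_b, lib_a, lib_b, pseudocount=1.0):
    """Return the per-region log2(B / A) after library-size scaling.

    The pseudocount avoids division by zero for regions without reads.
    """
    cpm_a = counts_per_million(counts_a, lib_a)
    cpm_b = counts_per_million(counts_b, lib_b)
    return [
        math.log2((b + pseudocount) / (a + pseudocount))
        for a, b in zip(cpm_a, cpm_b)
    ]


# Hypothetical read counts for the same three regions in two samples.
control_counts = [100, 50, 10]
treated_counts = [200, 100, 80]
# Hypothetical library sizes; the treated sample was sequenced twice as deep.
control_library = 1_000_000
treated_library = 2_000_000

lfc = log2_fold_changes(
    control_counts, treated_counts, control_library, treated_library
)
for region, value in enumerate(lfc):
    print(f"region {region}: log2FC = {value:.2f}")
```

In this toy example the first two regions double in raw counts only because of the deeper sequencing, so their fold change vanishes after scaling, while the third region remains different.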
I appreciate any insights and critical comments!
Cheers,
Friederike
Thanks for sharing!
Where is that coming from though? In ChIP-seq, DNA is most commonly fragmented using sonication and fragments are size-selected prior to sequencing. While this is not completely random, we tend to see virtually the entire genome covered, which indicates to me that the sonication eventually manages to break even nucleosomal DNA apart. For ATAC-seq there's no size selection and my perhaps naive impression was that the transposase is not really going to unravel nucleosomal DNA, so while it can cut in closed regions, the resulting fragments will be so long that they will hardly be sequenced. Are you saying that, at least for bulk ATAC-seq, the transposase seems to be able to integrate in generally closed chromatin regions (that may be open stochastically in individual cells), therefore generating short fragments from closed regions that will show up in the sequenced reads?
I can see how your workflow makes total sense. In my case, however, people are not necessarily interested in specific TFs, but just want to see whether their experimental perturbations lead to changes in chromatin accessibility. In that regard I would also be interested to know whether you think that actual changes in the peak height (after somewhat accounting for differences in sequencing depth) are meaningful (in ChIP-seq, I would be very hesitant to interpret them because the enrichment depends on so many technical factors). Now that I'm thinking about it - why _are_ the promoters (and enhancers) so dramatically enriched anyway? Does that imply that the gene bodies are never as "open" as the promoters although they need to accommodate the entire transcription machinery?