Hi, I've been analysing an ATAC-seq dataset and I'm currently deduplicating the reads (after removing mitochondrial reads). Deduplication removes something on the order of 15% of the reads from my samples, but about 90% of the reads from the gDNA control. (The gDNA control was phenol-chloroform-extracted DNA from the relevant cells, incubated with Tn5 in the same way as the permeabilised nuclei for the experimental samples.)
Has anyone had this level of duplication in a gDNA control in an ATAC-seq experiment? Is it expected? Should I even bother deduplicating the gDNA control? Any advice will be much appreciated. Thanks in advance.
This is the first time I've heard of such a control, and we have been doing the assay pretty much since its early days. What's its purpose? Isn't it mainly mitochondrial DNA?
It's to estimate the background Tn5 cleavage of protein-free DNA. Peaks can then be called against that control. May I ask whether you use controls for your ATAC experiments, and if so, what they are? Thanks!
We don't run controls, and I haven't seen a study that does. Honestly, it makes some sense as a way to assess coverage bias at individual loci, but you would need to sequence the control quite deeply to get coverage across the genome, and standard peak callers like macs2 would then scale it back down towards the depth of the ATAC-seq samples, which is typically only around 25 million reads, so imo there is little point in doing the controls at all. After all, we (and probably most people) are interested in differential analysis rather than "defining" open chromatin, so the controls are not really necessary.
Thanks, good point. I will try peak calling with and without the control file to see how different the outputs are.
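A rough way to compare the two peak sets, in case it's useful: the sketch below computes the fraction of peaks from one call set that overlap at least one peak in the other. It assumes the peaks have already been read into (chrom, start, end) tuples with BED-style half-open coordinates. The function name and the toy coordinates are made up for illustration and do not come from any tool or from this dataset.

```python
from collections import defaultdict


def fraction_overlapping(peaks_a, peaks_b):
    """Fraction of peaks in peaks_a that overlap at least one peak in peaks_b.

    Each peak is a (chrom, start, end) tuple, 0-based and half-open as in BED.
    This is a simple quadratic scan per chromosome, fine for a quick check.
    """
    if not peaks_a:
        return 0.0

    # Group the second peak set by chromosome so we only compare like with like.
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))

    hits = 0
    for chrom, start, end in peaks_a:
        # Two half-open intervals overlap if each starts before the other ends.
        if any(s < end and start < e for s, e in by_chrom.get(chrom, [])):
            hits += 1
    return hits / len(peaks_a)


# Hypothetical toy peak sets, standing in for calls made with and without the control.
with_control = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
without_control = [("chr1", 150, 250), ("chr2", 10, 40), ("chr2", 70, 90)]

print(fraction_overlapping(with_control, without_control))
```

Running it in both directions (A against B, then B against A) shows whether one call set is roughly a subset of the other or whether they genuinely disagree.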
I don't understand. I don't do ATAC-seq, but I think it would be helpful here if you explained what the controls are and how you prepared them. If, at the same depth, you are getting much higher duplication levels from supposedly randomly fragmented DNA than from reads that are supposed to be clustered around peaks, your experiment is probably not valid.
I was trying to figure this out, thanks for pointing it out. I believe it has to do with the number of PCR cycles used during library prep, which for a technical reason differed between the samples and the control. I was just gauging what others might think about this result.
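For intuition on how duplication depends on depth and library complexity, here is a minimal sketch of the simple Poisson sampling model that underlies Lander-Waterman style library size estimates. It assumes every unique fragment is equally likely to be sequenced, which real libraries violate, and the library sizes and read depth below are hypothetical, not measured from this experiment.

```python
import math


def expected_duplicate_fraction(n_reads, library_size):
    """Expected fraction of reads flagged as duplicates.

    Assumes n_reads are drawn uniformly at random from a library containing
    library_size distinct fragments (an idealised model, not a measurement).
    """
    if n_reads <= 0 or library_size <= 0:
        raise ValueError("n_reads and library_size must be positive")
    # Expected number of distinct fragments observed after n_reads draws.
    unique = library_size * (1.0 - math.exp(-n_reads / library_size))
    return 1.0 - unique / n_reads


# Hypothetical numbers: the same sequencing depth applied to a
# high-complexity library and a low-complexity library.
depth = 25_000_000
for label, size in [("high complexity", 50_000_000), ("low complexity", 2_000_000)]:
    print(label, round(expected_duplicate_fraction(depth, size), 3))
```

Under this model, extra PCR cycles do not add unique fragments, so a library that started with fewer distinct molecules will show a much higher duplicate fraction at the same depth. Whether that is what happened with the gDNA control is an assumption that would need checking against the actual library prep.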