I am in the process of analysing some ATAC-seq data. I have already performed QC on the data using FastQC, and I noticed that duplication levels varied widely across samples (10% to 95%) and were generally high. I came across the ATACseqQC paper (https://www.ncbi.nlm.nih.gov/pubmed/29490630), where they recorded duplication levels of 0.6% to 38% in their data. However, it gives no information on what an acceptable level of duplication would be. Can someone please give some advice on this matter? Thanks!
P.S. I am new to ATAC-Seq data analysis. I have scanned the literature and haven't found much help on this topic.
This will end up varying strongly based upon your sequencing depth, which is why there aren't any strict thresholds. If you're looking at differential accessibility then the most important thing is that you have comparable duplication rates across samples/groups. For what it's worth, our most recent ATAC-seq samples had ~60 million reads each and had 20-25% duplication rates. That's pretty normal, in my experience. If you have >50% duplication and haven't thrown a HiSeq lane at it then likely something went wrong during library prep.
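To illustrate why depth matters so much, here is a minimal sketch in Python. It assumes fragments are sampled uniformly, with replacement, from a library of a fixed number of distinct molecules. That is a simplification: real ATAC-seq libraries are far from uniform, so treat the numbers as a rough guide only. The function name and the library size of 100 million are hypothetical, chosen just for illustration.

```python
import math


def expected_duplication(n_reads, library_size):
    """Expected duplicate fraction under uniform sampling with replacement.

    n_reads: number of sequenced fragments.
    library_size: number of distinct molecules in the library (an assumed,
    illustrative value; in practice it has to be estimated from the data).
    """
    if n_reads <= 0:
        return 0.0
    # Expected number of distinct molecules seen at least once.
    expected_unique = library_size * (1.0 - math.exp(-n_reads / library_size))
    return 1.0 - expected_unique / n_reads


# Same hypothetical library, sequenced at three different depths.
for depth in (10e6, 60e6, 200e6):
    rate = expected_duplication(depth, 100e6)
    print(f"{depth:>12,.0f} reads -> {rate:.1%} duplicates")
```

Under this model the same library gives a low duplication rate at shallow depth and a much higher one when sequenced deeply, which is why a single cut-off is hard to justify and why comparing samples at similar depth is more informative.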
Is this for the human genome? For yeast, with a 12 Mb genome, I get around 55-65% duplication with 30-40 million reads, though I don't know whether that's good or not.
Thank you all for the responses. I forgot to mention that I am dealing with human samples (~60 million PE reads). We ran a couple of different experiments, e.g. a control versus treatment experiment. We found >50% duplication in the control and >80% in the treatment samples, which points to issues in the sample prep itself...
I hope you removed everything other than nuclear reads?
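The reason this matters: the mitochondrial genome is tiny compared with the nuclear genome and is not packaged into nucleosomes, so ATAC-seq reads from it pile up on the same positions and can inflate the duplication rate. A toy sketch in Python of the effect, assuming fragments are already available as (chrom, start, end) tuples and that the mitochondrial contig is named "chrM", "MT" or "M" (naming depends on the reference, so check yours). The function names are hypothetical, and on real data this would be done on the BAM file with your usual tools.

```python
# Contig names treated as non-nuclear; an assumption, adjust to your reference.
NON_NUCLEAR = {"chrM", "MT", "M"}


def nuclear_only(fragments, non_nuclear=NON_NUCLEAR):
    """Keep only fragments whose chromosome is not in `non_nuclear`."""
    return [frag for frag in fragments if frag[0] not in non_nuclear]


def duplication_rate(fragments):
    """Fraction of fragments that repeat an already-seen (chrom, start, end)."""
    fragments = list(fragments)
    if not fragments:
        return 0.0
    return 1.0 - len(set(fragments)) / len(fragments)


# Toy data: three identical mitochondrial fragments, three distinct nuclear ones.
toy = [
    ("chrM", 10, 150), ("chrM", 10, 150), ("chrM", 10, 150),
    ("chr1", 100, 250), ("chr1", 300, 480), ("chr2", 50, 210),
]

before = duplication_rate(toy)               # all fragments
after = duplication_rate(nuclear_only(toy))  # nuclear fragments only
print(f"before filtering: {before:.1%}, after filtering: {after:.1%}")
```

In this toy case all of the duplication comes from the mitochondrial fragments, so the rate drops once they are removed. Real data will not be this clean, but it is worth recomputing duplication on nuclear reads before drawing conclusions about the library prep.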