I recently downloaded some publicly available ATAC-seq data. I aligned it to the reference genome with BWA, removed duplicates (in this instance 70% of the library is duplicates), and then used Picard tools to generate a fragment size distribution. However, I see a large peak at around 20bp. The library was sequenced as 75bp paired-end reads (75bp forward and 75bp reverse). Does a 20bp insert length mean that the insert is just short? How can I check this? Presumably the reads contain a lot of adapter sequence?
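One way to check for adapter read-through directly in the raw reads is to look for the adapter inside each read and record where it starts, since that position is the apparent insert length. This is only a minimal sketch: it assumes the library was made with the standard Nextera/Tn5 adapter (CTGTCTCTTATACACATCT), which is not confirmed for this dataset, and the function names are made up for illustration.

```python
from collections import Counter

# Assumption: standard Nextera/Tn5 adapter; swap in the real one if it differs.
NEXTERA_ADAPTER = "CTGTCTCTTATACACATCT"


def adapter_start_histogram(reads, adapter=NEXTERA_ADAPTER, min_match=12):
    """Count the read position at which the adapter prefix first appears.

    Only the first `min_match` bases of the adapter are searched for (exact
    match), so adapters cut off near the read end or containing sequencing
    errors are missed. Reads without a hit are counted under the key None.
    """
    seed = adapter[:min_match]
    hist = Counter()
    for seq in reads:
        pos = seq.upper().find(seed)
        hist[pos if pos >= 0 else None] += 1
    return hist


def read_fastq_sequences(path):
    """Yield the sequence line of each 4-line record in a plain-text FASTQ."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()
```

For example, `adapter_start_histogram(read_fastq_sequences("reads_1.fastq"))` (hypothetical file name, uncompressed FASTQ) would show a pile-up at position 20 if many reads really carry a 20bp insert followed by adapter.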
Can you post the plot of the insert size distribution? It's common to observe a sharp peak below 100bp, but you should also see a peak at 150-200bp and then another at around 300bp.
I have uploaded said image now
Odd plot, I have never seen anything like that in ATAC-seq data, and I think I've seen quite a few of them. Which dataset is that? Then I can quickly run it through my pipeline to see whether it is indeed an odd library or a technical thing to debug. Did you filter chrM before collecting insert sizes?
It is very odd. It is for an obscure species and was published a few days ago.
I did not filter chrM as we do not have this information.
OK, I see. It could be that the sharp peak is some heavily digested non-nuclear DNA like chrM (or any other organelle DNA or parasite DNA that might be in the worm). Here is how the insert sizes look for only chrM in mouse:
You also see that it accumulates at short fragment sizes, as this nucleosome-free DNA is an attractive target for the transposome. Maybe you can make a kind of pseudo-chrM by taking the mitochondrial genome of a closely related, well-annotated species and including it in the reference to get rid of some of this contamination. Or maybe take all the reads below 50bp insert size and try to assemble them, followed by sequence comparison to chrM or other organelle DNA, to get an idea of what it is.
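Pulling out the short-insert reads for assembly or a sequence search could be sketched roughly as below. It assumes SAM text as input (for example piped from `samtools view`) and uses the absolute value of the TLEN column as the insert size. The function name and the 50bp default are just for illustration.

```python
def short_insert_fasta(sam_lines, max_insert=50):
    """Yield FASTA records for mapped reads with 0 < |TLEN| < max_insert.

    `sam_lines` is any iterable of SAM-formatted text lines; header lines
    starting with '@' are skipped.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        tlen = abs(int(fields[8]))
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800).
        if flag & (0x4 | 0x100 | 0x800):
            continue
        if 0 < tlen < max_insert and seq != "*":
            mate = "2" if flag & 0x80 else "1"
            yield f">{name}/{mate} tlen={tlen}\n{seq}\n"
```

Something like `sys.stdout.writelines(short_insert_fasta(sys.stdin))` in a small script fed by `samtools view` would write the FASTA to standard output. Note that SAM stores reverse-strand reads reverse-complemented, which should not matter for assembly or a BLAST-style comparison.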
I realise now that perhaps a peak at 20bp (insert size) corresponds to a fragment of ~95bp?