I have ATAC-seq data and I wish to plot the distribution of insert sizes as shown in this paper by Buenrostro et al, 2014 (figure 2a).
I have no problem getting the insert size for every read pair (the distance between the start of the mapping location of R1 and the end of the mapping location of R2) and plotting the frequency of occurrence of each insert size, which I thought would be enough, but I can't seem to reproduce the periodicity shown.
In the paper, "normalized read density" is plotted. Should I also normalize the occurrences? What could this normalization be, and why do we need to normalize in this case?
PS. My question has been partially asked in this thread. I created a new thread to focus on understanding the normalized read density.
At the most basic level, the reason for normalization is that different samples will have different numbers of reads. Thus, a sample with 10 reads will have half the fragments of a sample with 20 reads, but that does not mean it worked half as well.
That paper has a more advanced normalization strategy as it pertains to Figure 2b:
First, the distribution of paired-end sequencing fragment sizes
overlapping each chromatin state
(http://www.ensembl.org/info/docs/funcgen/regulatory_segmentation.html)
were computed. The distributions were then normalized to the percent
maximal within each state and enrichment was computed relative to the
genome-wide set of fragment sizes.
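To make the quoted description a bit more concrete, here is a minimal Python sketch of one possible reading of it: each fragment-size distribution is scaled to percent of its own maximum, and the per-state values are then divided by the genome-wide values. The function names and the input format (a dict mapping fragment size to count) are my own assumptions, and the paper's actual code may well differ.

```python
def percent_of_max(counts):
    """Scale a {fragment_size: count} dict so that its peak equals 100."""
    if not counts:
        return {}
    peak = max(counts.values())
    return {size: 100.0 * n / peak for size, n in counts.items()}


def enrichment(state_counts, genome_counts):
    """Ratio of the percent-of-max profile of one chromatin state to the
    genome-wide profile (an assumed interpretation of the quoted text)."""
    state_pct = percent_of_max(state_counts)
    genome_pct = percent_of_max(genome_counts)
    # Skip sizes that are absent genome-wide to avoid dividing by zero.
    return {
        size: state_pct[size] / genome_pct[size]
        for size in sorted(state_pct)
        if genome_pct.get(size, 0) > 0
    }
```

A value above 1 at a given fragment size would then suggest that this size is over-represented in that state relative to the genome-wide set.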
I understand the rationale behind normalization when working with measures coming from different samples. However, in figure 2a (the one I wish to reproduce), it seems that the distribution of insert sizes concerns pairs coming from a single sample:
The insert size distribution of sequenced fragments from human
chromatin had clear periodicity of approximately 200 bp, suggesting
many fragments are protected by integer multiples of nucleosomes (Fig.
2a).
In figure 2b, indeed, distributions of insert sizes overlapping different chromatin states are normalized in order to perform a proper enrichment analysis, which is different from what is shown in figure 2a.
I believe figure 2a is just one sample shown as an example. I guess the idea is if you normalize the reads, you could easily compare to other samples if you wanted to do that.
The normalization in that case was simply the division of the obtained count per insert size by total readcount in the bam file (excluding chrM and everything unwanted). The author mentioned that a while ago in the ATAC-seq community.
You are right, I came across the same description in other papers as well. Thanks!
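For anyone landing here later, the simple normalization described above (count per insert size divided by the total count after filtering) could be sketched in Python roughly as follows. This is only a sketch under some assumptions: the input is SAM-formatted text (for example the output of samtools view), the insert size is taken from the TLEN column, the mitochondrial contig is named chrM or MT, and the total is counted in read pairs rather than individual reads, which only changes the result by a constant factor. The function names are hypothetical.

```python
from collections import Counter


def insert_sizes_from_sam(lines, excluded=("chrM", "MT")):
    """Yield one insert size per read pair from SAM-formatted text lines."""
    for line in lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        rname, tlen = fields[2], int(fields[8])  # RNAME and TLEN columns
        # TLEN is positive for the leftmost mate and negative for the other,
        # so keeping only tlen > 0 counts each pair once.
        if rname in excluded or tlen <= 0:
            continue
        yield tlen


def normalized_density(sizes):
    """Return {insert_size: count / total count of kept pairs}."""
    counts = Counter(sizes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {size: counts[size] / total for size in sorted(counts)}
```

Typical use would be something like normalized_density(insert_sizes_from_sam(sys.stdin)) with the SAM records piped in, then plotting the resulting values against insert size. The normalization only rescales the y-axis, so it does not change the shape of the distribution.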