I have paired end 150bp high throughput methylation sequencing data. I have aligned the reads using bsbolt (which uses a modified version of BWA) and now have sorted bam files.
The data is from plasma cell-free DNA and it is of interest to calculate the size of the cfDNA being captured by the read pairs (like in this paper) in order to estimate the distribution of cfDNA fragment lengths (where the average cfDNA molecule is ~160bp). I've been trying to calculate this using the insert size statistic in the bam file, but the numbers do not make sense with definitions I've seen in other places on biostars. For example, this post says I must subtract the length of the read from insert size length I get from the bam file. However, the values I get from the command I am running are all less than 150bp, which is the size of my read, so these numbers do not seem to be consistent with previous definitions of insert size.
The exact command I have been running is
samtools view -f 67 -F 2304 test.bam | cut -f9 > test_insert_sizes.txt
The -f 67
parameter was to get the first in pair of a properly paired read and the -F 2304
is to eliminate reads that are secondary or supplementary alignments. I am using samtools 1.13
I have been reading the samtools documentation for sam files (2022 edition) and it seems like the definition of TLEN may have changed (and the change seems especially relevant in my case where I might expect overlapping of R1 and R2 due to the short fragment size of cell-free DNA).
My question is, then, what is the appropriate way to calculate the length of the DNA fragment that is being captured by a read pair? I would prefer to have a number for every read pair, and not summary statistics.
Thanks!