I have various BAM files for which I want to plot the distribution of fragment lengths. I know that deepTools has the bamPEFragmentSize tool, but this only samples the reads from a binned genome. We have relatively small files--maybe only 100k reads that map to the viral genome of interest--so the binning and sampling causes a loss of resolution.
I wanted to just take the 9th column from the SAM file, then turn that into a Numpy array and create a histogram from there. However, I'm getting a histogram that doesn't make sense to me. I would like fragment length on the x axis and frequency of that fragment length on the y axis.
After cutting the 9th column from the SAM file, I loaded this file into Python and converted it into a Numpy array:
samFile="ninth-column.txt"
#read file line by line, then remove /n
with open(samFile) as f:
content = f.readlines()
content = [x.strip() for x in content]
#convert string list to integer list
for i in range(0, len(content)):
content[i] = int(content[i])
content[0:10]
>>> [68, 0, 0, 147, 347, 177, 347, 246, 245, -68]
#convert integer list to numpy array
arr = np.array(content)
abArr = np.absolute(arr)
print(abArr[0:10])
>>>[ 68 0 0 147 347 177 347 246 245 68]
#make histogram
readLengths = np.histogram(abArr)
readLengths
>>>(array([39010090, 111, 68, 59, 41, 15,
10, 16, 7, 4]),
array([0.00000000e+00, 2.27951284e+07, 4.55902568e+07, 6.83853852e+07,
9.11805136e+07, 1.13975642e+08, 1.36770770e+08, 1.59565899e+08,
1.82361027e+08, 2.05156156e+08, 2.27951284e+08]))
plt.hist(readLengths)
plt.show()