Question

Histogram of Paired End Fragment Lengths

0

Entering edit mode

3.4 years ago

ss3943 • 0

I have various BAM files for which I want to plot the distribution of fragment lengths. I know that deepTools has the bamPEFragmentSize tool, but this only samples the reads from a binned genome. We have relatively small files--maybe only 100k reads that map to the viral genome of interest--so the binning and sampling causes a loss of resolution.

I wanted to just take the 9th column from the SAM file, then turn that into a Numpy array and create a histogram from there. However, I'm getting a histogram that doesn't make sense to me. I would like fragment length on the x axis and frequency of that fragment length on the y axis.

After cutting the 9th column from the SAM file, I loaded this file into Python and converted it into a Numpy array:

samFile="ninth-column.txt"
#read file line by line, then remove /n
with open(samFile) as f:
    content = f.readlines()
content = [x.strip() for x in content]

#convert string list to integer list
for i in range(0, len(content)):
    content[i] = int(content[i])
content[0:10]
>>> [68, 0, 0, 147, 347, 177, 347, 246, 245, -68]

#convert integer list to numpy array
arr = np.array(content)
abArr = np.absolute(arr)
print(abArr[0:10])
>>>[ 68   0   0 147 347 177 347 246 245  68]

#make histogram
readLengths = np.histogram(abArr)
readLengths
>>>(array([39010090,      111,       68,       59,       41,       15,
              10,       16,        7,        4]),
 array([0.00000000e+00, 2.27951284e+07, 4.55902568e+07, 6.83853852e+07,
        9.11805136e+07, 1.13975642e+08, 1.36770770e+08, 1.59565899e+08,
        1.82361027e+08, 2.05156156e+08, 2.27951284e+08]))

plt.hist(readLengths)

plt.show()

enter image description here

numpy python histogram • 1.1k views

ADD COMMENT • link updated 3.4 years ago by liorglic ★ 1.4k • written 3.4 years ago by ss3943 • 0

score 0 · Answer 1 · 2021-06-26

0

Entering edit mode

3.4 years ago

liorglic ★ 1.4k

Almost there! I think this will give you what you want:

plt.hist(abArr)
plt.show()

Also note the 0 values - these probably arise from unmapped reads (or mates), so you might want to filter your reads before running your code.

ADD COMMENT • link 3.4 years ago by liorglic ★ 1.4k