Hi,
I am trying to obtain the number of reads starting at each position from a bam file. I've used
samtools depth -aa ${bamfile} | cut -f 3 > ${pileupFile}
to create a depth of coverage per position, and confirmed that all reads have length 100 using
samtools view ${bamfile} | awk '{print length($10)}' | sort -u
On that file I've used a simple Python implementation to obtain the number of starting reads of length 100 at each position (I'll use some faster language later, but right now I'm working on a toy example):
import numpy as np
...
pileup = np.loadtxt(infilename, dtype=int)
N = len(pileup)
readLen=100
# keep the first entry as is, and then calculate the difference pileup[i]-pileup[i-1]
pileup = np.ediff1d(pileup, to_begin=pileup[0]).astype(int)
for i in xrange(N-readLen):
pileup[i+readLen] += pileup[i]
assert pileup[i]>=0
assert pileup[i+readLen]>=0
It is easy to show that this yields the required number of starting reads at each position. This number cannot be negative, unless readLen is wrong or the pileup was not created by overlapping 100-mers. Strangely enough, I get tons of negative values. I am fairly confident that my Python script is correct.
My question is: does samtools do something fancy (other than capping at 8000, which is not the case in my case) when calculating read depth? I am using samtools 1.3.1-15-g21fe47e (using htslib 1.3.1-4-gdd40524).
EDIT: I am aware that I can easily achieve my goal using
samtools view ${bamfile} | cut -f 4 | uniq -c | awk '{print $1}' | cut -f 1 -d ' '
I am more concerned about a possible bug in samtools (unless I made a stupid mistake in my Python code).
is it ?
Yes. Using ediff1d, pileup contains the number of changes in coverage. It is equivalent to the number of reads starting there minus the number of reads ending at i-1, i.e. it undercounts the number of start reads by the ones ending right before. In the loop, pileup[i] contains the number of starting positions at i. Adding that count to pileup[i+readLen] compensates the undercounting. This is the in-place version of an algorithm that stores the number of reads ending at a given position in a separate array.