Hi guys, thanks in advance for help.
I currently calculate the on-target depth of coverage for a sample using samtools depth:
samtools depth -a -b bedfile.bed dupmrked_sorted_bamfile.bam > output_bamdepth.txt
This outputs the depth at every position (including those with 0 coverage) within the regions in the bedfile; reads flagged as duplicates are excluded by samtools depth's default filter.
To get a mean value I just run:
awk '{s+=$3}END{print s/NR}' output_bamdepth.txt
This sums the depth column from the samtools output and divides by the number of positions (with -a, that is every position in the target regions, not just covered ones), giving the mean on-target depth.
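As a self-contained check of that averaging step (the sample data below stands in for output_bamdepth.txt; the guard avoids dividing by zero on an empty file):

```shell
# three-column "chrom  pos  depth" sample, as produced by samtools depth
printf 'chr1\t100\t10\nchr1\t101\t0\nchr1\t102\t20\n' > example_depth.txt
# sum depths and divide by the number of positions reported
awk '{s += $3} END { if (NR > 0) printf "%.2f\n", s/NR; else print "no positions" }' example_depth.txt
# → 10.00
```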
My issue is that the input reads are paired-end. After sorting and marking duplicates with Picard, each mate (forward or reverse) is stored as its own record in the BAM. This means a position may be covered by both mates of the same pair, so it is sometimes double counted and sometimes not.
This is just as much a biological problem as it is a computer problem. We have not tried to optimize for shearing size in comparison to read length which would end up providing us with a sweet spot.
If we have a 250 bp fragment and do 2 × 150 bp paired-end sequencing, then R1 covers part of the fragment, R2 covers part of the fragment, and the middle is covered by both R1 and R2 from the same pair. This overlap region gives false double coverage. Does anyone have a way of overcoming this?
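One way to handle this directly on the existing BAM: newer samtools releases offer a -s flag for samtools depth that counts the overlapping section of a read pair only once (check `samtools depth` usage in your installed version before relying on it). A sketch reusing the filenames from the question above:

```shell
# -s: count the overlapping section of a mate pair only once
# (flag availability depends on samtools version; otherwise the
#  rest of the command is unchanged from the original pipeline)
samtools depth -a -s -b bedfile.bed dupmrked_sorted_bamfile.bam > output_bamdepth_nooverlap.txt
awk '{s += $3} END { if (NR > 0) print s/NR }' output_bamdepth_nooverlap.txt
```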
Thanks, Dennis
You could merge overlapping read pairs (e.g. with FLASH, BBMerge, ...) and then process the merged reads as single-end.
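For example, with BBMerge from the BBTools suite (the FASTQ filenames here are placeholders, not from the original pipeline):

```shell
# merge mates whose fragments overlap into single reads;
# pairs that do not overlap go to the outu1/outu2 files
bbmerge.sh in1=reads_R1.fastq.gz in2=reads_R2.fastq.gz \
    out=merged.fastq.gz outu1=unmerged_R1.fastq.gz outu2=unmerged_R2.fastq.gz
```

The merged and unmerged reads can then be aligned and fed through the same depth calculation without any mate overlap.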
I appreciate that, but I am not looking to modify the reads or generate new data. I think this can be done using the information already in the BAM records, but I am not sure how.