I am calculating coverage over the CDS from a bam file using samtools depth
and Rsamtools pileup
. I expected the result to be the same, or at least very similar, but that hasn't quite been the case.
With samtools
, I am counting bases covered by at least one read like so:
grep $gene_id $gff | grep CDS | gff2bed | \
samtools depth -aa -Q 10 -b - $bam_path | \
cut -f3 | grep -v 0 | wc -l
The output, prior to the line starting cut...
looks like this:
❯ head ~/Desktop/tmp/coverage_test_out.txt
CP022326.1 650760 5
CP022326.1 650761 5
CP022326.1 650762 5
CP022326.1 650763 5
CP022326.1 650764 5
CP022326.1 650765 5
CP022326.1 650766 5
CP022326.1 650767 5
CP022326.1 650768 12
CP022326.1 650769 12
...
With Rsamtools::pileup
, I am doing the following:
#'
#' @param bamfile_path path to a bam file
#' @param gr a GRanges object with the range(s) of interest
#' @param strandedness one of c("stranded", "unstranded"). NOTE: forward only strand NOT currently configured
#' @param quality_threshold default 10L, equivalent to current HTSeq default
#'
#' @return number of covered bases in CDS
#'
calculateCoverage_test = function(bamfile_path, gr, strandedness, quality_threshold=10L,
coverage_threshold=1, lts_align_expr_prefix=Sys.getenv("LTS_ALIGN_EXPR_PREFIX"),
bamfile_suffix='"_sorted_aligned_reads_with_annote.bam"'){
bamfile_index = paste0(bamfile_path, ".bai")
p_param <- PileupParam(min_mapq = quality_threshold,
min_nucleotide_depth = coverage_threshold,
distinguish_strands=TRUE,
distinguish_nucleotides=FALSE)
# create the scanBamParam object.
gene_strand = as.character(unique(data.frame(gr)$strand))
minus_strand_flag = NA
if (strandedness=='reverse' & gene_strand == '+'){
minus_strand_flag = TRUE
} else if (strandedness=='reverse' & gene_strand == '-'){
minus_strand_flag = FALSE
}
sbp = ScanBamParam(which=gr,
flag=scanBamFlag(isMinusStrand=minus_strand_flag,
isSecondaryAlignment=FALSE,
isNotPassingQualityControls=FALSE,
isSupplementaryAlignment=FALSE))
coverage_df = pileup(bamfile_path,
index = bamfile_index,
scanBamParam = sbp,
pileupParam = p_param)
# take only distinct positions, and create a boolean vector
# based on whether the count for a given position is > 0.
# Sum over the boolean vector
sum(coverage_out %>%
distinct(pos, .keep_all=TRUE) %>%
pull(count) > 0)
coverage_df
, before the sum()
cmd, looks like this:
> head(coverage_out)
seqnames pos strand count which_label
1 CP022326.1 650760 - 5 CP022326.1:650760-650894
2 CP022326.1 650761 - 5 CP022326.1:650760-650894
3 CP022326.1 650762 - 5 CP022326.1:650760-650894
4 CP022326.1 650763 - 5 CP022326.1:650760-650894
5 CP022326.1 650764 - 5 CP022326.1:650760-650894
6 CP022326.1 650765 - 5 CP022326.1:650760-650894
In both cases, I am then dividing by the length of the gene. Thankfully, the gene length does come out the same from both tools.
However, the RSamtools
method returns 167 more covered bases than the samtools depth
command. That is a difference between 98% covered and 89% covered, respectively. In looking at the browser, it is hard to tell -- this gene is supposed to be knocked out, and obviously is not. The difference between 89% and 98% covered is pretty close, to my eye. That said, I'd like to know what I'm doing that is causing the difference.
If anyone has any suggestions or ideas, I would appreciate it.
oh my...in looking at this again, I just realized what a stupid mistake that was. I am going to leave this up, just in case someone else is this much of a bonehead.
Clearly,
grep -v 0
removes anything with a zero in it...which means that something with '30' counts gets removed.