Thanks to the answer from DK and Istvan's comment I conclude that it is impossible to compute percent identity from the CIGAR string alone. This is because the CIGAR string does not contain enough information to capture Levenshtein-distance. Imo this is actually a misguided design of the string.
It might be feasible by obtaining proper tag annotation, but this way is blocked for me at the moment due to a gefault in samtools calmd.
Therefore, I will describe a way of doing the computation by re-aligning the sequences in the bam-file using R functions pairwiseAlignment
and stringDist
. I will then define the PID as the quotient of Levenshtein-distance and the length of the maximum length of query, subject.
I will put my code here as a go along.
bam = scanBam("reads.bam")
genome = read.DNAStringSet("genome.fna")
pids = numeric(0)
start = bam[[1]]$pos - bam[[1]]$qwidth
end = bam[[1]]$pos + 2 * bam[[1]]$qwidth
start[start < 1] = 1
end[end>length(genome[[1]])] = length(genome[[1]])
for (i in 1:length(bam[[1]]$seq)) {
aln = pairwiseAlignment(bam[[1]]$seq[i], subseq(genome[[1]], start= start[i], end=end[i]), type="local")
d = stringDist(c(as.character(aln@pattern), as.character(aln@subject)), m="l", i=F)
l = max(c(width(aln@pattern),(width(aln@subject) )))
aln = NULL
cat (i);
pid = as.numeric((l-d)/l)
print(pid)
pids = c(pids,pid)
gc()
}
Edit: Use only portions of the sequence as suggested by Jeremy, that speeds it up quite a bit.
This will take some time as pairwiseDistance using Smith-Waterman is quite slow and is therefore only feasible with small bam files. For larger files, other alignment tools might be more appropriate. But this demonstrates how it works. (You have to use a loop here, this is intentional, because using lapply with pairwiseAlignment opens a memory-leak, process killed at >20GB mem usage)
Computationally, is there really a difference between skipped and deletion?
I thought that N was indicating a mismatch, or not?
None.
%identity = match-(NM)/length
S=20 M=40 I=5 D=5, length=20+40+5+5=70, match=40 ... so 57% (unless there were NM's).
Soft clipping = mismatch in most situations. Inserts & deletions are also mismatch.... in that they don't exist in the reference. Only way to get 100% is if it's ##M with no NM.