Thanks to the answer from DK and Istvan's comment I conclude that it is impossible to compute percent identity from the CIGAR string alone. This is because the CIGAR string does not contain enough information to capture Levenshtein-distance. Imo this is actually a misguided design of the string.
It might be feasible by obtaining proper tag annotation, but this way is blocked for me at the moment due to a gefault in samtools calmd.
Therefore, I will describe a way of doing the computation by re-aligning the sequences in the bam-file using R functions pairwiseAlignment
and stringDist
. I will then define the PID as the quotient of Levenshtein-distance and the length of the maximum length of query, subject.
I will put my code here as a go along.
bam = scanBam("reads.bam")
genome = read.DNAStringSet("genome.fna")
pids = numeric(0)
# calculate coordinates to cut out reference sequences
# this might not work if there are large gaps in the alignment
# on the other hand query width should capture this
start = bam[[1]]$pos - bam[[1]]$qwidth
end = bam[[1]]$pos + 2 * bam[[1]]$qwidth
start[start < 1] = 1
end[end>length(genome[[1]])] = length(genome[[1]])
for (i in 1:length(bam[[1]]$seq)) {
# do pairwise alignment against the matched sequence +- 1 query length
# according to Jeremy's comment
aln = pairwiseAlignment(bam[[1]]$seq[i], subseq(genome[[1]], start= start[i], end=end[i]), type="local")
d = stringDist(c(as.character(aln@pattern), as.character(aln@subject)), m="l", i=F)
l = max(c(width(aln@pattern),(width(aln@subject) )))
aln = NULL
cat (i);
pid = as.numeric((l-d)/l)
print(pid)
pids = c(pids,pid)
gc()
}
Edit: Use only portions of the sequence as suggested by Jeremy, that speeds it up quite a bit.
This will take some time as pairwiseDistance using Smith-Waterman is quite slow and is therefore only feasible with small bam files. For larger files, other alignment tools might be more appropriate. But this demonstrates how it works. (You have to use a loop here, this is intentional, because using lapply with pairwiseAlignment opens a memory-leak, process killed at >20GB mem usage)
Computationally, is there really a difference between skipped and deletion?
I thought that N was indicating a mismatch, or not?
None.
%identity = match-(NM)/length
S=20 M=40 I=5 D=5, length=20+40+5+5=70, match=40 ... so 57% (unless there were NM's).
Soft clipping = mismatch in most situations. Inserts & deletions are also mismatch.... in that they don't exist in the reference. Only way to get 100% is if it's ##M with no NM.