Hello all,
From a BAM (or SAM) file, we can identify the positions of mismatches by looking at the MD:Z field. For example, if the CIGAR string is 45M30N35M with mismatches occurring at positions 5 (A->T) and 65 (C->G), then the MD tag would be MD:Z:4A59C15, reflecting the mismatched bases at the 5th and 65th positions. Of course, it gets a bit more complicated if there are consecutive mismatches and/or deletions followed by mismatches. If you're interested, this post explains it very well.
What I am interested in is, given the CIGAR string and the MD:Z tag, obtaining the positions of the mismatches as a vector. In this case it would be 5 and 65. I could implement it myself, but I am in two minds about it (due to time constraints) and was wondering whether any existing R packages (like GenomicRanges and such) have a way of obtaining this information directly. Is anyone aware of such a package?
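To make the goal concrete, here is a minimal sketch of the parsing involved. It is written in Python rather than R, it is not the code from any answer in this thread, and the function name `mismatch_positions` is purely illustrative. It assumes the MD tag follows the usual layout (match run lengths, single reference bases for mismatches, `^`-prefixed bases for deletions) and reports 1-based positions within the read, so inserted and soft-clipped bases in the CIGAR shift the result while N and D operations do not.

```python
import re

# MD tag tokens: a run of matches, a deletion (^ACG...), or one mismatched base.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")
# CIGAR tokens: a length followed by a single operation letter.
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def mismatch_positions(cigar, md):
    """Return 1-based read positions of the mismatches described by an MD tag.

    `md` is the value only, e.g. "4A59C15" (without the "MD:Z:" prefix).
    """
    # Step 1: 0-based offsets of mismatches among the aligned (M/=/X) bases.
    aligned_offsets = []
    offset = 0
    for run, deletion, base in MD_TOKEN.findall(md):
        if run:
            offset += int(run)
        elif deletion:
            continue  # deleted reference bases consume no read base
        else:
            aligned_offsets.append(offset)
            offset += 1

    # Step 2: walk the CIGAR to translate aligned offsets into read positions.
    positions = []
    read_pos = 0      # read bases consumed so far
    aligned_seen = 0  # aligned bases consumed so far
    idx = 0
    for length, op in CIGAR_TOKEN.findall(cigar):
        length = int(length)
        if op in "M=X":
            while (idx < len(aligned_offsets)
                   and aligned_offsets[idx] < aligned_seen + length):
                positions.append(read_pos + aligned_offsets[idx] - aligned_seen + 1)
                idx += 1
            aligned_seen += length
            read_pos += length
        elif op in "IS":
            read_pos += length  # insertions and soft clips occupy read bases
        # D, N, H and P consume no read bases, so nothing to do for them.
    return positions


print(mismatch_positions("45M30N35M", "4A59C15"))  # [5, 65]
```

For the example above this gives 5 and 65. Consecutive mismatches (written with a 0 between them, as in 10A0C5) and deletions followed by mismatches (as in 5^AC0T4) go through the same loop.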
Thank you very much in advance. And I wish the biostars forum a Merry Christmas and a very happy new year!
Hi Pierre, thanks for the code. I'm having trouble running it (the execute step). Error message: javac: invalid flag: reference.fa. Also, I managed to do this in R, as it's part of my pipeline. It'd be great to get this running, however.
I've added a line with "setDefaultValidationStringency".
You're right. It works like a charm. Thank you.
Hello Pierre, where can I get sam.jar?
See my "update" above. There is no more sam.jar; this is now standalone software: http://lindenb.github.io/jvarkit/Biostar59647.html