It appears to me that the sam format used by emboss's water disagrees with the Samtools spec for NM flag, which reads:
NM: Number of differences (mismatches plus inserted and deleted bases) between the sequence and
reference, counting only (case-insensitive) A, C, G and T bases in sequence and reference as potential
matches, with everything else being a mismatch.
Note this means that ambiguity codes in both
sequence and reference that match each other, such as ‘N’ in both, or compatible codes such as ‘A’ and
‘R’, are still counted as mismatches. The special sequence base ‘=’ will always be considered to be a
3
match, even if the reference is ambiguous at that point. Alignment reference skips, padding, soft and
hard clipping (‘N’, ‘P’, ‘S’ and ‘H’ CIGAR operations) do not count as mismatches, but insertions and
deletions count as one mismatch per base.
Note that historically this has been ill-defined and both data and tools exist that disagree with this
definition.
Here is a simple demonstration of the disagreement, in which the alignment between the same 2 sequenced either holds two insertions or two mismatches, depending upon the -gapopen penalty:
$ water -gapopen 0 -aformat3 sam -filter -asequence 'ASIS:AAAAATTTTT' -bsequence 'ASIS:AAAAAGGTTT'
@HD VN:1.3
@PG ID:water VN:6.5.7.0 CL:-gapopen 0 -aformat3 sam -filter -asequence ASIS:AAAAATTTTT -bsequence ASIS:AAAAAGGTTT
asis 0 asis 1 255 5M2I3M * 0 0 AAAAAGGTTT * AS:i:40.0 NM:i:0
$ water -gapopen 10 -aformat3 sam -filter -asequence 'ASIS:AAAAATTTTT' -bsequence 'ASIS:AAAAAGGTTT'
@HD VN:1.3
@PG ID:water VN:6.5.7.0 CL:-gapopen 10 -aformat3 sam -filter -asequence ASIS:AAAAATTTTT -bsequence ASIS:AAAAAGGTTT
asis 0 asis 1 255 10M * 0 0 AAAAAGGTTT * AS:i:32.0 NM:i:2
It appears to me that NM here encodes the number of mismatches between sequence and reference, and ignores insertions.
This is in fact desirable for my purposes, but I'd like to be able to count on it.
Is there some documentation I am missing on Emboss' use of this format that might confirm this observation?
In `the water documentation I read:
The available pairwise alignment format names are: ... match, sam, bam, ...
See: http://emboss.sf.net/docs/themes/AlignFormats.html for further information on alignment formats.
However further information on sam format is not provided at AlignFormats page.
I am running linux/centos7 and using
water --version
EMBOSS:6.5.7.0
The SAM spec text you quote dates from 2018. Prior to that it merely said “Edit distance to the reference, including ambiguous bases but excluding clipping”, which is all that the EMBOSS authors would have had to go on — so just about any implementation would have fitted that broad definition.
Note that “Samtools spec” is a particularly inaccurate way to refer to the SAM Specification. The document is maintained by a committee under the aegis of GA4GH, and made available at samtools.github.io but as described at that page this GitHub organisation is shared by several groups only one of which is samtools. And of course the format is not samtools-specific.
One odd thing about the SAM spec (which I noted on the account of this thread) is that it does not seem to be versioned.
Why is that I wonder, and shouldn't it be versioned? After all the spec does change in time.
The title page of the PDF gives a datestamp and a commit hash, currently “2 Feb 2023 […] This printing is version aa7440d […]”, which we believe is the right amount of versioning for this specification document. The datestamp tells you how old your copy is and which of two copies is the newer, and the commit hash identifies the exact LaTeX source and the commits that it contains.
IMHO tags are not really useful for specification documents, because where e.g. a
sam-1.5
tag should point is different for different audiences. See hts-specs #194 and associated issues for more discussion of specification versioning than anyone could possibly want to read.Imagine that someone says: "My software has timestamps generated in the git log that show both the date and the hash of the commit, I believe that ought to be enough information to figure out what has changed " - technically, that is correct. In practice, unfeasible.
In my opinion, the SAM spec should have versions bumped at each time it materially changes. For example, if X becomes a valid CIGAR operator, that should be a bump in the version with a description of the change. Like any normal standard would.
The entire bioinformatics world runs on the format - what hope is there for any standardization even the format subtly changes in time.
PS: in related news I have recently found out that you can't even VCF eval the GIAB releases from 2016 with newer ones because the VCF files released by GIAB are NOT compatible ... that is the world we live in
A specification document is not software.
The SAM format has a version number that is bumped when incompatible changes occur. X has been a valid CIGAR operation for so long that it's not a good example (but see hts-specs #743). A better example is the B array aux field type: if you care to read the Version History appendices in the two SAM specifications (SAMv1 and SAMtags), you'll see that the addition of B array syntax caused a version number bump to
VN:1.4
. Like any normal standard would.Thank you for your interest, but this is not a good venue for this discussion. The current open issue on this topic is hts-specs #725; interested readers should also read the other issues linked from there before contributing.
If you wanted to file an issue describing the specific problems you encountered with GIAB VCF files, that would be helpful.
Beyond what @Istvan said I am not sure EMBOSS is actively being developed so you may not need to worry about the "feature" that you like changing. I am not able to reach the download site for EMBOSS code today (http://emboss.open-bio.org/pub/EMBOSS/ )
Release history here: https://emboss.sourceforge.net/developers/changelog.html has the last release which is ~10 yr old at this point. One you have a bit newer but perhaps not very new.
Thanks. Indeed it is listed now as one of OBF's retired projects. It was most useful. Sorry to see its loss of support.
EMBOSS is the epitome of a tool way-way ahead of its time - but also implemented with such an unusably bad interface that it hamstrung and cursed itself from the start.
There is a lesson there - that no matter how good a programmer one is (EMBOSS programmers tackled problems at times when these problems were far more difficult) - you have to pay attention to usability before anything else. Does not matter that you have a great idea and you are first - if the execution is bad.
Empowering others is what every tool programmer should aim for, and that means not reinventing interfaces.
Every time I use EMBOSS, I feel like it is working the opposite way I expect it to. The docs have always been awkward and backward too. So few people used these tools even when there were few alternatives.