sam format in the emboss water program
1
1
Entering edit mode
13 months ago
Malcolm.Cook ★ 1.5k

It appears to me that the sam format used by emboss's water disagrees with the Samtools spec for NM flag, which reads:

NM: Number of differences (mismatches plus inserted and deleted bases) between the sequence and
reference, counting only (case-insensitive) A, C, G and T bases in sequence and reference as potential
matches, with everything else being a mismatch.
Note this means that ambiguity codes in both
sequence and reference that match each other, such as ‘N’ in both, or compatible codes such as ‘A’ and
‘R’, are still counted as mismatches. The special sequence base ‘=’ will always be considered to be a
3
match, even if the reference is ambiguous at that point. Alignment reference skips, padding, soft and
hard clipping (‘N’, ‘P’, ‘S’ and ‘H’ CIGAR operations) do not count as mismatches, but insertions and
deletions count as one mismatch per base.
Note that historically this has been ill-defined and both data and tools exist that disagree with this
definition.

Here is a simple demonstration of the disagreement, in which the alignment between the same 2 sequenced either holds two insertions or two mismatches, depending upon the -gapopen penalty:

$ water -gapopen 0 -aformat3 sam -filter -asequence 'ASIS:AAAAATTTTT'  -bsequence 'ASIS:AAAAAGGTTT' 
@HD VN:1.3
@PG ID:water    VN:6.5.7.0  CL:-gapopen 0 -aformat3 sam -filter -asequence ASIS:AAAAATTTTT -bsequence ASIS:AAAAAGGTTT
asis    0   asis    1   255 5M2I3M  *   0   0   AAAAAGGTTT  *   AS:i:40.0   NM:i:0


$ water -gapopen 10 -aformat3 sam -filter -asequence 'ASIS:AAAAATTTTT'  -bsequence 'ASIS:AAAAAGGTTT' 
@HD VN:1.3
@PG ID:water    VN:6.5.7.0  CL:-gapopen 10 -aformat3 sam -filter -asequence ASIS:AAAAATTTTT -bsequence ASIS:AAAAAGGTTT
asis    0   asis    1   255 10M *   0   0   AAAAAGGTTT  *   AS:i:32.0   NM:i:2

It appears to me that NM here encodes the number of mismatches between sequence and reference, and ignores insertions.

This is in fact desirable for my purposes, but I'd like to be able to count on it.

Is there some documentation I am missing on Emboss' use of this format that might confirm this observation?

In `the water documentation I read:

The available pairwise alignment format names are: ... match, sam, bam, ...

See: http://emboss.sf.net/docs/themes/AlignFormats.html for further information on alignment formats.

However further information on sam format is not provided at AlignFormats page.

I am running linux/centos7 and using

water --version
EMBOSS:6.5.7.0
emboss alignment sam water • 1.8k views
ADD COMMENT
2
Entering edit mode

The SAM spec text you quote dates from 2018. Prior to that it merely said “Edit distance to the reference, including ambiguous bases but excluding clipping”, which is all that the EMBOSS authors would have had to go on — so just about any implementation would have fitted that broad definition.

Note that “Samtools spec” is a particularly inaccurate way to refer to the SAM Specification. The document is maintained by a committee under the aegis of GA4GH, and made available at samtools.github.io but as described at that page this GitHub organisation is shared by several groups only one of which is samtools. And of course the format is not samtools-specific.

ADD REPLY
0
Entering edit mode

One odd thing about the SAM spec (which I noted on the account of this thread) is that it does not seem to be versioned.

Tags - nothing to show

Why is that I wonder, and shouldn't it be versioned? After all the spec does change in time.

ADD REPLY
0
Entering edit mode

The title page of the PDF gives a datestamp and a commit hash, currently “2 Feb 2023 […] This printing is version aa7440d […]”, which we believe is the right amount of versioning for this specification document. The datestamp tells you how old your copy is and which of two copies is the newer, and the commit hash identifies the exact LaTeX source and the commits that it contains.

IMHO tags are not really useful for specification documents, because where e.g. a sam-1.5 tag should point is different for different audiences. See hts-specs #194 and associated issues for more discussion of specification versioning than anyone could possibly want to read.

ADD REPLY
0
Entering edit mode

Imagine that someone says: "My software has timestamps generated in the git log that show both the date and the hash of the commit, I believe that ought to be enough information to figure out what has changed " - technically, that is correct. In practice, unfeasible.

In my opinion, the SAM spec should have versions bumped at each time it materially changes. For example, if X becomes a valid CIGAR operator, that should be a bump in the version with a description of the change. Like any normal standard would.

The entire bioinformatics world runs on the format - what hope is there for any standardization even the format subtly changes in time.

PS: in related news I have recently found out that you can't even VCF eval the GIAB releases from 2016 with newer ones because the VCF files released by GIAB are NOT compatible ... that is the world we live in

ADD REPLY
0
Entering edit mode

A specification document is not software.

The SAM format has a version number that is bumped when incompatible changes occur. X has been a valid CIGAR operation for so long that it's not a good example (but see hts-specs #743). A better example is the B array aux field type: if you care to read the Version History appendices in the two SAM specifications (SAMv1 and SAMtags), you'll see that the addition of B array syntax caused a version number bump to VN:1.4. Like any normal standard would.

Thank you for your interest, but this is not a good venue for this discussion. The current open issue on this topic is hts-specs #725; interested readers should also read the other issues linked from there before contributing.

If you wanted to file an issue describing the specific problems you encountered with GIAB VCF files, that would be helpful.

ADD REPLY
1
Entering edit mode

Beyond what @Istvan said I am not sure EMBOSS is actively being developed so you may not need to worry about the "feature" that you like changing. I am not able to reach the download site for EMBOSS code today (http://emboss.open-bio.org/pub/EMBOSS/ )

Release history here: https://emboss.sourceforge.net/developers/changelog.html has the last release which is ~10 yr old at this point. One you have a bit newer but perhaps not very new.

ADD REPLY
0
Entering edit mode

Thanks. Indeed it is listed now as one of OBF's retired projects. It was most useful. Sorry to see its loss of support.

ADD REPLY
0
Entering edit mode

EMBOSS is the epitome of a tool way-way ahead of its time - but also implemented with such an unusably bad interface that it hamstrung and cursed itself from the start.

There is a lesson there - that no matter how good a programmer one is (EMBOSS programmers tackled problems at times when these problems were far more difficult) - you have to pay attention to usability before anything else. Does not matter that you have a great idea and you are first - if the execution is bad.

Empowering others is what every tool programmer should aim for, and that means not reinventing interfaces.

Every time I use EMBOSS, I feel like it is working the opposite way I expect it to. The docs have always been awkward and backward too. So few people used these tools even when there were few alternatives.

ADD REPLY
0
Entering edit mode
13 months ago

In your cited text you have the following note regarding NM

Note that historically this has been ill-defined and both data and tools exist that disagree with this definition.

There you have your answer.

I believe that older versions of the Samtools spec may have not fully defined NM

ADD COMMENT
0
Entering edit mode

Thanks, but, that note doesn't really answer my question, which was "Is there some documentation I am missing on Emboss' use of this format that might confirm this observation?"

ADD REPLY
1
Entering edit mode

What I meant is that knowing that NM may be defined in different ways - seeing that being called out explicitly - coupled with the observation that NM here seems to be using a nonstandard definition confirms (at least in my mind) that they chose to implement an alternative way of computing NM.

I would say this is a bug that should be reported.

Not sure there is anyone there that would fix it, though.

ADD REPLY
0
Entering edit mode

indeed ... nobody home!

This is a delivery failure notification message indicating that an email you addressed to email address :
-- emboss@emboss.open-bio.org

could not be delivered. The problem appears to be :
-- Recipient server unavailable or busy

Additional information follows :
-- Connection timed out (Connection timed out)

This condition occurred after 30 attempt(s) to deliver over a period of 93 hour(s).
ADD REPLY

Login before adding your answer.

Traffic: 2368 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6