Decoding MM and ML tags in .SAM file
1
1
Entering edit mode
18 months ago
Ethan ▴ 10

Hi all,

I am trying to decode the SAM MM and ML tags read by read and create a table of methylation locations on that read. For example, from an aligned sam file:

Nanopore_Sequence_Example   4   *   0   0   *   *   0   0   
GTTATGTAACCTACTTGGTTCCATTACGTATTGCTGGTGCTGAAGATTGTAGGTGTCTTTGTGCAGAGTGTATGATATACACGGCGGTGCTGAAGAAAGTTATTGCGGGTGTATTTGTGCAGAAGTATATGATGTGCGCGGGCGGAGGTGCTGAAGAAAGTTGTCGGTGTCTTTGTGCAGAAGTATATGATGGCGAGGTGTTGAAGAAAGTTGTCGGTGTCTTTGTGCAGAAGTATATGATGTGCGCGGGCGGATCCGCCCGCGCATCCTTCTGCGCAAT    
"&+*+*)$#$%'&&')%&'(,)))*1555:>B@7777BBCD10.*.%%&)$$$*14//0.,.-..'(%(%&''..211--'$$%&+))56<;;998:44892.-,+)('&*)'&&((,++/0064:385566;>A>=6@<@AA?;:::=>>?==7=?@=<>>;BA@??@?=:;;;7011+,,).-,++++-&$$%)(,)*,)$%'(((&&'(&&%%%&+0///20877656??BAA@@ABBEFGC>=57793222532110,50-++$$$%&),,,+())    
rl:i:0  
MM:Z:C+h?,5,5,0,1,1,0,0,1,2,0,2,0,0,1,2,0,4;C+m?,5,5,0,1,1,0,0,1,2,0,2,0,0,1,2,0,4; 
ML:B:C,159,6,135,2,7,9,3,4,13,11,6,22,6,1,2,2,218,0,4,19,1,2,7,4,1,0,1,4,15,11,2,1,0,0

I'm trying to create a table/list that would look like this:

Read: Nanopore_Sequence_Example
Methylation position 1: 5mC Methylated
Methylation position 2: Unmethylated
Methylation position 3: 5hmC Methylated

I am new to methylation calling, any help would be amazing.

samtools Methylation SAM • 5.4k views
ADD COMMENT
5
Entering edit mode
18 months ago

To start, ยง1.7 of the SAMtools documentation is a good canonical reference for MM/ML tags: https://samtools.github.io/hts-specs/SAMtags.pdf

If you're using Javascript, I wrote a bit of code here to decode MM tag deltas to offsets (i.e., offsets from zero-based indexed alignment start position): https://github.com/alexpreynolds/higlass-pileup/blob/methylation/src/bam-utils.js#L109-L250

The above code is a fork off the HiGlass higlass-pileup track, which uses the GMOD bam-js library to read BAM data and make MM tag data available, along with other per-segment properties: https://github.com/GMOD/bam-js

I also have some Python code here that uses pysam to read a chromosome of BAM data and average m6A events at 5' positions: https://github.com/alexpreynolds/decode_m6A_events

More generally for these and other methylation event categories, you might also look at pysam's modified_bases and modified_bases_forward segment properties, which appear to decode MM and ML tags to key-value pairs: https://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment.modified_bases

The value for each base/strand/modification tuple key includes both positional offsets and a byte-encoded probability score between 0 and 255 (I believe), which I think is derived from the ML data. I didn't know about these properties ahead of time. If you're using Python to read your BAM/SAM data, that might be a timesaver.

It might be useful to use the ML data to threshold events, but in practice I do not know what is a good cutoff, so I have been using unthresholded events for calculating fold-change between two datasets, and I use a very rough ~94% (240) cutoff for visual rendering.

However one gets methylation events decoded to offsets relative to the reference sequence, it is necessary to use CIGAR data to move offsets forwards or backwards, depending on various operations, which aligns an event to its corresponding base. The Javascript and Python projects linked above are attempts to do this.

Hope this is helpful!


July 17 2023

There is a test fork of IGV here that shows m6A and 5mC events: https://github.com/ctsa/igv/releases/tag/m6a_view_v1

IGV presents reverse-stranded methylated reads in a manner different from what I had been using for a HiGlass-based viewer. However, IGV's presentation is prefered internally and so I have recently modified the decode_m6A_events (Python) and higlass-pileup (Javascript) repositories to be congruent with their visualization approach.

Interestingly, the Python decoder uncovered an issue with the pysam API and how reverse-stranded reads must be handled. The pysam documentation seems to claim that an alignment's forward_sequence property will be reverse-complemented, if the alignment is reverse-stranded; however, this is not true and the sequence must be rc'ed manually. This lead to some spurious results.

IGV also uses a much less stringent probability threshold for visualizing methylation events. So I decode probabilities for each offset, which may be worth looking at to decide what cutoffs are appropriate, depending on your application.

The main things I have learned from this project so far are that the specification is not well written, good examples for decoding and encoding are not provided, and that it is absolutely vital to hand-check every single result that comes out of decoding for different kinds of reads when using either Python or Javascript libraries to process BAM data, as there are inconsistencies between libraries and even errors in an API's documentation.

ADD COMMENT
0
Entering edit mode

I made changes to how I decode m6A events and updated links in the answer above. Python code is in a separate repository. Javascript code is a bit more fleshed out. Hopefully this is useful for others.

ADD REPLY
0
Entering edit mode

Seems pysam(0.22.0).modified_bases doesn't consider CIGAR though.

ADD REPLY
0
Entering edit mode

That's correct, the second step is to process the CIGAR string to further decode bases to the correct location.

ADD REPLY
0
Entering edit mode

You may try modkit program from Nanopore to get the indel considered. A powerful tool.

ADD REPLY

Login before adding your answer.

Traffic: 1644 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6