Question

GAM format specification


3 days ago

guntul ▴ 40

Hi everyone, I am working with the GAM file format and trying to understand its structure. I convert my GAM files to JSON using vg, but I don't fully understand how the fields work. The vg file format page, the only source I could find, does not describe the fields. I want to extract the read sequence, the alignment position in the reference, and the CIGAR string or equivalent matching details. Which fields are mandatory, so that I can rely on their names in my code? Two examples from my files are:

{"fragment": [{"length": "-567", "name": "chr22"}], "fragment_length_distribution": "1927:443.135:148.44:0:1", "fragment_prev": {"name": "ERR903030.252281075 "}, "fragment_score": 51.0, "identity": 0.98412698412698407, "mapping_quality": 60, "name": "ERR903030.252281075 ", "path": {"mapping": [{"edit": [{"from_length": 17, "to_length": 17}], "position": {"node_id": "89533395", "offset": "15"}, "rank": "1"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"node_id": "89533396"}, "rank": "2"}, {"edit": [{"from_length": 25, "to_length": 25}, {"from_length": 1, "sequence": "C", "to_length": 1}, {"from_length": 6, "to_length": 6}], "position": {"node_id": "89533397"}, "rank": "3"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"node_id": "89533398"}, "rank": "4"}, {"edit": [{"from_length": 12, "to_length": 12}, {"from_length": 1, "sequence": "C", "to_length": 1}], "position": {"node_id": "89533399"}, "rank": "5"}]}, "quality": "HiAfISIiEB8PJB8PHQ8lIR0ODg4bJB8PDg8YJCYYIiYgHQ4bDg4OGw4ZDiIiISYmDhkZGR8PDxkiDiQPDyQPHyMOIh4YIg4OGQ4OGQ4YDg4iIiYbIiMiDw0iGiIaJiEODhkWDQ0XFR8NIiMiFSMNFQ0ZHxcODRciIgICAgIC", "refpos": [{"name": "chr22", "offset": "17119477"}], "score": 126, "sequence": "TCCCTGAGGTGGTGGCGGAGGTGGTGGAGGGGCGGAGGGCGGAGCACCGTAGCCCCCTCTGGCCCGACTCGGGGCGGCCCGATTGCCCCGGTCCCAGCAGCCCTCCAGGGCCTCCAGGCCCCGGCC", "time_used": 1221.0}

and

{"identity": 0.90000000000000002, "mapping_quality": 60, "name": "SRR24940081.1.1 M06097:87:000000000-L2HVL:1:1101:14851:1604 length=219", "path": {"mapping": [{"edit": [{"from_length": 2, "to_length": 2}, {"from_length": 1, "sequence": "C", "to_length": 1}, {"from_length": 16, "to_length": 16}, {"from_length": 1, "sequence": "A", "to_length": 1}], "position": {"name": "1452", "node_id": "1452", "offset": "9208"}}]}, "query_position": 96, "score": 2, "sequence": "TACTAATAAAATATGATGTA"}

pangenome gam vg • 192 views

updated 2 days ago by Jouni Sirén ▴ 470 • written 3 days ago by guntul ▴ 40

score 2 · Answer 1 · 2024-11-22

The GAM format is essentially a concatenation of protobuf messages. See the definition of Alignment. All fields are technically optional, so none of them is guaranteed to be present, but some are more important than others:

name and sequence describe the read.
path and query_position describe the alignment.
score and mapping_quality are the basic estimates of the quality of the alignment.
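As a minimal sketch of reading these fields from the JSON that vg produces (one alignment per line), the following Python pulls out the read-level values and falls back to a default when a field is absent. The function name `read_summary` and the shape of the returned dictionary are illustrative choices, not part of vg.

```python
import json


def read_summary(line):
    """Extract the read-level fields from one JSON-encoded alignment.

    Every field is optional, so each lookup supplies a default.
    """
    aln = json.loads(line)
    return {
        "name": aln.get("name", ""),
        "sequence": aln.get("sequence", ""),
        "score": int(aln.get("score", 0)),
        "mapping_quality": int(aln.get("mapping_quality", 0)),
        "query_position": int(aln.get("query_position", 0)),
        # An unaligned read has no path, or a path without mappings.
        "is_aligned": bool(aln.get("path", {}).get("mapping")),
    }


# Hypothetical, shortened record in the same shape as the examples above.
example_line = (
    '{"mapping_quality": 60, "name": "read1", "query_position": 96, '
    '"score": 2, "sequence": "TACTAATAAAATATGATGTA"}'
)
summary = read_summary(example_line)
```

Checking `is_aligned` before touching the path avoids a `KeyError` on unmapped reads.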

The alignment itself is structured as follows:

A path is a sequence of mappings.
A mapping consists of a graph position (the starting position of the mapping), a sequence of edits, and a rank (which is redundant), and it describes the alignment between a part of the read and (a part of) a node.
Graph positions consist of a node identifier, a flag telling if we are on the reverse strand, and an offset within the node.
Each edit consists of a length in the target/reference (from_length), a length in the query/read (to_length), and possibly the corresponding part of the query sequence.
If there are no edits, the mapping is assumed to be a match from the starting position until the end of the node or the read.
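The structure described above can be turned into a CIGAR-like string by walking the mappings in order and classifying each edit by its two lengths. The sketch below is one possible way to do this, not vg's own conversion. It assumes the usual reading of edits: equal lengths without a sequence are a match (`=`), equal lengths with a sequence are a mismatch (`X`), a zero `from_length` is an insertion (`I`), and a zero `to_length` is a deletion (`D`). Soft clips look like insertions at the ends of the read and are reported as `I` here. Mappings without edits are rejected because resolving them needs node lengths from the graph. The function names are illustrative.

```python
import json


def classify_edit(edit):
    """Return (operation, length) for one edit dict."""
    from_len = int(edit.get("from_length", 0))
    to_len = int(edit.get("to_length", 0))
    has_sequence = bool(edit.get("sequence"))
    if from_len == to_len:
        return ("X" if has_sequence else "=", to_len)
    if from_len == 0:
        return ("I", to_len)
    if to_len == 0:
        return ("D", from_len)
    raise ValueError(f"unsupported edit: {edit}")


def cigar_like(alignment):
    """Build a CIGAR-like string relative to the path through the graph."""
    ops = []
    for mapping in alignment.get("path", {}).get("mapping", []):
        edits = mapping.get("edit", [])
        if not edits:
            # Implicit match to the end of the node or read; the node
            # length is not available from the alignment alone.
            raise ValueError("mapping without edits needs node lengths")
        for edit in edits:
            op, length = classify_edit(edit)
            if length == 0:
                continue
            if ops and ops[-1][0] == op:
                ops[-1][1] += length  # merge runs that span node boundaries
            else:
                ops.append([op, length])
    return "".join(f"{length}{op}" for op, length in ops)


# The path of the second example in the question.
example = json.loads(
    '{"path": {"mapping": [{"edit": ['
    '{"from_length": 2, "to_length": 2}, '
    '{"from_length": 1, "sequence": "C", "to_length": 1}, '
    '{"from_length": 16, "to_length": 16}, '
    '{"from_length": 1, "sequence": "A", "to_length": 1}], '
    '"position": {"name": "1452", "node_id": "1452", "offset": "9208"}}]}}'
)
cigar = cigar_like(example)  # "2=1X16=1X"
```

The result describes the read against the nodes it traverses, so it is only comparable to a linear-reference CIGAR when the path follows that reference.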

Offsets are generally 0-based. Because this is protobuf, fields holding default values ("", 0, 0.0, or false) are omitted from the output. For example, a position that lists only a node_id starts at offset 0 of that node.
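In practice this means every lookup of a position needs a default. The sketch below normalizes a position dict into a `(node_id, offset, is_reverse)` tuple. It also converts `node_id` and `offset` with `int()`, because the examples in the question show those values quoted as strings. The flag name `is_reverse` is taken from vg's protobuf schema and does not appear in the examples, so treat it as an assumption and check it against your own output. The function names are illustrative.

```python
def normalize_position(position):
    """Return (node_id, offset, is_reverse) with protobuf defaults filled in."""
    return (
        int(position.get("node_id", 0)),
        int(position.get("offset", 0)),          # missing offset means 0
        bool(position.get("is_reverse", False)),  # missing flag means forward
    )


def start_position(alignment):
    """Graph position of the first mapping, or None for an unaligned read."""
    mappings = alignment.get("path", {}).get("mapping", [])
    if not mappings:
        return None
    return normalize_position(mappings[0].get("position", {}))


# Positions taken from the first example in the question.
first = normalize_position({"node_id": "89533395", "offset": "15"})
second = normalize_position({"node_id": "89533396"})
```

Returning `None` for reads without a path keeps unaligned records distinguishable from reads that start at offset 0.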