GAM format specification
1
0
Entering edit mode
3 days ago
guntul ▴ 40

Hi everyone, I am working with the GAM file format and trying to understand its structure. I convert my GAM files to JSON using vg, but I don't fully understand what the fields mean. The vg file formats page, the only source I could find, does not describe the fields. I want to extract the read sequence, the alignment position in the reference, and the CIGAR string or equivalent matching details. Which fields are mandatory, so that I can rely on their names in my code? Two example records from my files are:

{"fragment": [{"length": "-567", "name": "chr22"}], "fragment_length_distribution": "1927:443.135:148.44:0:1", "fragment_prev": {"name": "ERR903030.252281075 "}, "fragment_score": 51.0, "identity": 0.98412698412698407, "mapping_quality": 60, "name": "ERR903030.252281075 ", "path": {"mapping": [{"edit": [{"from_length": 17, "to_length": 17}], "position": {"node_id": "89533395", "offset": "15"}, "rank": "1"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"node_id": "89533396"}, "rank": "2"}, {"edit": [{"from_length": 25, "to_length": 25}, {"from_length": 1, "sequence": "C", "to_length": 1}, {"from_length": 6, "to_length": 6}], "position": {"node_id": "89533397"}, "rank": "3"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"node_id": "89533398"}, "rank": "4"}, {"edit": [{"from_length": 12, "to_length": 12}, {"from_length": 1, "sequence": "C", "to_length": 1}], "position": {"node_id": "89533399"}, "rank": "5"}]}, "quality": "HiAfISIiEB8PJB8PHQ8lIR0ODg4bJB8PDg8YJCYYIiYgHQ4bDg4OGw4ZDiIiISYmDhkZGR8PDxkiDiQPDyQPHyMOIh4YIg4OGQ4OGQ4YDg4iIiYbIiMiDw0iGiIaJiEODhkWDQ0XFR8NIiMiFSMNFQ0ZHxcODRciIgICAgIC", "refpos": [{"name": "chr22", "offset": "17119477"}], "score": 126, "sequence": "TCCCTGAGGTGGTGGCGGAGGTGGTGGAGGGGCGGAGGGCGGAGCACCGTAGCCCCCTCTGGCCCGACTCGGGGCGGCCCGATTGCCCCGGTCCCAGCAGCCCTCCAGGGCCTCCAGGCCCCGGCC", "time_used": 1221.0}

and

{"identity": 0.90000000000000002, "mapping_quality": 60, "name": "SRR24940081.1.1 M06097:87:000000000-L2HVL:1:1101:14851:1604 length=219", "path": {"mapping": [{"edit": [{"from_length": 2, "to_length": 2}, {"from_length": 1, "sequence": "C", "to_length": 1}, {"from_length": 16, "to_length": 16}, {"from_length": 1, "sequence": "A", "to_length": 1}], "position": {"name": "1452", "node_id": "1452", "offset": "9208"}}]}, "query_position": 96, "score": 2, "sequence": "TACTAATAAAATATGATGTA"}
pangenome gam vg • 192 views
2 days ago
Jouni Sirén ▴ 470

The GAM format is essentially a concatenation of protobuf messages. See the definition of the Alignment message in vg.proto. All fields are technically optional, but some are more important than others:

  • name and sequence describe the read.
  • path and query_position describe the alignment.
  • score and mapping_quality are the basic estimates of the quality of the alignment.

Then:

  • A path is a sequence of mappings.
  • A mapping describes the alignment between a part of the read and (a part of) a node. It consists of a graph position (the starting position of the mapping), a sequence of edits, and a rank (which is redundant).
  • Graph positions consist of a node identifier, a flag telling if we are on the reverse strand, and an offset within the node.
  • Each edit consists of a length in the target/reference (from_length), a length in the query/read (to_length), and possibly the corresponding part of the query sequence.
  • If there are no edits, the mapping is assumed to be a match from the starting position until the end of the node or the read.
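As a rough sketch of how these rules translate into code, the following Python snippet walks a path from vg's JSON output and builds a CIGAR-like string. The function names edit_op and cigar_like are made up for illustration. The classification of edits is an assumption based on the description above: equal lengths without a sequence count as a match, equal lengths with a sequence as a mismatch, from_length 0 as an insertion, and to_length 0 as a deletion. A mapping with no edits is rejected, because its implied length depends on the node length, which is not stored in the record. Note that the result describes the alignment along the graph path, not against a linear reference.

```python
from itertools import groupby


def edit_op(edit):
    """Classify one edit as a CIGAR-like (operation, length) pair."""
    # Missing fields mean 0 or "" (protobuf omits empty values).
    from_len = int(edit.get("from_length", 0))
    to_len = int(edit.get("to_length", 0))
    if from_len == to_len:
        # Same length on both sides: mismatch if a sequence is given.
        return ("X" if edit.get("sequence") else "=", from_len)
    if from_len == 0:
        return ("I", to_len)    # bases present only in the read
    if to_len == 0:
        return ("D", from_len)  # bases present only in the graph
    raise ValueError(f"unexpected edit: {edit!r}")


def cigar_like(path):
    """Concatenate the edits of all mappings and merge adjacent runs."""
    ops = []
    for mapping in path.get("mapping", []):
        edits = mapping.get("edit", [])
        if not edits:
            # Implicit full match; its length needs the node length.
            raise ValueError("mapping without edits is not handled here")
        ops.extend(op for op in map(edit_op, edits) if op[1] > 0)
    parts = []
    for op, run in groupby(ops, key=lambda pair: pair[0]):
        parts.append(f"{sum(length for _, length in run)}{op}")
    return "".join(parts)


# The path of the second record in the question.
example_path = {
    "mapping": [
        {
            "edit": [
                {"from_length": 2, "to_length": 2},
                {"from_length": 1, "sequence": "C", "to_length": 1},
                {"from_length": 16, "to_length": 16},
                {"from_length": 1, "sequence": "A", "to_length": 1},
            ],
            "position": {"name": "1452", "node_id": "1452", "offset": "9208"},
        }
    ]
}

print(cigar_like(example_path))  # 2=1X16=1X
```

The operation lengths that consume read bases (=, X, and I) add up to 20 here, which matches the length of the sequence field in that record.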

Offsets are generally 0-based. Because this is protobuf, fields with empty values ("", 0, 0.0, or false) are omitted from the output.
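In practice this means that code reading the JSON should never index a field directly, but supply the empty value as a default. The sketch below shows one way to do that in Python; the function name summarize and the sample record are made up for illustration, and is_reverse is assumed to be the name of the reverse-strand flag in the position. The records in the question also show that node_id and offset arrive as JSON strings, so they are converted with int().

```python
import json


def summarize(record):
    """Pull the basic fields out of one alignment, filling in defaults."""
    mappings = record.get("path", {}).get("mapping", [])
    # The alignment starts at the position of the first mapping.
    position = mappings[0].get("position", {}) if mappings else {}
    return {
        "name": record.get("name", ""),
        "sequence": record.get("sequence", ""),
        "mapping_quality": int(record.get("mapping_quality", 0)),
        "score": int(record.get("score", 0)),
        "aligned": bool(mappings),
        # 64-bit integers are written as strings in the JSON output.
        "node_id": int(position.get("node_id", 0)),
        "offset": int(position.get("offset", 0)),
        # Assumed field name; absent means forward strand.
        "is_reverse": bool(position.get("is_reverse", False)),
    }


# A shortened, made-up record: offset and is_reverse are absent,
# so they should come back as 0 and False.
line = (
    '{"mapping_quality": 60, "name": "read1", '
    '"path": {"mapping": [{"edit": [{"from_length": 4, "to_length": 4}], '
    '"position": {"node_id": "89533396"}}]}, '
    '"score": 4, "sequence": "ACGT"}'
)

info = summarize(json.loads(line))
print(info["node_id"], info["offset"], info["is_reverse"])  # 89533396 0 False
```

An unaligned read would have no path at all; with the defaults above it yields aligned set to False rather than raising a KeyError.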
