SAM file: what does "template" mean?
10.1 years ago
bongbang ▴ 90

According to the spec, a template is "A DNA/RNA sequence part of which is sequenced on a sequencing machine or assembled from raw sequences." But what does that mean? The first field, "QNAME", is explained as "Query template NAME." I know that usually means the read name, so what does "template" have to do with it? Also from the spec: "RNEXT: Reference sequence name of the primary alignment of the NEXT read in the template. For the last read, the next read is the first read in the template." Does template just mean "file" then? But then a template can be "single-segment," in which case TLEN is set to 0. Try as I might, I can't make sense of that. Your explanation will be much appreciated.

sam • 5.8k views
10.1 years ago

In most cases, templates are reads or pairs of them. However, you could imagine a case where you assembled reads into contigs and then aligned those against a reference genome. Then the template would be the contig.

RNEXT is the name of the chromosome or contig to which the next read in the template aligns (the mate, in the case of a pair; the next member, in a larger group).
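To make the RNEXT column concrete, here is a minimal sketch in plain Python. It relies on two conventions from the SAM spec: "=" in RNEXT means "same as RNAME", and "*" means the information is unavailable. The function name `resolve_rnext` and the three records are made up for illustration; they are not from any real dataset or library.

```python
# Hypothetical example records (11 mandatory SAM fields, tab-separated).
# SEQ and QUAL are "*" to keep the lines short.
PAIR_SAME_CHROM = "frag1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*"
PAIR_OTHER_CHROM = "frag3\t65\tchr1\t100\t60\t50M\tchr5\t900\t0\t*\t*"
SINGLE_READ = "frag2\t0\tchr2\t500\t60\t50M\t*\t0\t0\t*\t*"


def resolve_rnext(sam_line):
    """Return the reference name the next read in the template maps to.

    Returns None when RNEXT is '*' (no information, e.g. a single-segment
    template).
    """
    fields = sam_line.rstrip("\n").split("\t")
    rname, rnext = fields[2], fields[6]  # columns 3 and 7 of the record
    if rnext == "=":
        return rname  # mate is on the same reference as this read
    if rnext == "*":
        return None  # nothing known about a next read
    return rnext


for record in (PAIR_SAME_CHROM, PAIR_OTHER_CHROM, SINGLE_READ):
    print(record.split("\t")[0], "->", resolve_rnext(record))
```

The point is that RNEXT describes another read of the same template, not a different template.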

10.1 years ago

The concept of template is somewhat obscure - and I for one always have a nagging feeling that I could be missing/misunderstanding something about it. Let's start with the definition:

Template: A DNA/RNA sequence part of which is sequenced on a sequencing machine or assembled from raw sequences.

I think this is far from optimal. That single sentence contains a lot of undefined concepts that don't help at all: sequencing machines, the sequencing process, raw sequences and assembly. Those just muddy the water. In my mind the correct definition would be:

Template: A DNA/RNA sequence from which one or more parts will be represented in the file.

Then let's move on, the SAM spec says:

Read: A raw sequence that comes off a sequencing machine. A read may consist of multiple segments. For sequencing data, reads are indexed by the order in which they are sequenced.

I think this could do with clarification as well. What happens here is that the concept of a segment is conflated with the concept of an alignment. A read does not actually consist of multiple segments. The full sequence of a read may be aligned to produce different locally aligned regions, and the sequence of each such region is called a segment. The key point is that a read does not a priori consist of segments, as stated above! The presence or absence of segments depends solely on the aligner.

What does not help is that, even though segments were defined relative to a read, in the rest of the spec they are almost always discussed relative to the template: for example, "next segment in the template". This can make reading the spec very confusing.

My mental model hierarchy is the following:

  1. Template --> The DNA fragment that was measured
  2. Reads --> Depending on the methodology, a template may produce one or more reads. These reads may cover the entire template or just a subsection of it. Reads originating from the same template typically cover different parts of it, and may represent either the template itself or its reverse complement.
  3. Segments --> Each read may produce one or more alignments that in turn will have aligned regions called segments. From these segments it may be possible to infer the size of the original template.
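The hierarchy above can be sketched in a few lines of plain Python: records that share a QNAME belong to the same template, FLAG bit 0x1 says whether the template has multiple segments, and TLEN is 0 for a single-segment template. The records and the helper names (`group_by_template`, `describe`) are invented for illustration, and this is only a rough sketch of my mental model, not a full SAM parser.

```python
from collections import defaultdict

# Hypothetical records: one paired-end template (frag1) and one
# single-segment template (frag2). SEQ and QUAL are "*" for brevity.
EXAMPLE_SAM = "\n".join([
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chr2\tLN:1000",
    "frag1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "frag1\t147\tchr1\t300\t60\t50M\t=\t100\t-250\t*\t*",
    "frag2\t0\tchr2\t500\t60\t50M\t*\t0\t0\t*\t*",
])


def group_by_template(sam_text):
    """Group alignment records by QNAME, i.e. by template."""
    templates = defaultdict(list)
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        f = line.split("\t")
        templates[f[0]].append({
            "flag": int(f[1]),
            "rname": f[2],
            "pos": int(f[3]),
            "tlen": int(f[8]),
        })
    return dict(templates)


def describe(records):
    """Report whether FLAG bit 0x1 marks the template as multi-segment."""
    return "multi-segment" if records[0]["flag"] & 0x1 else "single-segment"


templates = group_by_template(EXAMPLE_SAM)
for qname, records in templates.items():
    tlens = [r["tlen"] for r in records]
    print(qname, describe(records), "records:", len(records), "TLEN:", tlens)
```

For frag1, the two records carry TLEN values of equal size and opposite sign, which is how the size of the original template gets inferred from the aligned segments.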

PS: I do realize that it is quite difficult to write an unambiguous specification that also makes sense. Plus, the spec was written many years ago, without the benefit of hindsight.
