Question

Fragment Size: TLEN vs. isize

8

Entering edit mode

10.5 years ago

Rob 6.9k

Hi,

I'm in the process of adding (optional) support for using alignments to our tool for RNA-seq quantification, and I'm currently trying to model the fragment size distribution of aligned fragments in a set of reads. I'm a bit confused about exactly where this information is stored in a bam/sam file. According to the spec, the SAM file has a TLEN field that gives the "template length" of an alignment, which seems to me like it would be equivalent to the fragment length (except in the case of e.g. a chimeric alignment). However, the API doesn't expose this field directly, and instead in SAM and BAM files parsed via samtools, you have access to an isize field, which, from various online sources seems to be the "insert size" of the aligned fragment.

My question is how do these two fields relate? Are they different? Is the insert size the fragment size, the mate inner distance (frag. size - read lengths), or something else entirely? Unfortunately, the documentation on this isn't too clear, so I'm reaching out to someone whose dealt with this in practice before.

Thanks!

Rob

sequencing alignment next-gen • 18k views

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 10.5 years ago by Rob 6.9k

0

Entering edit mode

I am really confused by this statement: "instead in SAM and BAM files parsed via samtools, you have access to an isize field,"

Where do you find isize in the specs? I see it for block compression but that's it.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.5 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

Sure. In the spec, they describe TLEN.; However, if you look at the Samtools API, the relevant information is in the bam1_core_t structure. Basically, I can't seem to find any mapping in the API for accessing the TLEN field, but a few people online seem to suggest that this is encoded in the isize field of the bam1_core_t structure.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 10.5 years ago by Rob 6.9k

Ram · Accepted Answer · 2014-07-10

5

Entering edit mode

10.5 years ago

Devon Ryan 105k

The TLEN value itself is actually isize. Yes, this can be a bit confusing.

Edit: I removed an incorrect part about TLEN.

ADD COMMENT • link 10.5 years ago by Devon Ryan 105k

2

Entering edit mode

The word "insert size" is one of the terms that is almost never defined yet people keep using it quite liberally. It is very frustrating indeed. For example note how even the bwa manual uses the word insert size, whereas the samtools spec uses the work template length to describe the 9th column of the SAM file.

The samtools definition on TLEN says:

If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base.

I have come to believe that isize is the same as template length and it corresponds to the length of the DNA fragment that was enclosed (inserted) between the sequencing adaptors.

No format or API should ever directly report any other measure that depends on the read lengths. The whole insert size concept that involves adding read lengths seems ill defined, perhaps a remnant of an older time when we did not know better. If reads were to overlap then the insert size would need to be negative ...

that being said who knows what that API reports ...

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 10.5 years ago by Istvan Albert 102k

0

Entering edit mode

Well, the API reports whatever was in the SAM or BAM file to begin with. I actually just edited by answer since I now recall that the whole "insert size plus read lengths" thing was specific to one aligner (can't remember which at the moment). Since the API uses isize, the spec should just be changed to match that (I'll submit a pull request if I have time).

ADD REPLY • link 10.5 years ago by Devon Ryan 105k

0

Entering edit mode

Thanks for the quick and clear answer, Devon!

ADD REPLY • link 10.5 years ago by Rob 6.9k

0

Entering edit mode

I am trying to calculate the length fragments from a given BAM file and I still find this confusing. The current SAM spec says:

TLEN: signed observed Template LENgth. If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base. The leftmost segment has a plus sign and the rightmost has a minus sign.

To me, this clearly reads fragment length. But if I understand your answer, TLEN is not fragment length. Maybe I should open a new question?

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 9.2 years ago by Henrik Mannerström • 0

0

Entering edit mode

It'll depend on whether the alignments are spliced or not. If you're mapping DNA sequences, then the fragment length and "insert size"/"template length" are the same. For RNAseq data this is often not the case. What TLEN reports is the distance from the 5'-most to 3'-most position.

ADD REPLY • link 9.2 years ago by Devon Ryan 105k

0

Entering edit mode

Thank you for your quick reply! How should I interpret this BAM, which should be mapped DNA sequences:

CHM040_CHM053.5455421:1 65      chr10   38804303        255     20M     =       38804303        0       GAATGGAATCAACTCGATTG    *       NM:i:0
CHM040_CHM053.5455421:1 129     chr10   38804303        255     20M     =       38804303        0       GAATGGAATCAACTCGATTG    *       NM:i:0

What I see here is a fragment of 20bp. However, when I run this through samtools fixmate it comes out as

CHM040_CHM053.5455421:1 65      chr10   38804303        255     20M     =       38804303        0       GAATGGAATCAACTCGATTG    *       NM:i:0  MQ:i:255
CHM040_CHM053.5455421:1 129     chr10   38804303        255     20M     =       38804303        0       GAATGGAATCAACTCGATTG    *       NM:i:0  MQ:i:255

Which I take as samtools (Version: 1.2, using htslib 1.2.1) reporting zero length fragment.

So TLEN is not fragment length?

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by Henrik Mannerström • 0

0

Entering edit mode

Samtools reports whatever the aligner told it, so the question is, "Why did my aligner think that the fragment length is 0". The answer is that the alignment took place in a biologically impossible way, so there's no meaningful way to represent the impossibility.

BTW, samtools fixmate will set TLEN to 0 for these biologically impossible situations.

ADD REPLY • link 9.2 years ago by Devon Ryan 105k

0

Entering edit mode

First, I am really grateful for you taking time to explain this! My next question is: What should a 20bp PET alignment look like then? To clarify, I am in the belief that the reported positions in the SAM file are leftmost positions, so when the fragment length is equal to the read length, a PET might actually get the same position for both mates. BTW, the BAM public data (GSE33664). PS, this margin is getting smaller and smaller, we could switch to a new thread if you like.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by Henrik Mannerström • 0

0

Entering edit mode

The width won't shrink much more.

A 20 base fragment would be:

CHM040_CHM053.5455421:1 99      chr10   38804303        255     20M     =       38804303        20       GAATGGAATCAACTCGATTG    *       NM:i:0
CHM040_CHM053.5455421:1 147     chr10   38804303        255     20M     =       38804303        20       GAATGGAATCAACTCGATTG    *       NM:i:0

Note the flags.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 9.2 years ago by Devon Ryan 105k

0

Entering edit mode

OK, thanks! I see that SAM expects one of the mates to be reverse-complemented, and since this is ChIA-PET data, and for whatever other reasons, this is not so.

ADD REPLY • link 9.2 years ago by Henrik Mannerström • 0

0

Entering edit mode

You aligner expects that, the SAM format itself doesn't really care (though samtools fixmate will only fix TLEN if this is the case). You'll need to explicitly tell your aligner what constitutes a biologically valid alignment is the mates aren't pointed toward each other.

ADD REPLY • link 9.2 years ago by Devon Ryan 105k

0

Entering edit mode

The reason it is reported as zero is that, from the alignment alone we don't know how long the template length actually is.

|-------->
   |--------->

Your situation is reported like the image above. We know where the template started, we know that it must be equal or longer than the end of the read that extends further but we don't know how long the template actually is. In the most general case only the 5' ends represent fragment starts, the 3' ends are instrument dependent and terminate sooner than the fragment.

So the aligner resorts to reporting it as zero (or something else) but when both reads map to the same strand no matter what it reports it will likely not be correct in the most general case. In your case you have to write a custom tool to extract the proper template lengths based on information specific to the experiment.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by Istvan Albert 102k