Entering edit mode
7.4 years ago
verne91
▴
20
E00247:267:HMVT3CCXX:4:2219:5172:54541 321 chr1 199910236 45 85H43M chrUn_gl000220 136768 -235 AGAGAGCAAGAAAAAGAAAGAGAGAGAGAGAGAGAGAGAGACA ,7,,,A,,7,A,A,A,A7FAFF,AAF7,,<AAFF7<7<F7A,< QT:Z:AAFFFKKK BC:Z:GCCACGCT QX:Z:AAFFFKKKKKKKKKKK AM:A:0 XM:A:0 TR:Z:AGAGAGA TQ:Z:KKKKKKK AS:f:-116 RG:Z:NA12878:LibraryNotSpecified:1:HMVT3CCXX:4 XS:f:-107.5 SA:Z:chr12,66982976,-,54M74S,2,0; BX:Z:TTCGCCACAATTAGAG-1 XT:i:0 RX:Z:TTCGCCACAATTAGAG OM:i:36 MI:i:1550405
E00247:267:HMVT3CCXX:4:2219:5172:54541 81 chr12 66982977 2 74S54M chrUn_gl000220 136768 385 TGTCTCTCTCTCTCTCTCTCTCTCTTTCTTTTTCTTGCTCTCTACCGCTCTCTGTCTAGCTCGGTCTGTGTGTGTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTGTCTGTCTGTCTCTCTCTCTCTC <,A7F<7<7FFAA<,,7FAA,FFAF7A,A,A,A,7,,A,,,7,,(,,A,,,A,7,,7,(A,,7,,,A,,,A,,,7,A,,,<A,,77,KKA,7,,,7,KF7KKKKF,KKF,KKKKKKKKKKKKKKKKKK QT:Z:AAFFFKKK BC:Z:GCCACGCT QX:Z:AAFFFKKKKKKKKKKK AM:A:0 XM:A:0 TR:Z:AGAGAGA TQ:Z:KKKKKKK AS:f:-106.5 RG:Z:NA12878:LibraryNotSpecified:1:HMVT3CCXX:4 XS:f:-106.5 SA:Z:chr1,199910235,+,85H43M,36,2; BX:Z:TTCGCCACAATTAGAG-1 XT:i:0 RX:Z:TTCGCCACAATTAGAG OM:i:2
These two reads come from NA12878 10X Genomics whole genome sequencing data.
Previous examples:
E00247:267:HMVT3CCXX:8:1202:19624:62663 83 chr1 9996 13 52S76M = 10004 -68 GAAGACGGCATACGCGATTCACCCTTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC ,,,<KFAFF<KKF<,FFA,A,,AA,AAFAFAF,<FA<<KFAKKF<FAFK<FFAKF,,F7,F7FFF<KFKFA7KFFFFAKKKKA,KFFK<AKKKKFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKK QT:Z:AAFFFKKK BC:Z:AAGGGTGA QX:Z:AAFFFKKKKKKKKKKK AM:A:0 XM:A:1 TR:Z:GGGTTAG TQ:Z:KKKKKKK AS:f:-76 RG:Z:NA12878:LibraryNotSpecified:1:HMVT3CCXX:8 XS:f:-76.5 BX:Z:AAGTGGGTCCATCGTC-1 XT:i:0 RX:Z:AAGTGGGTCCATCGTC OM:i:0
E00247:267:HMVT3CCXX:8:1202:19624:62663 163 chr1 10004 0 75M76S = 9996 68 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCGACGATAGACCCACATAGAACGCCAGAGCGTCGTGCAGAGAAAGAGTCTAGATCTCGGTGATCGCCGTATCATTAA AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKFKKKKKKKKKKKKKKKKK,FFKKKFAKFKA,,,(,,,,7FKK<F,,,,,,,<<,A(,AA,,,(,,7<<,,,,77,<,,,<7,,77A<7,,,,,<AFF(,,A7,7,, QT:Z:AAFFFKKK BC:Z:AAGGGTGA QX:Z:AAFFFKKKKKKKKKKK AM:A:0 XM:A:1 RX:Z:AAGTGGGTCCATCGTC AS:f:-76 RG:Z:NA12878:LibraryNotSpecified:1:HMVT3CCXX:8 XS:f:-76.5 BX:Z:AAGTGGGTCCATCGTC-1 XT:i:0 OM:i:0
It is an odd situation in which the first read's end (
1001+57+49+13
) extends past the end of the second read (1004+65
) . So practically speaking it is not quite what a "proper" alignment would look like.But taking directly the definition of TLEN based on alignments, it will go from the leftmost coordinate
1001
to the rightmost coordinate of the other read1004 + 65
and as it happens1001 - (1004 + 65) = -68
I am very sorry that I took an wrong example. I have updated the example. I think the problem maybe have something to do with hard clipping.
please don't modify the example since the existing explanations don't make sense then. You can always add a new example to the post as an edit.
The problem with your second example is that the pairs map to different chromosomes - at this point TLEN does not have a meaning.
I am sorry I made another mistake. I have added the previous example. Would TLEN be set 0 if the pairs mapped to different chromosomes?
The SAM specification is incomplete and does not define how hard or soft clipping should interact with TLEN (or, for that matter, how indels should affect TLEN). So, if reads are clipped, or if they contain indels, I would not trust TLEN since it's not specified in those situations.
If you want to find out how long a read is, you need to walk the cigar string.
I think the issue is that clipping does not factor into alignment and TLEN is a concept deduced from the aligned regions.
The wording is perhaps misleading - instead of "observed Template LENgth" it should be something like "apparent size" or "virtual length" a measure that reflects that we deduced this from comparing to a reference and not necessarily reality.
So... BBMap tries to identify the actual insert size (which it reports as TLEN) based on alignment. If there are long deletions in reads, those are not included in the insert size. Of course if paired reads have an intron / long deletion in the unsequenced middle region there's nothing I can do about that since it's not detectable. Furthermore, it correctly compensates for paired reads that have insert size shorter than read length. So I think it does a very good job of calculating TLEN.
That said - since TLEN is ambiguous... who knows? Yesterday, I was in a code review discussing circularity detection of plasmids. TLEN was being used as a proxy for aligned read length. Specifically, it was assumed that TLEN=10000 meant that the read aligned to 10kbp of reference sequence. I objected, noting that when there is clipping, TLEN is basically undefined (or actually, it's never fully defined, but BLASR does not support long indels so the only important point was clipping). But it turns out that BLASR produces the desired results in this case - specifically, for BLASR, TLEN means the length of the aligned sequence after hard-clipping and is completely unrelated to the observed transcript length, so it contradicts the definition. Essentially, it seems to simply report (rightmost coordinate)-(leftmost coordinate)+1 as TLEN, which is obviously incorrect, but without a strict definition, it's hard to say what is correct.
I suggest using great caution when considering TLEN for any analysis since it is not fully defined. In fact, I suggest never using it, ever.
How can I calculate the insert size for each entry if I don't use TLEN? just use (rightmost coordinate)-(leftmost coordinate)+1?
No, you parse the cigar string. Matches/mismatches contribute to length, deletions don't, but insertions do. And then adjust using the leftmost and rightmost coordinates of the read pair. For example, if a read has a 10000bp deletion, that should NOT contribute to the TLEN since obviously it is not part of the "template".