Hi, I have aligned a set of paired reads using BWA. I want to know the alignment length of each of the pairs. By length I mean the span of the reference genome where these two reads in the pair are mapped. I was thinking of using the following fields in the detailed sections of the bam file.
4. POS
6. CIGAR
9. TLEN (ideally this should give this result (positive one from the tow values for the pair))
However, I am not sure how this TLEN was calculated by BWA. Here are six examples from the bam. For brevity I am reporting only the columns 1,3,4,6,9:
Pair-1
D00545:139:C9VKJANXX:1:1101:1032:54904 chr11 33443776 2S99M 294
D00545:139:C9VKJANXX:1:1101:1032:54904 chr11 33444001 32S69M -294
Here 294 seems the correct length.
Pair-2
D00545:139:C9VKJANXX:1:1101:1032:57548 chr1 59493945 101M -124
D00545:139:C9VKJANXX:1:1101:1032:57548 chr1 59493922 69M32S 124
Pair-3
D00545:139:C9VKJANXX:1:1101:1032:62552 chr10 22566910 101M -207
D00545:139:C9VKJANXX:1:1101:1032:62552 chr10 22566804 69M32S 207
Pair-4
D00545:139:C9VKJANXX:1:1101:1032:63292 chr17 43439964 101M 185
D00545:139:C9VKJANXX:1:1101:1032:63292 chr17 43440080 32S69M -185
For pairs 2,3,4 also seem to have correct lengths 124,207,185 respectively.
However, problem comes with the following pairs.
Pair-5
D00545:139:C9VKJANXX:1:1101:1032:57141 chr19 49390966 101M -25
D00545:139:C9VKJANXX:1:1101:1032:57141 chr19 49391042 69M32S 25
Here the left one is mapped in the region (49390966,49391066) and the second one in the region (49391042,49391042+68=49391110). So the length should be = (49391110 - 49390966 + 1) = 145 which does not match with 25.
Another pair.
Pair-6
D00545:139:C9VKJANXX:1:1101:1032:73831 chr20 60365970 52M49S -128
D00545:139:C9VKJANXX:1:1101:1032:73831 chr20 60365784 37S60M4S 128
Here the left one is mapped in the region (60365784, 60365784+60-1) and the second one in the region (60365970,60365970+52=60366021). So the length should be = (60366021 - 60365784 + 1) = 238 which does not match with 128.
Could anybody help? Is there an existing tool to get this length I am interested about?
Thank you so much!
Soumitra
EDIT: All the columns
D00545:139:C9VKJANXX:1:1101:1032:54904 99 chr11 33443776 60 2S99M = 33444001 294 NTGGCACAGGATAGTTCATAAGCATCCCAAGGTGACTCCAAGGCACACATGTGATAANACAACCNNNNCNNTNTNNNTTCCTTCCCCAATACATTGTATTC #<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF#<FFFFF####<##<#<###<<FFFFFFFFFFFFFFFFFFFFFF NM:i:11 MD:Z:55A6A0C0T0G1C0C1G1G0T0A24 AS:i:77 XS:i:0
D00545:139:C9VKJANXX:1:1101:1032:54904 147 chr11 33444001 60 32S69M = 33443776 -294 NNNNANNNNNNNNNNNNCCNNCTANGNNTNNNTCAGAAGCNACTNCNNCNNNGAGNCTCTCNNNNGNTGAATGGACTCCCTCCCTTCCCTCCCNCCCNACC ####/############<<##<</#<##<###BFFFFB<<#F<<#<##<###F<<#FFF<<####<#FFFFFFFFFBFFFFFFFFFFFFBF<B#B<<#BBB NM:i:16 MD:Z:8T3T1C0T1C0T0T3A5C0T0T0A1C26T3C0T2 AS:i:37 XS:i:20
D00545:139:C9VKJANXX:1:1101:1032:57548 83 chr1 59493945 60 101M = 59493922 -124 GTCTTATAGCGTGAGTAGAAATTGNNNGNGNNTNNNNAGGGTGNCAGAGTACAAAGTACAAGGTGCAATTGGAGAACAAAGGCCATGAGGGAGGAGGCAGN FFB<FFFF/B<FFFFBFFF<F<<<###<#/##<####FF<F<<#FFFF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFB<B/B<<BB<<# NM:i:13 MD:Z:24A0C0C1G1G0G1A0C0A0A6G53G2T0 AS:i:75 XS:i:47
D00545:139:C9VKJANXX:1:1101:1032:57548 163 chr1 59493922 60 69M32S = 59493945 124 TGGNAAGNTGAGGTCTGAAGAAAGTCTTATAGCGNGNNNNGAAATNGACNNNGNNTNCAANGGGTGGCANNNTNNANAGTNNAANNNNNNNNNNNNGNNNN BBB#<<F#BBFFFFFFFFFFFFFFFFFFFFFFFF#<####<<<FF#<<F###<##<#<<F#<<FFFFFF###<##<#<<<##7<############<#### NM:i:16 MD:Z:0G2A3G26T1A0G0T0A5T3C0G0G1G0G1A3A8 AS:i:38 XS:i:28
D00545:139:C9VKJANXX:1:1101:1032:62552 83 chr10 22566910 60 101M = 22566804 -207 TAGTAAAATAACAGAGTTACTCAGNNNGNANNGNNNGGTTGTTNACGGTGGAGGGTGTCCAGGTTCTTGGCATCTTGAACAAAGAATTCGTCAAAACGCAN FFFFFFFFFFFFFFFFFFFFFF<<###<#<##<###FFFFF<<#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBB<<# NM:i:11 MD:Z:24G0T0G1G1T0A1C0A0T7A56C0 AS:i:80 XS:i:45
D00545:139:C9VKJANXX:1:1101:1032:62552 163 chr10 22566804 60 69M32S = 22566910 207 AAANCTGNCTCATATGTATACAGACTGCAGCAGGNANNNNACGTCNGGCNNNGNNGNACANTACATGTGNNNCNNTNGTANNGCNNNNNNNNNNNNANNNN BBB#<BF#B<FFFFFFFFFFFFFFFFFFFFFFFF#<####<<FFF#<<F###<##<#<<F#<<FFFFFF###<##<#<<<##<<############<#### NM:i:16 MD:Z:1C1C3G26A1G0G0C0A5T3T0A0C1A0T1C3A8 AS:i:37 XS:i:0
D00545:139:C9VKJANXX:1:1101:1032:63292 99 chr17 43439964 60 101M = 43440080 185 NCTTGCTAATGTTTGGTATTCTTCAACAGAGCACCAAGGTTTTCTCCCCACGATCAANACCAAGANNNCNNANGNNNTTACCCAACACTAAGACTTCAGAT #<<<B/FF/<FFBFFFFFFFFFFFFFF</FFFFB<<BFFBFFFFFFFFFFF//F/</#/</<FFF###/##/#/###/<<<F///F/FFB/BFFFFF/7/F NM:i:12 MD:Z:0G56C7C0T0T1A0G1T1C0G0T0G23 AS:i:75 XS:i:0
D00545:139:C9VKJANXX:1:1101:1032:63292 147 chr17 43440080 60 32S69M = 43439964 -185 NNNNTNNNNNNNNNNNNACNNGCANCNNGNNNGAGAGGGCNGATNTNNTNNNACANAGGGGNNNNGNTGACTGGGGGATCCTTATAAGTCCTTNCAGNGGG ####/############<<##/<<#<##<###FFFBF<<<#</<#<##<###<</#BF/<<####<#//BFFFFBF<</BBFFFFFFB//FBB#F<<#BBB NM:i:15 MD:Z:8A3C1T0C1C0G0C3C5A0G0T0G1T26G3A3 AS:i:39 XS:i:0
D00545:139:C9VKJANXX:1:1101:1032:57141 83 chr19 49390966 60 101M = 49391042 -25 CACCACTCCCGGCTAATTTTTGTANNNTNANNGNNNNCGGGGTNTCACCATATTGGTCAGGCTGGTCTTGAACTCCGGACCTCAGGTGATCCACCCGCTTN FFFFBFFFFFFFFFFFFFFFB<<<###7#<##<####FFFF<<#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBB<<# NM:i:12 MD:Z:24T0T0T1T1G0T1G0A0G0A6T56C0 AS:i:78 XS:i:54
D00545:139:C9VKJANXX:1:1101:1032:57141 163 chr19 49391042 42 69M32S = 49390966 25 GGANCTCNGGTGATCCACCCGCTTCAGCCTCCCANANNNNTGGGANTATNNNCNNGNGCCNCTGTGCCCNNNTNNCNGGCNNAANNNNNNNNNNNNCNNNN BBB#<FF#BBFFFFFFFFFFFFFFFFFFFF<FFF#<####<<<F<#<<<###<##<#<<<#<<FFFFFF###<##<#<<<##<<############7#### NM:i:15 MD:Z:3C3A26A1G0T0G0C5T3A0G0G1G0T1A3A8 AS:i:39 XS:i:33
D00545:139:C9VKJANXX:1:1101:1032:73831 97 chr20 60365970 51 52M49S = 60365784 -128 NCACCCATTCGTTCATCTCTCCATTCATCCATTCATCCATCCACCCATCCATTCATCNCTCTATCNNNTNNTNCNNNCATCCTTTCATCTCTCCATCCATT #<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFB#<<FFFFB###<##<#<###</<FBFFFFFFBFFFFFBFFF<FF NM:i:1 MD:Z:0A51 AS:i:51 XS:i:36
D00545:139:C9VKJANXX:1:1101:1032:73831 145 chr20 60365784 34 37S60M4S = 60365970 128 NNNNCNNNNNNNNNNNNCTNNCTANCNNTNNNTCCATTCANCCTNTCNTNNNTCCNTCCATNNNNCNATCCATCTATCCATTCATCTCTCTCTNCATNACA ####<############<<##<</#<##<###BFFFFB<<#F<<#<<#<###<<<#<B<<<####<#FFFFFFFFFFFF<FFFFFF/FFBBBB#F<<#BBB NM:i:13 MD:Z:3T3T2A1C0T0C3A5T0T0A0T1T26C3 AS:i:34 XS:i:0
From an existing sam/bam file, BBMap's reformat.sh can calculate the insert size distribution using the TLEN field:
...but it assumes the aligner was correct in calculating TLEN. If you map the raw reads using BBMap, you will get what I consider to be a more trustworthy result, which compensates for things like overlapping reads, insert sizes shorter than read length, and indels:
You can also generate an alignment-free insert-size histogram with BBMerge if your reads mostly overlap:
...or via extension, if they almost overlap, and you have sufficient coverage and memory:
Note that for human-sized genomes and WGS that last approach would require a substantial amount of memory, though for exomes it wouldn't.
Thanks Brian for your suggestion. I was not familiar with bbmap. Is there any paper that I could refer to?
There is a paper describing it here:
http://bib.irb.hr/datoteka/773708.Josip_Maric_diplomski.pdf
Looks like a bwa bug, what version are you using?
Program: bwa (alignment via Burrows-Wheeler transformation) Version: 0.7.12-r1039 Contact: Heng Li lh3@sanger.ac.uk
Try the most recent version (0.7.15) and see if that produces more reasonable results. Also, if that doesn't solve the problem then please post the entire alignment so we can see the flags.
Thank you, Devon Ryan, for pointing me to the latest version. I shall run that version. However, edited the question with all the columns in the SAM file, hoping that might help you find the issue.
Thanks for the update, it further suggests that this is a bwa bug.
Could you, Devon Ryan, please let me know what I need to do to file this bug?
Create a new issue on GitHub: https://github.com/lh3/bwa/issues
But it is a potential bug in BWA not samtools.
Sorry! Post above changed to reflect right place. You may need to create a GitHub account (free).
Tagging: lh3
I came to know about 'samtools stats' command, which among other things report insert size distribution. Does samtools use TLEN field? Or does it compute the insert length on its own?
Sorry for not trying this earlier. Apparently, 'samtools stats' just uses the TLEN field, as seen below: