It seems like this should be obvious, but after carefully looking at some .SAM lines, I'm having a difficult time getting the numbers to crunch.
For example, here are a few lines from my SAM file:
READ_ID_A:99419 99 Chr1 45474 50 76M = 45556 244 GTCTTTGCAGCAAAAGCAGAACAGTTGGTTTACGACTCACTCTTCTCGATACCTTCTCTGACGATGATTCTGCGAC
READ_ID_A:99419 147 Chr1 45556 50 4M86N72M = 45474 -244 ATTGTGTTCCATTGAATGATAAAGCCGCATCACGTTCTTCACCGCTTGTAAAAGAAAGAAAGGCAAAGACTCTGTT
READ_ID_B:27674 99 Chr1 155388 50 76M = 155531 219 TTCAGCTTCTTTGAATCTCTTGACGTTGTGTAGAAGCCATTTGTATGATTCATCTTTTCGGTCTTGACACGGATCG
READ_ID_B:27674 147 Chr1 155531 50 76M = 155388 -219 CACACGACACCGTTTCGTCTAGCTTCGGCAAGTGAAGCAGAAACGTGAGGACGTTGGCATTTGATGCATAGAAAAT
READ_ID_C:17835 99 Chr1 180537 50 76M = 180672 211 TGCGCTTGTGGTTGATCTTTCTTCTCTCCTTCCTTCTTATCGCCACCTTCTTTCTTCTCTTCTTCCTTCTTCGGTG
READ_ID_C:17835 147 Chr1 180672 50 76M = 180537 -211 CCACCACCTTCCTTCTTCGGCTCCTCCTTCTTCTCCTTTTCCGGCTCTTTCGCAGGTCCCACTAGTACGATATCCG
Now, according to the SAM format specification, fields 4 and 8 are the leftmost starting points of the R1 and R2 reads, respectively. So, if R1 for READ_ID_A starts at 45474, and the read is 76 bp long, the end point should be 45474 + 76 = 45550. Then, the R2 leftmost starting point is 45556, which is only 6 bases away from the R1 ending point! It seems to me that the insert would be 6 bases (!!), but field 9 specifies this insert as 244.
I'm sure there is some fundamental error in my logic here, so I'm hoping someone can point it out for me. Thanks!
EDIT: Any ideas? All suggestions/comments appreciated!
IMHO one of the shortcomings of the SAM format is that it only reports the leftmost coordinate, so one has to do all that cumbersome little CIGAR parsing to figure out the rightmost end, which renders column-oriented tools unusable.
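For what it's worth, the parsing needed to get the rightmost end is fairly small. Here is a minimal Python sketch, assuming a 1-based POS and that only the CIGAR operations M, D, N, = and X consume reference bases, as the SAM specification defines them. The function name `reference_end` is made up for illustration and is not part of any library.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification.
REF_CONSUMING = set("MDN=X")

def reference_end(pos, cigar):
    """Return the 1-based, inclusive rightmost reference coordinate of an alignment."""
    span = sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in REF_CONSUMING
    )
    return pos + span - 1

print(reference_end(45474, "76M"))       # R1 of READ_ID_A -> 45549
print(reference_end(45556, "4M86N72M"))  # R2 of READ_ID_A (spliced) -> 45717
```

Note that the spliced mate covers 4 + 86 + 72 = 162 reference bases, not 76, because the 86N gap counts toward the reference span.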
@ashutoshmits Thanks for the great explanation, that certainly makes more sense now. So, you compared the leftmost read coordinate (start of R1) with the rightmost read coordinate (end of R2) for a total of 244 bases. Since R1 and R2 each have 76 bp, could I then be sure the insert size was 244 - 76 - 76 = 92 bases?
Yes. Tools like TopHat ask you to specify the inner distance between mate pairs. It will be ~100 bp in your case. Tools like BWA estimate the insert size from the first few thousand alignments of the mate-pairs/paired-ends.
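To check the numbers in field 9 against the three pairs posted in the question, here is a rough Python sketch. It assumes a forward/reverse pair where the first mate starts at or before the second, 1-based positions, and that field 9 spans the leftmost mapped base to the rightmost mapped base of the pair. The helper names `ref_span` and `pair_stats` are hypothetical.

```python
import re

def ref_span(cigar):
    """Count reference bases covered by an alignment (M, D, N, =, X consume reference)."""
    return sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in "MDN=X"
    )

def pair_stats(pos1, cigar1, pos2, cigar2):
    """Return (template_length, genomic_gap) for a pair with pos1 <= pos2."""
    end1 = pos1 + ref_span(cigar1) - 1  # rightmost base of the first mate
    end2 = pos2 + ref_span(cigar2) - 1  # rightmost base of the second mate
    template_length = end2 - pos1 + 1   # leftmost base to rightmost base, inclusive
    genomic_gap = pos2 - end1 - 1       # unaligned reference bases between the mates
    return template_length, genomic_gap

# POS and CIGAR values copied from the SAM lines in the question.
pairs = {
    "READ_ID_A": (45474, "76M", 45556, "4M86N72M"),
    "READ_ID_B": (155388, "76M", 155531, "76M"),
    "READ_ID_C": (180537, "76M", 180672, "76M"),
}

for name, fields in pairs.items():
    print(name, pair_stats(*fields))
# READ_ID_A (244, 6)
# READ_ID_B (219, 67)
# READ_ID_C (211, 59)
```

The template lengths match field 9 in all three cases. For the unspliced pairs B and C, subtracting both read lengths gives the same value as the genomic gap (219 - 152 = 67 and 211 - 152 = 59). For READ_ID_A, 244 - 76 - 76 = 92 is the 6-base gap plus the 86 bases skipped by the N operation in R2, so whether 92 or 6 is the relevant number depends on whether you measure on the genome or across the splice.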
Thanks for all of the help, I am much more familiar with SAM now!
Thanks for the explanation. I was trying to understand the answers in this post: C: Bowtie2 classification of discordantly mapped pairs. Your post helped with that.
Thank you so much, I've been looking for this answer for a long time.