I have a sam file that I would like to parse information out of ( who wouldn't?) I am interested in finding mate pairs that map over a deleted sequence with respect to a reference genome. Something that looks like this
I'm interested in pulling out mate pairs with mapping offsets greater than 3SD from the offset mean. I'm confused on how to calculate the offsets of the mate pair positions in a sam file. Do I take the mapped position of the paired read and subtract it from the read?
Thanks for your help, Sam format really confuses me.
The ninth column in the SAM output represents TLEN or the template length. This length or number includes the lengths of both the reads. So in order to get the offset distance as shown in the figure above, you will have to subtract lengths of both the reads from the TLEN. I will call this number as insert-size. Here is what you can do:
In case you don't know the mean offset/insert-size, you may calculate it by taking the average of the first 50,000 offsets/insert-sizes of properly paired mate-pair reads. Take median in case you don't want to filter for properly paired mate-pair reads. This will give you an average offset or insert size.
Then you can calculate offsets for other mate-pairs and exclude the ones that have their offsets greater than 3 SD of the mean offset.
Thanks for the reply. I think I got it but have one more question. How do I find the lengths of the mate pairs? Can I extract this information from a line like this?
It can be calculated as you know the starting points of both the alignments (column 4 and 8). You also know the difference (TLEN) between the starting point of the forward alignment and the end point of the reverse alignment. If you subtract the first difference from the second difference you will get the read length of the second alignment. The length of the first alignments can be deduced from column 6. 36 M denotes that the read has 36 nt. Be careful with the strand specificity. I mean polarity of the TLEN.
Why don't you use the value of TLEN instead of calculating insert size to detect anomalous mate-pairs. I mean for your problem it doesn't matter if you use insert size or TLEN. Both should work fine I guess.
Why not just take the TLEN column?