Calculating Mate Pair Offset from SAM file
1
0
Entering edit mode
10.3 years ago

I have a sam file that I would like to parse information out of ( who wouldn't?) I am interested in finding mate pairs that map over a deleted sequence with respect to a reference genome. Something that looks like this Reference on top, Sample below

I'm interested in pulling out mate pairs with mapping offsets greater than 3SD from the offset mean. I'm confused on how to calculate the offsets of the mate pair positions in a sam file. Do I take the mapped position of the paired read and subtract it from the read?

Thanks for your help, Sam format really confuses me.

samtools offset perl mate-pair sam • 3.9k views
ADD COMMENT
1
Entering edit mode

Why not just take the TLEN column?

ADD REPLY
5
Entering edit mode
10.3 years ago

The ninth column in the SAM output represents TLEN or the template length. This length or number includes the lengths of both the reads. So in order to get the offset distance as shown in the figure above, you will have to subtract lengths of both the reads from the TLEN. I will call this number as insert-size. Here is what you can do:

  1. In case you don't know the mean offset/insert-size, you may calculate it by taking the average of the first 50,000 offsets/insert-sizes of properly paired mate-pair reads. Take median in case you don't want to filter for properly paired mate-pair reads. This will give you an average offset or insert size.
  2. Then you can calculate offsets for other mate-pairs and exclude the ones that have their offsets greater than 3 SD of the mean offset.
ADD COMMENT
0
Entering edit mode

Devon always beats me up :-)

ADD REPLY
2
Entering edit mode

"Beats you to it", not "beats you up" (there's a HUGE difference and I hope I'm not that mean!) :)

ADD REPLY
0
Entering edit mode

Thanks for the reply. I think I got it but have one more question. How do I find the lengths of the mate pairs? Can I extract this information from a line like this?

ERR001768.2344562    83    5    26793561    60    36M    =    26793415    -182    CAG..C    .A7?@/@<9=+<4?90@>3;..>?>??;=>@:3>??    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:36    RG:Z:ERR001768    OQ:Z:4<:=<4><9<)<4>70>>0<+1>>>>>7=>><3>>>

Or do I have to read the file?

ADD REPLY
0
Entering edit mode

It can be calculated as you know the starting points of both the alignments (column 4 and 8). You also know the difference (TLEN) between the starting point of the forward alignment and the end point of the reverse alignment. If you subtract the first difference from the second difference you will get the read length of the second alignment. The length of the first alignments can be deduced from column 6. 36 M denotes that the read has 36 nt. Be careful with the strand specificity. I mean polarity of the TLEN.

Why don't you use the value of TLEN instead of calculating insert size to detect anomalous mate-pairs. I mean for your problem it doesn't matter if you use insert size or TLEN. Both should work fine I guess.

ADD REPLY

Login before adding your answer.

Traffic: 2723 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6