I am using samtools mpileup to generate a pileup for paired-end RNA sequencing data. I am curious about how samtools handles pair mates whose read mappings overlap. With regard to the simple example below, for positions 7-11, are both pair mates enumerated?
I am parsing the pileup output to quantify RNA editing at some positions and do not want to count an 'RNA molecule' twice just because the read pair mates 'overlapped.'
NOTE! This answer from Ashutosh Pandey is now misleading. At least for anyone using samtools >= 1.0 (which was released ~6 months after the answer was posted).
Since 1.0 (released Aug 15, 2014), we have the (somewhat confusingly named) --ignore-overlaps option:
With this option, samtools mpileup 1.x behaves like it did in 0.19, but the default behavior is to NOT DOUBLE COUNT overlapping paired end reads.** It has been this way for the past 3.5 years.
You can try it yourself:
$ samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 1.0 (using htslib 1.0)
$ cat test1.sam
@HD VN:1.3 SO:coordinate
@SQ SN:chr1 LN:248956422
seqid 163 chr1 39415 0 9M = 39418 12 ATGTGTTTT AAEE6EEEE
seqid 83 chr1 39418 0 9M = 39415 -12 TGTTTTGCA E<EAEEA<<
$ samtools mpileup test1.sam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr1 39415 N 1 ^!A A
chr1 39416 N 1 T A
chr1 39417 N 1 G E
chr1 39418 N 1 T i
chr1 39419 N 1 G Q
chr1 39420 N 1 T i
chr1 39421 N 1 T e
chr1 39422 N 1 T i
chr1 39423 N 1 T$ i
chr1 39424 N 1 g A
chr1 39425 N 1 c <
chr1 39426 N 1 a$ <
$ samtools mpileup -x test2.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr1 39415 N 1 ^!A A
chr1 39416 N 1 T A
chr1 39417 N 1 G E
chr1 39418 N 2 T^!t EE
chr1 39419 N 2 Gg 6<
chr1 39420 N 2 Tt EE
chr1 39421 N 2 Tt EA
chr1 39422 N 2 Tt EE
chr1 39423 N 2 T$t EE
chr1 39424 N 1 g A
chr1 39425 N 1 c <
chr1 39426 N 1 a$ <
Note that for this to work right, the paired read names must match (which they do for Illumina data).
If I have to guess then I think that the overlapping portion or bases will be counted twice.
You might drop your comment as an answer, as I think you are correct.
cross posted on seqanswer: http://seqanswers.com/forums/showthread.php?t=36151
Sorry, is cross-posting a no-no? I am trying to reach more people and plan to post here a link to any conclusive answer I get elsewhere.
see "Avoid cross-posting" in How to ask Good Questions on Technical and Scientific Forums