Hi all,
I am parsing a set of paired-end reads using Pysam. Before parsing my ".bam" file with Pysam, I make sure that it is sorted (by calling samtools sort), so that reads with smaller genomic coordinates precede ones with larger genomic coordinates.
I found that for certain read pairs, even though the "first" read end of a pair comes first in the file, pysam marks it as read2 (meaning is_read2 returns True for it), while the "second" read end of the pair is marked as read1. For example, in the following SAM file:
HWUSI-EASXXX_0001:6:99:772:1104#0 147 10 98472853 255 36M 98472914 0 AGACAAGATTTGGCCAAAGCTTCGAGTACTTGCAAG ggggegggggegggggdgdccggggggfggfggggf NM:i:0
HWUSI-EASXXX_0001:6:99:772:1104#0 99 10 98472914 255 10M384N26M = 98472853 0 CTGGTGAAAGGTATAATTGACAGCACAGTCTCAGAG eWdfegdgeggfagggdgg_dgdggggggfgbe_eg NM:i:0 XS:A:+ NS:i:0
The read that appears first in the file is the one with the smaller genomic coordinate (98472853). However, I find that is_read2 is true for that first read, while is_read1 is true for the second read (whose genomic coordinate is 98472914).
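For reference, here is a minimal sketch that decodes the FLAG values of the two records above (147 and 99) by hand, using the bit definitions from the SAM specification. It does not use pysam. The helper name decode_flag and the label strings are hypothetical, chosen only for illustration.

```python
# SAM FLAG bits as defined in the SAM specification.
# The label strings are illustrative names, not pysam attributes.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
}


def decode_flag(flag):
    """Return the labels of the bits set in a SAM FLAG value (hypothetical helper)."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# FLAG values taken from the two SAM records above.
print(147, decode_flag(147))
print(99, decode_flag(99))
```

Decoded this way, 147 has the 0x80 (read2) and 0x10 (reverse strand) bits set, and 99 has the 0x40 (read1) and 0x20 (mate on reverse strand) bits set, which matches what is_read2 and is_read1 report. In the SAM specification these two bits describe which segment of the template a read is, independently of its mapped coordinate.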
Does this mean that this read pair is problematic, or is this a technical issue? Any advice on this would be greatly appreciated. Thanks.