Imagine we have a read with two ends, A and B. An aligner finds that A maps uniquely to one location, while B can be mapped to two locations, each of which satisfies the expected insert size relative to A.
Conceptually, should we:
(a) Treat the ends as basically independent, and store 3 alignment records in SAM (A>P1, B>P2, B>P3), where the PNEXT field of both B alignments points to the A location, and the PNEXT field of the A alignment arbitrarily points to one of the two B locations? OR
(b) Treat the results as 2 consistent pairs, and write 4 alignment records (A>P1, B>P2 and A>P1, B>P3)? In this scenario, the PNEXT of the first A alignment points to the first B location, and the PNEXT of the second A alignment points to the second B location.
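To make the two options concrete, here is a minimal Python sketch. It is not real SAM output: each record keeps only the end name, POS, and PNEXT, the coordinates for P1, P2, and P3 are made up, and `Rec`, `option_a`, `option_b`, and `reciprocal_pairs` are hypothetical names used only for illustration. It counts how many pairs can be recovered by requiring that two records point at each other.

```python
from typing import NamedTuple


class Rec(NamedTuple):
    """Simplified stand-in for a SAM record (hypothetical, not a full record)."""
    end: str    # "A" or "B"
    pos: int    # POS of this alignment
    pnext: int  # PNEXT, the position of the mate this record points to


# Made-up coordinates for the three locations in the question.
P1, P2, P3 = 100, 300, 350


def option_a():
    # 3 records: A arbitrarily points to P2, both B records point back to P1.
    return [Rec("A", P1, P2), Rec("B", P2, P1), Rec("B", P3, P1)]


def option_b():
    # 4 records: A is written twice, once per consistent pair.
    return [Rec("A", P1, P2), Rec("B", P2, P1),
            Rec("A", P1, P3), Rec("B", P3, P1)]


def reciprocal_pairs(records):
    """Return (A, B) pairs whose POS/PNEXT fields point at each other."""
    a_recs = [r for r in records if r.end == "A"]
    b_recs = [r for r in records if r.end == "B"]
    return [(a, b) for a in a_recs for b in b_recs
            if a.pnext == b.pos and b.pnext == a.pos]


print(len(reciprocal_pairs(option_a())))
print(len(reciprocal_pairs(option_b())))
```

Under these assumptions, the first option leaves the B>P3 record without an A record pointing back at it, whereas the second option gives every record a reciprocal mate.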
The very existence of the PNEXT field implies to me that the authors intended to maintain pairing information, so interpretation (b) might be the correct one. However, if this is the case, some ambiguity seems unavoidable. For example, imagine that A has two alignments that both start at the same position (e.g. one of them spliced), and B also has two alignments that start at the same position. In this case, it seems the original pairings produced by an aligner cannot be represented unambiguously in SAM, because PNEXT alone cannot say which of the two mate alignments is meant.
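The ambiguous scenario can be sketched the same way. This is only an illustration under assumptions: the positions and CIGAR strings are invented, the records are reduced to four fields, and `Rec` and `candidate_pairs` are hypothetical names. The aligner is assumed to have intended two pairs, but matching on POS and PNEXT alone yields four candidates.

```python
from itertools import product
from typing import NamedTuple


class Rec(NamedTuple):
    """Simplified stand-in for a SAM record (hypothetical, not a full record)."""
    end: str    # "A" or "B"
    pos: int    # POS of this alignment
    cigar: str  # CIGAR, the only field that differs between the two hits
    pnext: int  # PNEXT, the position of the mate


# Invented example: both A hits start at 100, both B hits start at 300.
a_hits = [Rec("A", 100, "50M", 300), Rec("A", 100, "20M1000N30M", 300)]
b_hits = [Rec("B", 300, "50M", 100), Rec("B", 300, "25M500N25M", 100)]

# Pairings the aligner is assumed to have produced.
intended = [(a_hits[0], b_hits[0]), (a_hits[1], b_hits[1])]


def candidate_pairs(a_recs, b_recs):
    """All (A, B) combinations that are consistent with POS and PNEXT."""
    return [(a, b) for a, b in product(a_recs, b_recs)
            if a.pnext == b.pos and b.pnext == a.pos]


candidates = candidate_pairs(a_hits, b_hits)
print(len(intended), len(candidates))
```

Every combination passes the POS/PNEXT check, so the two intended pairs cannot be told apart from the two unintended ones using these fields alone.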
Thanks.
Close as a duplicate of the thread on the samtools mailing list? http://sourceforge.net/mailarchive/message.php?msg_id=28077133
I posted here to BioStar first, and cross-referenced this post on the mailing list. I would be tempted to leave it open here, and I'll update it with any relevant replies on the mailing list.