Hello all, I am aligning paired-end reads from Illumina 2x150 sequencing, but most of my DNA fragments are shorter than the read length. Judging by the flags, I assume the pairs are mapped correctly. After trimming and aligning with Bowtie 2, I noticed that in the majority of my pairs both mates have the same negative TLEN value, and the sequence column (SEQ, column 10) is exactly the same for both mates. As I understand TLEN, the leftmost segment gets the plus sign and the rightmost segment gets the minus sign, but according to the SAM manual, "If segments cover the same coordinates then the choice of which is leftmost and rightmost is arbitrary, but the two ends must still have differing signs". Assuming my fragments fall into this scenario, the mates have the same sequence, yet both always receive the negative sign. Is this normal? Also, why is the sequence the same in those cases? I need to retrieve from the SAM file the exact sequence that was aligned for each read (mate 1 and mate 2), and because of this problem I am losing information from one side. Does anyone have a suggestion for what I could do?
Here is a proper pair with different TLEN signs:
MN00409:35:000H2KJ2J:1:11102:12030:17923 99 CP047231 3573289 255 151M = 3573330 **192** **AACTTTTCCGGCTTCCCGTTCGTCAGTACCTCGGGAAGCCGCCAACCAGGATAAAATGTCAGCCCTAATCAGCGTTGCAGGATAAAGCACCGCTCACTCTTCAACAGACCGATTTGCACCCCAGCAAATGTAGCGTTATTGTTACCTTCCT** FFFFFFFFFFFFFF/F/FFFFFFFFFF/FFAFFFFFFFFFFF/AFFFFFFFFFFFFFFFFFFF/FFFFFFFFFFFFFFFFFFFFFFFFAFFFF6F6/6F=FAF/FFFFFFFFFFF=F=FF=FFFFFFAFFFFFFFFFFF=/FFFFAFFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:151 YS:i:0 YT:Z:CP
MN00409:35:000H2KJ2J:1:11102:12030:17923 147 CP047231 3573330 255 151M = 3573289 **-192** **CCAACCAGGATAAAATGTCAGCCCTAATCAGCGTTGCAGGATAAAGCACCGCTCACTCTTCAACAGACCGATTTGCACCCCAGCAAATGTAGCGTTATTGTTACCTTCCTTGCTACAGAGTTCGACAGATATCCCGCTATGACATTCTCCC** AA=F/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF/AFFFFFAAFAFFF=FFFFFFFFFFFFFFFF=FF/6/FFFFFF/FF=FFAFFFFFF=FFAFFFFF6F/AFFFFFF6FFFFFFFF6FFF/F/6A=F6 AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:151 YS:i:0 YT:Z:CP
Here is a "problematic" pair with the same TLEN sign:
MN00409:35:000H2KJ2J:1:11102:12474:20162 99 CP047231 322941 255 112M = 322941 **-112** **GGTGATTAAACGTGTGGCGAAGCAGCTCTCGCAGGAAGGCGGCTCGCTGAAGATGTACAACATCGCCGATCGCCTGGAAACGGTGATGTGGGAGAGCAAAAAGATGTTCCCC** AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFF/AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:-10 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:6C5C99 YS:i:-10 YT:Z:CP
MN00409:35:000H2KJ2J:1:11102:12474:20162 147 CP047231 322941 255 112M = 322941 **-112** **GGTGATTAAACGTGTGGCGAAGCAGCTCTCGCAGGAAGGCGGCTCGCTGAAGATGTACAACATCGCCGATCGCCTGGAAACGGTGATGTGGGAGAGCAAAAAGATGTTCCCC** =FFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFAFF AS:i:-10 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:6C5C99 YS:i:-10 YT:Z:CP
Thank you so much :)
Both reads are mapped to the very same position:
CP047231:322941
so the distance between end and start is always -112.
Still, isn't the OP correct that the signs should not be the same? The SAM spec seems to state that when we cannot decide which mate is the leftmost, one should be declared leftmost anyway, so its TLEN should be 112 and the other's should be -112.
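To make the sign convention concrete, here is a minimal Python sketch of what the spec seems to expect. The function names are hypothetical, the aligned reference length is passed in directly (which only equals the read length for all-match CIGARs such as the `151M` and `112M` above), and giving the plus sign to mate 1 on a tie is my own assumption, since the spec calls that choice arbitrary.

```python
def expected_tlen(pos1, ref_len1, pos2, ref_len2):
    """Return (TLEN of mate 1, TLEN of mate 2) for two mates on the same reference.

    pos*     : 1-based leftmost mapped position (SAM column 4)
    ref_len* : number of reference bases covered by the alignment
    """
    end1 = pos1 + ref_len1 - 1
    end2 = pos2 + ref_len2 - 1
    # Template length: leftmost mapped base to rightmost mapped base, inclusive.
    length = max(end1, end2) - min(pos1, pos2) + 1
    # Assumption: when both mates start at the same position, mate 1 is
    # treated as the leftmost segment (the spec allows either choice).
    mate1_is_leftmost = pos1 <= pos2
    return (length, -length) if mate1_is_leftmost else (-length, length)


def has_valid_signs(tlen1, tlen2):
    """True if the two TLEN values are equal in magnitude and opposite in sign."""
    return tlen1 == -tlen2


# The two pairs from the question:
print(expected_tlen(3573289, 151, 3573330, 151))
print(expected_tlen(322941, 112, 322941, 112))
print(has_valid_signs(-112, -112))
```

A check like `has_valid_signs` could be run over the TLEN columns of each pair to count how many records show the same-sign behaviour.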
The sequence is the same in both cases because a SAM file reports SEQ on the forward strand of the reference: when a read aligns to the reverse strand, the aligner stores the reverse complement of that read. In the case you show, the fragment is exactly as long as the reads, so each read fully covers the fragment and both SEQ fields end up identical.
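If the goal is to get each read back in the orientation it was sequenced in, one option is to undo that reverse complement for records with the reverse-strand flag bit (0x10) set. The Python sketch below is only an illustration under a few assumptions: the helper names are hypothetical, the input is a plain tab-separated SAM alignment line (without the `**` highlighting used in the post), and bases other than A/C/G/T/N are not handled.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def original_read(sam_line):
    """Return (name, mate, seq, qual) with seq/qual as they came off the sequencer.

    mate is 1 or 2 according to flag bits 0x40 / 0x80, or 0 if neither is set.
    """
    fields = sam_line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
    if flag & 0x10:
        # Aligned to the reverse strand: SEQ is stored reverse-complemented
        # and QUAL reversed, so flip both back.
        seq = reverse_complement(seq)
        qual = qual[::-1]
    return name, mate, seq, qual


# Toy records (not real data) with flags 99 and 147, as in the question.
fwd = "read1\t99\tchr\t100\t255\t4M\t=\t100\t4\tAACG\tABCD"
rev = "read1\t147\tchr\t100\t255\t4M\t=\t100\t-4\tAACG\tABCD"
print(original_read(fwd))
print(original_read(rev))
```

With identical SEQ fields in the SAM file, the two mates still come out as different strings here, because only the flag-147 record gets flipped back.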
But I would agree that the TLEN field is incorrect. What aligner are you using? Perhaps try a different one if possible.
Thanks so much for your answers. Now I understand why the sequence is the same! I am using Bowtie2.