How can both pairs in paired-end sequencing data be aligned to the forward strand without reverse complemented.
2
0
Entering edit mode
5.0 years ago
kunh ▴ 10

I am so confused about this SAM file data.

SRR1972739.9835 131     AF086833        127     60      101M    =       285     159    ...
SRR1972739.9835 67      AF086833        285     60      85M16S  =       127     -159    ...

The one read has FLAG 131 and other read has 67. (FLAG 131 means that one pair is 'PAIRED,PROPER_PAIR,READ2' and FLAG 67 means that 'PAIRED,PROPER_PAIR,READ1' another is.)

I know that if one read is on the forward strand, the other should be on the reverse strand.

r1 ------->         forward
===========================
reverse        <-------- r2

So I think one read should have reverse-complement flag(0x10) and it is impossible for both pairs to have 0x10 (REVERSE) or 0x20 (MREVERSE) flag. Therefore, I think the alignment with FLAG

       0x10  0x20
read1   0     1
read2   1     0

and

       0x10  0x20
read1   1     0
read2   0     1

are possible and

       0x10  0x20
read1   0     0
read2   0     0

and

       0x10  0x20
read1   1     1
read2   1     1

are impossible. Am I wrong? Or is the data wrong? As I know, sometimes the mate sequence is attached to another chromosome, but I don't know how. I would appreciate it if you could give me the answer.

alignment next-gen sequencing sequence • 2.5k views
ADD COMMENT
0
Entering edit mode

There's a few cases where you'll definitely want to check the alignments by hand. That said, I tend to regard "properly" with a bit of suspicion.

67/131 only trip read_is_paired/in_proper_pair and R1 | R2 with neither being "reverse strand", which has interesting implications about orientation. You're correct in that one or the other should be on the reverse strand but this clearly isn't marked here.

Also in addition to flag 67, do you see any flag 83 (read reverse) or 99 (mate reverse) ?

ADD REPLY
0
Entering edit mode

For my own data, lots of 83/163 (read paired, mapped in proper, reverse strand, first && read paired, mapped in proper, mate_reverse_second in pair) and 99/147 (paired/proper/reverse/second && paired/proper/mate_reverse/first); assembly with megahit, mapping with BWA.

I'm just parsing out the sam flags via

samtools view Sample.sorted.bam | awk '{print $2}' | sort | uniq -c ;

which will let you look at the full distribution of flags detected in the bam file along with counts. May want to check which sets of flags are the majority in your set?

15954684 147 /  99
5945099 83/ 163
922556 141 / 77
365195 133 / 73
364682 89/  165
345488 97 / 145
339474 161 / 81
96041 137 / 69
95644 129 / 65
95480 101 / 153 #read unmapped, mate-reverse, 
95031 113 / 177 # read_reverse_strand and mate_reverse_strand & paired, but not proper
ADD REPLY
0
Entering edit mode

I have calculated the counts of each FLAG pairs (supplementary and secondary alignment are not included) using python.

@flag (83, 163) count: 2957
83: PAIRED,PROPER_PAIR,REVERSE,READ1
163: PAIRED,PROPER_PAIR,MREVERSE,READ2


@flag (99, 147) count: 2865
99: PAIRED,PROPER_PAIR,MREVERSE,READ1
147: PAIRED,PROPER_PAIR,REVERSE,READ2


@flag (77, 141) count: 2725
77: PAIRED,UNMAP,MUNMAP,READ1
141: PAIRED,UNMAP,MUNMAP,READ2


@flag (67, 131) count: 1367
67: PAIRED,PROPER_PAIR,READ1
131: PAIRED,PROPER_PAIR,READ2


@flag (115, 179) count: 51
115: PAIRED,PROPER_PAIR,REVERSE,MREVERSE,READ1
179: PAIRED,PROPER_PAIR,REVERSE,MREVERSE,READ2


@flag (81, 161) count: 12
81: PAIRED,REVERSE,READ1
161: PAIRED,MREVERSE,READ2


@flag (97, 145) count: 10
97: PAIRED,MREVERSE,READ1
145: PAIRED,REVERSE,READ2


@flag (73, 133) count: 6
73: PAIRED,MUNMAP,READ1
133: PAIRED,UNMAP,READ2


@flag (121, 181) count: 4
121: PAIRED,MUNMAP,REVERSE,MREVERSE,READ1
181: PAIRED,UNMAP,REVERSE,MREVERSE,READ2


@flag (65, 129) count: 2
65: PAIRED,READ1
129: PAIRED,READ2


@flag (69, 137) count: 1
69: PAIRED,UNMAP,READ1
137: PAIRED,MUNMAP,READ2

I can confirm that the mates which are proper' and having only one of the 'reverse' or 'mreverse' (flag 83/163, 99/147) are dominant but also see that there are still too many 'proper' but neither 'reverse' nor 'mreverse' pairs (flag 67/131).

.+ I used BWA-mem aligner with no optional parameters

bwa mem $ref $r1 $r2  res.sam
ADD REPLY
2
Entering edit mode
5.0 years ago

This could be be caused by a genomic rearrangement, or a miss-assembly of the reference sequence. For me, this would mean that the "PROPERLY_PAIRED" flag should not be set, but as @ctseto pointed out, the definition of "PROPERLY_PAIRED" is so different from one aligner to another that it is not much use unless you know the specific definition for the specific aligner.

ADD COMMENT
0
Entering edit mode

I would assume 67 should be 83 or 99 (read reverse or mate read reverse) and 131 would be 147 or 163 (read reverse or mate reverse)

ADD REPLY
0
Entering edit mode

Thank you for reply. I have to understand the concepts of miss-assembly and genomic rearrangement. I think the genomic rearrangement can be the reason because the reference genome I used is viral genome (ebolavirus).

ADD REPLY
1
Entering edit mode
5.0 years ago
ctseto ▴ 310

I suspect a lot of this boils down to how the sam flags are assigned and spec'd by the aligner. Even the whole "properly" thing involves some handwavium.

ADD COMMENT
1
Entering edit mode

I used BWA-MEM aligner with default option. I should inspect the bwa tool manual conscientiously. Thank you.

ADD REPLY
1
Entering edit mode

Same on my end, though my use case was mapping PE reads to the contigs which were generated from their de novo assembly. Definitely very strange...

ADD REPLY
1
Entering edit mode

Indeed, sometimes the mappings are a bit perplexing. One option if you have downstream plans is to pick out only the reads that have a particular combination of flags for your downstream analysis? It is tempting to just pick out the ones that are "mapped" or "unmapped" (mapped to your desired contigs, or not mapped to a contaminant when screening) and use the appropriate flags, but there's also a lot of granularity in exactly which flags you can pull out. I imagine some of it may be down to parameter choice...too many mismatches, or gaps, or whatnot may change a read from unmap but not map? Or if the insert length is exceptionally far, it may not be properly paired? Or if a sequence is present in its forward and reverse forms, you may get a mapping to the unexpected form even if it breaks expected pairing?

ADD REPLY

Login before adding your answer.

Traffic: 2544 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6