I used BWA mem to align some Miseq (2X300bp) reads to a human reference expecting very low mapping since human reads would be considered contaminants.
If I map read1 only I get the following output from samtools flagstat:
238834 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
147 + 0 mapped (0.06%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
which is what I would expect.
However if I map the reads in paired end mode, I receive the following flagstat out:
477668 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
307795 + 0 mapped (64.44%:-nan%)
477668 + 0 paired in sequencing
238834 + 0 read1
238834 + 0 read2
307704 + 0 properly paired (64.42%:-nan%)
307742 + 0 with itself and mate mapped
53 + 0 singletons (0.01%:-nan%)
25 + 0 with mate mapped to a different chr
11 + 0 with mate mapped to a different chr (mapQ>=5)
This output does not make sense to me and I am not sure why the % mapped is that high when read1 % mapped is so low.
Here is the output of read2 alignment only:
238834 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
153 + 0 mapped (0.06%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Why is the paired end analysis showing such high % mapped when each read separately has very low % mapped? am I not interpreting these outputs correctly? Thanks for your help!
Are the read pairs that align highly repetitive? What sort of MAPQs are you getting?
Out of the 307795 reads that map in the paired end analysis, 189907 reads have a mapQ >=5 (~61%).
Hmm, that's pretty repetitive, but not enough to fully explain these results. My theory was that your sequences were repetitive enough that they had so many possible alignments that the aligner gave up, but that when they were aligned as pairs that the space became restricted enough that things then worked. I suspect that's not what happened. Anyway, perhaps looking at the PE alignments in IGV would be helpful.
Thanks Devon for your input.
I checked out the PE alignments in IGV, as you suggested, and noticed that there are small little clusters (~20 bp in length) of alignments with depth of 20000 reads for some of these clusters. Both read1 and read2 are overlapping in those clusters. IGV shows that the reads are clipped pretty heavily since these clusters are about 20 bp and the reads can be 300 long.
This sequencing data is from from an amplicon assay meant to amplify a viral genome. If I take a look at the small area where some of these reads align and blast those against our viral genome I get 100% alignment.
I am still confused as to why in SE mode I get virtually no alignment but in PE I get these clusters of alignments.