Hi BioStar users,
while removing human contamination from paired-end metagenomic Illumina-shotgun data I ran into a strange problem. In a first step I mapped the paired-end data with BWA (sampe) against hg19. Then I only wanted to keep read that didn't map to hg19 and whose mates also didn't map, i.e. sam flags 0x4 and 0x8 both have to be set for the two reads in a pair to be considered clean (I'm not interested in single unmapped reads)
If I filter accordingly (no matter how) I end up with inconsistent mate-pair information / odd number of reads and here is why: there is at least one hit against the very end of chrM for a pair of reads, such that one read would span the end of chrM (also, both are strangely mapping to the same position). So far, so good: I've seen this before when mapping against circular or small chromosomes. BWA somehow deals with this and sets the alignment flag to "unmapped", but keeps the alignment coordinates in the SAM output.
Now, the first read gets flag 77 (read paired, read unmapped, mate unmapped, first in pair), the other flag 133 (read paired, read unmapped, second in pair), i.e. the mate unmapped flag is not set in the second read. The trouble starts when you only keep reads if they and their corresponding mate are unmapped, because all of a sudden you miss one read / have one too many.
Here the offending mate-pair (both unmapped, but one "mate unmapped" missing):
SRR341644.4631996 77 chrM 16500 37 75M = 16500 0 CATCTGGTTCCTACTTCAGGGCCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGGG CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBCCCCCCCCCCCCCCCCCCCCCCCDBCCC XT:A:U NM:i:2 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:73A0A0
SRR341644.4631996 133 chrM 16500 0 * = 16500 0 TGATAGACCTGTGATCCATCGTGATGTCTTATTTAAGGGGAACGTGTGGGCTATTTAGGCTTTATGGCCCTGA CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDCCCCCCCCCBCCC>CCCCCCCCCCCCCC
Why is this flag mismatch happening in the first place? To be more precise: while un-mapping the previously mapped read pair, the flags in only one of them are set properly. Both should set have 'mate unmapped' set. Why does this happen? Is this a bug?
Andreas
That's what I do now. Technically this shouldn't be necessary though if the flags were correct, i.e. if they would indicate that both reads and their respective mates are unmapped.