Paired-End Sam Flag Mismatch During Filtering
4
1
Entering edit mode
12.0 years ago
Andreas ★ 2.5k

Hi BioStar users,

while removing human contamination from paired-end metagenomic Illumina-shotgun data I ran into a strange problem. In a first step I mapped the paired-end data with BWA (sampe) against hg19. Then I only wanted to keep read that didn't map to hg19 and whose mates also didn't map, i.e. sam flags 0x4 and 0x8 both have to be set for the two reads in a pair to be considered clean (I'm not interested in single unmapped reads)

If I filter accordingly (no matter how) I end up with inconsistent mate-pair information / odd number of reads and here is why: there is at least one hit against the very end of chrM for a pair of reads, such that one read would span the end of chrM (also, both are strangely mapping to the same position). So far, so good: I've seen this before when mapping against circular or small chromosomes. BWA somehow deals with this and sets the alignment flag to "unmapped", but keeps the alignment coordinates in the SAM output.

Now, the first read gets flag 77 (read paired, read unmapped, mate unmapped, first in pair), the other flag 133 (read paired, read unmapped, second in pair), i.e. the mate unmapped flag is not set in the second read. The trouble starts when you only keep reads if they and their corresponding mate are unmapped, because all of a sudden you miss one read / have one too many.

Here the offending mate-pair (both unmapped, but one "mate unmapped" missing):

SRR341644.4631996    77    chrM    16500    37    75M    =    16500    0    CATCTGGTTCCTACTTCAGGGCCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGGG    CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBCCCCCCCCCCCCCCCCCCCCCCCDBCCC    XT:A:U    NM:i:2    SM:i:37    AM:i:0    X0:i:1    X1:i:0    XM:i:2    XO:i:0    XG:i:0    MD:Z:73A0A0
SRR341644.4631996    133    chrM    16500    0    *    =    16500    0    TGATAGACCTGTGATCCATCGTGATGTCTTATTTAAGGGGAACGTGTGGGCTATTTAGGCTTTATGGCCCTGA    CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDCCCCCCCCCBCCC>CCCCCCCCCCCCCC

Why is this flag mismatch happening in the first place? To be more precise: while un-mapping the previously mapped read pair, the flags in only one of them are set properly. Both should set have 'mate unmapped' set. Why does this happen? Is this a bug?

Andreas

samtools sam filtering next-gen • 9.1k views
ADD COMMENT
1
Entering edit mode
12.0 years ago

I assume you do not want the read that hangs off the end of chrM?

Rather than using flags, why don't you use awk to filter out reads that have entries in either the mapped chromosome column, or the mate mapped chromosome column?

ADD COMMENT
0
Entering edit mode

That's what I do now. Technically this shouldn't be necessary though if the flags were correct, i.e. if they would indicate that both reads and their respective mates are unmapped.

ADD REPLY
1
Entering edit mode
12.0 years ago

Considering the circular nature of this reference DNA may be confusing the situation? It appears, to me at least, that the issue is in an incorrect flag for the first read.

I would say that the flag of the first alignment (77) is wrong and should be (73) i.e. read is mapped. This is clear from the rest of the alignment info i.e. all the read maps perfectly to the reference: cigar string of 75M and a mapping quality of 37.

For circular DNA I might expect reads that spanned across the "join" of the two ends of the linearised reference sequence to be potentially troublesome to get an alignment. Depending on the aligner and parameters used, you might get two partial alignments to opposite ends of the reference for one of the reads in a pair while its mate has a single alignment across its full length at some distance from the ends of the reference. The read that spans the join may not have both its alignment found/reported i.e. it may have a partial alignment to just one end of the reference or the aligner may not have found any alignment.

So my question would be, what fudging is BWA actually trying to do/achieve by setting that read's flag as unmapped when it clearly is mapped? This may also be a bug in BWA. Maybe this is the reason behind the existence of "samtools fixmate" and why I'd probably run it on the output of BWA.

ADD COMMENT
0
Entering edit mode

See SeqAnswers for a thread, dated early 2010, that seems related: http://seqanswers.com/forums/showthread.php?t=5347

In particular, comment #10 provides an example that seems to show similar issues to yours with with the flags. A comment (#13) by Heng states: "Sometimes bwa flags read1 as unmapped, but for read2, the mate-unmapped flag is not set. This is a bug." And then later (comment #16): "It is the mate-unmap bit that can be mislabeled. The unmap bit is always set correctly."

i.e. you can guarantee that if the 0x4 bit is set it's correct, but the 0x8 bit may not be correct!

Check if you're running the latest version of BWA and see if that fixes the issue?

ADD REPLY
0
Entering edit mode

I was already using 0.6.2-r126 which is the most recent version.

Thanks for digging ever deeper. Nice work.

ADD REPLY
0
Entering edit mode
12.0 years ago

Have you tied using samtools fixmate?

An except from the samtools website (http://samtools.sourceforge.net/samtools.shtml).

fixmate     samtools fixmate <in.nameSrt.bam> <out.bam>
            Fill in mate coordinates, ISIZE and mate related flags from a name-sorted alignment.
ADD COMMENT
1
Entering edit mode

That should certainly be one way to fix the information. Still, why does it happen in the first place?

ADD REPLY
0
Entering edit mode

I suspect it's a combination of the way BWA is calculating the flags and the interpretation of (partial)alignments at the ends of chrM due to its circular nature but linear representation.

ADD REPLY
0
Entering edit mode

Yes, that explains the "un-mapping" of the "mapped" read and the update of its mapping info. But the mate-pair info in the mate is not getting updated properly. It's only updated in one of the two. Almost looks like a bug to me.

ADD REPLY
0
Entering edit mode

Can you supply an example of the alignment lines, as generated by BWA, of such a pair?

ADD REPLY
0
Entering edit mode

Added (see above). These come straight from BWA and only occur once in the BAM file. Note that, both are unaligned (as per flag) but only one is marked with "mate unmapped"

ADD REPLY
0
Entering edit mode
8.7 years ago

While dealing with TCGA WGS bam files I've come across the same problem with paired-end sequencing data.

If I use "samtools view -f 13 in.bam > out.bam" and then use "bam2fastx" there is one pair where only one of the reads is caught by this flag, the other isnt. This causes an error. The flags:

(Flag 69) - Not caught - R1 - read paired, read unmapped, first in pair

(Flag 141) - Caught - R2 - read paired, read unmapped, mate unmapped, second in pair

Alignments:

HWI-ST1222:5:1103:3409:129634#0 69 MT 16522 0 * = 16522 0 AGAGATGTGTTTAAGTGCTGTGGCCAGAAGCGGGGGTAGGGGGGGGTTTGG abbeeeecgggggiifghhdhhiiihgffdgghiigX]bccc_VaWOV^a_ RG:Z:120805_SN1222_0143_BD172BACXX_5 SM:Z:TCGA-BR-4188-11A-01D-1128

HWI-ST1222:5:1103:3409:129634#0 141 MT 16522 37 51M = 16522 0 TAAAGCATAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGAT _bbeeeeegggggihihiiiiihdhiiiiiiiiiiihifiidefcghihh] XT:A:U NM:i:2 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:6C41C2 RG:Z:120805_SN1222_0143_BD172BACXX_5 SM:Z:TCGA-BR-4188-11A-01D-1128

I'm trying "samtools fixmate", but as these bam files are very large I'm afraid this will not be feasible. I'll try the awk approach next.

ADD COMMENT

Login before adding your answer.

Traffic: 1570 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6