Question

What's causing ridiculous MD tags in my SAM files after aligning paired reads with BWA?

13.4 years ago

Joe Fass ▴ 180

I've aligned paired-end Illumina reads to the human genome (indexed with bwa index -a bwtsw, using the same version of BWA as for alignment). The library should contain lots of interchromosomal rearrangements, so I'm seeing the expected "[infer_isize] fail to infer insert size: weird pairing" message from bwa sampe. My alignments are ... odd. Some read pairs have no alignments, even though I can align them as single reads with no problem. And in pairs where both reads do align, the second read's alignment is always nonsensical:

SOLEXA1:7:100:1002:190#0        97      chr9    135276597       0       41M     chr5    5500811 0       CAGCTACTCAGGAGACTGAGGCTGGGGAATCGCTTGAACCC       BB=<@@)BABBB=B9=BB=AB=AB@'9:=94>9>==<>,==       XT:A:R  NM:i:1  SM:i:0  AM:i:0  X0:i:10 X1:i:518        XM:i:1  XO:i:0  XG:i:0  MD:Z:14G26
SOLEXA1:7:100:1002:190#0        145     chr5    5500811 0       61M     chr9    135276597       0       ATTAAAACAATTAAAAAAATAAAATTACAAATGGAAAGGACAAACCAGACCTTACAACTGT   B9:>BB>BB?>=BCBC@6@1?@?@26<BBA?BC@8<CCBBBCB;BCCB@BBA>BCCCBAB=   XT:A:R  NM:i:48 SM:i:0  AM:i:0  X0:i:10 X1:i:518        XM:i:1  XO:i:0  XG:i:0  MD:Z:0G0G0G0T0T0C1A0G0C0G0A0T0T0C0C0C0C0T0G0C0C0T0C0A0G0T1T0C0C2A0G0T2C0T0G0G0G2T1C1G0G1G0C1T0G1C0A0C0

... That's pretty obviously a wrong MD tag. All pairs that "align" (if you can call it that, with an MD string like that) to a single chromosome have non-zero mapping qualities and are labeled XT:A:U (unique alignments). But all pairs that align to two different chromosomes have zero MQs and are labeled XT:A:R (repeat alignments). This is new behavior ... but I haven't yet been able to find an older version of BWA that doesn't behave this way with these reads. The reads are in Illumina's fastq format (phred+64 quality chars), but this happens even if I convert them before alignment. So I'm at a bit of a loss as to why this is happening.
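One thing that can be tested directly is whether an MD tag is at least internally consistent with the record's CIGAR and NM fields. The sketch below is only an illustration: the helper names `md_summary` and `cigar_ref_bases` are made up here, and the MD grammar is the one given in the SAM optional-fields specification. It counts the mismatches and reference bases an MD string describes, so they can be compared with NM and with the CIGAR length. Agreement would suggest the tag faithfully describes a poor alignment; disagreement would point to a corrupt tag.

```python
import re

def md_summary(md):
    """Summarise an MD tag value (e.g. '14G26').

    Returns (mismatches, deleted_bases, ref_bases_covered).
    Assumes the SAM-spec grammar: [0-9]+(([A-Z]|\\^[A-Z]+)[0-9]+)*
    """
    tokens = re.findall(r"\d+|\^[A-Z]+|[A-Z]", md)
    if "".join(tokens) != md:
        raise ValueError("unparseable MD tag: %r" % md)
    matches = mismatches = deleted = 0
    for tok in tokens:
        if tok.isdigit():
            matches += int(tok)        # run of matching bases
        elif tok.startswith("^"):
            deleted += len(tok) - 1    # reference bases deleted from the read
        else:
            mismatches += 1            # one mismatching reference base
    return mismatches, deleted, matches + mismatches + deleted

def cigar_ref_bases(cigar):
    """Reference bases that MD should describe: CIGAR ops M, =, X and D."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "M=XD")

# MD tags copied from the two records shown above.
md_read1 = "14G26"
md_read2 = ("0G0G0G0T0T0C1A0G0C0G0A0T0T0C0C0C0C0T0G0C0C0T0C0A0G0T1T0C0C"
            "2A0G0T2C0T0G0G0G2T1C1G0G1G0C1T0G1C0A0C0")

print(md_summary(md_read1), cigar_ref_bases("41M"))  # (1, 0, 41) 41
print(md_summary(md_read2), cigar_ref_bases("61M"))  # (48, 0, 61) 61
```

For both records the MD-derived mismatch counts equal the NM values (1 and 48) and the covered lengths equal the CIGAR lengths (41 and 61), so by this check the tags are self-consistent even though the second alignment is clearly poor.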

Has anyone else ever seen behavior like this? I'm looking for any clues; anything I can test to figure this out. Thanks in advance ...

EDIT 1: I can BLAT the reads on UCSC and see that they map F/R, with an isize of ~150 bp (in one case). The SAM records, however, place them more than overlapping: the first read is at the correct position (and orientation), but the second read is placed right on top of the first, so that its left edge (on the reference strand) starts before the left edge of the first read. The isize is listed as -5 (1st read) and 5 (2nd read). Could this be a result of the failure to infer insert size? If so, is there any way to set it manually (as there used to be for maq)?

EDIT 2: Nope - bad alignments and ridiculous MD tags are the same after running 'bwa sampe -a 500 ...'

bwa sam samtools • 6.1k views

13.3 years ago by Joe Fass ▴ 180

Thanks Pierre -- how'd you make the text box ... like this? example output