What'S Causing Ridiculous Md Tags In My Sam Files, After Aligning Paired Reads With Bwa?
2
4
Entering edit mode
13.4 years ago
Joe Fass ▴ 180

I've aligned paired-end Illumina reads to the human genome (indexed with bwa index -i bwtsw, using same version of BWA as for alignment). The library should contain lots of interchromosomal rearrangements, so I'm seeing the expected "[infer_isize] fail to infer insert size: weird pairing" message from bwa sampe. My alignments are ... odd. Some read pairs have no alignments, even though I can align them as single reads with no problem. And, the second read in each pair, in pairs that do both align, is always non-sensical:

SOLEXA1:7:100:1002:190#0        97      chr9    135276597       0       41M     chr5    5500811 0       CAGCTACTCAGGAGACTGAGGCTGGGGAATCGCTTGAACCC       BB=<@@)BABBB=B9=BB=AB=AB@'9:=94>9>==<>,==       XT:A:R  NM:i:1  SM:i:0  AM:i:0  X0:i:10 X1:i:518        XM:i:1  XO:i:0  XG:i:0  MD:Z:14G26
SOLEXA1:7:100:1002:190#0        145     chr5    5500811 0       61M     chr9    135276597       0       ATTAAAACAATTAAAAAAATAAAATTACAAATGGAAAGGACAAACCAGACCTTACAACTGT   B9:>BB>BB?>=BCBC@6@1?@?@26<BBA?BC@8<CCBBBCB;BCCB@BBA>BCCCBAB=   XT:A:R  NM:i:48 SM:i:0  AM:i:0  X0:i:10 X1:i:518        XM:i:1  XO:i:0  XG:i:0  MD:Z:0G0G0G0T0T0C1A0G0C0G0A0T0T0C0C0C0C0T0G0C0C0T0C0A0G0T1T0C0C2A0G0T2C0T0G0G0G2T1C1G0G1G0C1T0G1C0A0C0

... That's pretty obviously a wrong MD tag. All pairs that "align" (if you can call it that, with an MD string like that) to a single chromosome have non-zero mapping qualities, and are labeled XT:A:U (unique alignments). But all pairs that align to two different chromosomes have zero MQ's, and are labeled XT:A:R (repeat alignments). This is new behavior ... but I haven't yet been able to find the older version of BWA that doesn't behave this way with these reads. The reads are in Illumina's fastq format (phred+64 quality chars), but this happens even if I convert them before alignment. So I'm at a bit of a loss as to why this is happening.

Has anyone else ever seen behavior like this? I'm looking for any clues; anything I can test to figure this out. Thanks in advance ...

EDIT 1: I can BLAT on UCSC and see that they map F/R, with an isize ~150 bp (in one case) ... but the SAM records place them more than overlapping ... the first read is at the correct position (and orientation), but the second read is placed right on top of the first, so that its left edge (on the reference strand) starts before the left edge of the first read. The isize is listed as -5 (1st read) and 5 (2nd read). Could this be a result of the failure to infer insert size? If so, is there any way to set it manually (as there used to be for maq)?

EDIT 2: Nope - bad alignments and ridiculous MD tags are the same after running 'bwa sampe -a 500 ...'

bwa sam samtools • 6.1k views
ADD COMMENT
0
Entering edit mode

Thanks Pierre -- how'd you make the text box ... like this?[?]example output[?]

ADD REPLY
0
Entering edit mode

nope ... [?]try again?[?]

ADD REPLY
0
Entering edit mode

how about this: [?][?]final attempt[?][?]

ADD REPLY
0
Entering edit mode

ohhhhh ... tabs? how about this? no?

ADD REPLY
0
Entering edit mode

Must be an "answer"-only feature ... or question ...

ADD REPLY
0
Entering edit mode

I've been trying to see if there are any clues in the SAM record...what were the settings you used to run BWA? Generally, alignments with more than a few substitutions are rejected. Seeing an alignment with 48 edits is really unusual. I'm not sure I've ever seen that many before when running BWA.

ADD REPLY
1
Entering edit mode
13.4 years ago
Swbarnes2 ★ 1.6k

For what it's worth, I saw that once, when I pointed the sampe command to the wrong fasta, or the wrong fastq, something silly like that. So I'd double check that you didn't mistype the command, or mix up your references or file names.

Edit:

I think is confused; its looking at read 2 thinking its supposed to be read 1. I took the MD, and you apply it to your read two, it becomes the rev comp of your read 1.

gggttc1agcgattcccctgcctcagt1tcc22agt22ctggg22t1c1gg1gc1tg1cac  read 2 MD
attaaaacaattaaaaaaataaaattacaaatggaaaggacaaaccagaccttacaactgt  read 2
gggttcAagcgattcccctgcctcagtCtccTGagtAGctgggACtagAggTgcCtgCcac  converted read 2
GGGTTCAAGCGATTCCCCAGCCTCAGTCTCCTGAGTAGCTG                     read 1 rev comp
ADD COMMENT
0
Entering edit mode

yah ... triple- and quadruple-checked those things. Also re-indexed reference with exact same version of BWA that I'm using for the aln and sampe steps. Baffling ...

ADD REPLY
0
Entering edit mode
13.4 years ago
Joe Fass ▴ 180

Argh, argh, argh. Thank you so much swbarnes2! I was checking the wrong set of commands, and indeed, I'd aligned the forward reads for both the forward.sai and reverse.sai results, but not the sampe command. So sampe was reading the correct reverse reads from their fastq file, but was using the alignment found for read1 again (explains why my insert sizes were zero except for pairs mapped to two different chromosomes, which were all repeats, because how else would the same read get mapped to two chromosomes!). Sorry about the stupid mistake and false alarm. I could gripe about why sampe doesn't shout that the fastq id's don't match the read id's that should be in the .sai files ... but I should really just be more careful.

ADD COMMENT
0
Entering edit mode

I'm not sure sampe could differentiate reads from _1 and _2 files from an Illumina paired-end run, since my understanding is that the fastq seqs from both ends of the same fragment have the same ID in both files.

ADD REPLY

Login before adding your answer.

Traffic: 1672 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6