I'm aligning 75-mer paired end reads from a Illumina HiSeq RNA-seq run using BWA 0.6.0, mapped against a custom transcriptome. I have a read which spans two exons. The mystery: if I build the reference in two different ways, both of which have a perfect match for the read, I get MAPQ scores of 7 and 29. The low score comes from an alignment against exon junctions of length 150; the high score comes when I align against consensus transcripts that include all exons. In the low score read the read's pair does not align to the same junction fasta (it has a perfect match elsewhere) while in the high score the read pair matches elsewhere in the same consensus sequence.
Why would the same read have two different scores when there's a perfect match both times?
I think there's a hint in the "template-independent mapping quality" flag; the low score alignment has SM:i:0 AM:i:0 while the high-score has SM:i:29 AM:i:29. If that's the explanation, can someone explain what these flags mean?
The low-scoring and high-scoring reads follow, broken into three lines for easier reading:
HS20_6661:1:1308:2103:61534#6 163 my_junction 70 7 75M
my_pair_name 53 0 CAGAGATGGGCAAGTCCTGGGCCGACGGTGCTTTGAGGCCCGGATCTGTGCTTGCCCAGGAAGAGACCGGAAGGC BCCFFFFFHGHHGIHIIJJJIJJJJIJIFHGGIIJIJJJJIJJGGBBDCEECCCDEDCBBD?AA55<8;@@>BD?
X0:i:2 X1:i:0 XA:Z:my_junction,+70,75M,0;my_pair_name,+70,75M,0; MD:Z:75 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
HS20_6661:1:1308:2103:61534#6 163 my_consensus 1698 29 75M = 1807 195
CAGAGATGGGCAAGTCCTGGGCCGACGGTGCTTTGAGGCCCGGATCTGTGCTTGCCCAGGAAGAGACCGGAAGGC
BCCFFFFFHGHHGIHIIJJJIJJJJIJIFHGGIIJIJJJJIJJGGBBDCEECCCDEDCBBD?AA55<8;@@>BD? XT:A:U
NM:i:0 SM:i:29 AM:i:29 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:75 XA:Z:my_consensus,+1698,75M,0;
The low-scoring junction sequence:
>my_junction
CATGTGTAACAGCAGCTGCGTCGGAGGAATGAACAGACGTCCAATTTTAATCATCGTTACTCTGGAAA
CCAGAGATGGGCAAGTCCTGGGCCGACGGTGCTTTGAGGCCCGGATCTGTGCTTGCCCAGGAAGAGACCGGAAGGC
AGATGA
I think maybe it has to do with how BWA calculates pair end mapping quality. What would be the insert size difference between mapping to junction vs mapping to whole transcript?
The name of the paired match is different for the poorer score - maybe it thinks it's on a different chromosome?
That's the issue to which I refer in my comment to DK's answer; my current formulation of "map reads to all exon junctions" puts each junction in a separate FASTA and then builds an index. BWA doesn't know the distance between these separate FASTAs, so pairs which map to separate junctions or where the mate is not in a junction get poor MAPQ scores.
What is the insert size distribution bwa was giving? You can find from the bwa error output (stderr)?
For the "junction probe" reference: (25, 50, 75) percentile: (712, 1293, 2422) low and high boundaries: 75 and 5842 for estimating avg and std inferred external isize from 14157 pairs: 1479.721 +/- 1111.644 skewness: 1.566; kurtosis: 2.139; ap_prior: 7.24e-04 inferred maximum insert size: 8416 (6.24 sigma)
Compared to the "consensus exons" reference where: (25, 50, 75) percentile: (187, 214, 238) That looks better; clearly the insert sizes are off for the junction reference. I don't blame BWA for this; it would need to know the absolute location of the junctions.