Question

BWA mem against a heavily masked reference index takes more time than against the normal reference?

1

6.4 years ago

Duarte Molha ▴ 240

Hi everyone

I have a question regarding BWA mem alignment

I was interested in finding out what performance gains I would obtain from creating an artificial reference where I masked everything I was not interested in for my panel.

In short, I ran BLAT on all the regions of interest in my panel and created a masked FASTA file of human hg19 in which I replaced with N any region that did not have some degree of sequence similarity to my regions of interest.

This basically masked around 90% of the human genome.
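As a rough illustration of the masking step described above, here is a minimal Python sketch. The function names (`mask_outside`, `masked_fraction`) and the toy sequence are made up for this example, and the intervals are assumed to be 0-based and half-open; it is not the script used in the post. It shows that masking replaces bases with N without changing the sequence length.

```python
def mask_outside(seq, keep):
    """Return seq with every base outside the intervals in `keep` replaced by N.

    `keep` is a list of (start, end) pairs, 0-based and half-open (assumed convention).
    """
    masked = ["N"] * len(seq)
    for start, end in keep:
        start = max(start, 0)
        end = min(end, len(seq))
        if start < end:
            # Copy the original bases back for the region we want to keep.
            masked[start:end] = seq[start:end]
    return "".join(masked)


def masked_fraction(seq):
    """Fraction of positions that are N (0.0 for an empty sequence)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


# Toy 20 bp "chromosome" with one 2 bp region of interest.
toy = "ACGTACGTACGTACGTACGT"
masked = mask_outside(toy, [(4, 6)])
print(masked)
print(len(masked) == len(toy), masked_fraction(masked))
```

In this toy case 18 of the 20 bases become N, so 90% is masked while the length stays at 20.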

I then created a BWA index from that masked FASTA file and aligned my samples (which I had already aligned against the normal human reference) to see if there were any performance gains in terms of the time it took to align them.
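For anyone wanting to repeat this comparison, a minimal timing sketch in Python follows. The helper names, the thread count and the file names in the comments are hypothetical, and both references are assumed to have been indexed with `bwa index` beforehand. The demo call at the end times a trivial Python command as a stand-in, so the snippet runs even without BWA installed.

```python
import subprocess
import sys
import time


def bwa_mem_cmd(ref, reads, threads=4):
    """Build a `bwa mem` command line (single-end reads; paths are placeholders)."""
    return ["bwa", "mem", "-t", str(threads), ref, reads]


def time_command(cmd):
    """Run cmd, discard its stdout, and return the wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


# Stand-in command so the example is runnable anywhere.
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"stand-in command took {elapsed:.3f} s")

# With BWA installed and both references indexed (hypothetical file names):
# t_full = time_command(bwa_mem_cmd("hg19.fa", "sample.fq"))
# t_masked = time_command(bwa_mem_cmd("hg19.masked.fa", "sample.fq"))
```

Using the same reads, thread count and machine for both runs keeps the comparison fair.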

To my surprise, the alignment against this much smaller reference (same length as the human genome but with over 90% of bases masked) was slower than the normal alignment against the hg19 assembly.

Can someone tell me why this is?

Many thanks

Duarte

alignment bwa • 2.9k views

ADD COMMENT • link updated 6.4 years ago by karl.stamm 4.1k • written 6.4 years ago by Duarte Molha ▴ 240

2

Not an answer to your question but:

"creating an artificial reference where I masked everything I was not interested in for my panel."

That's generally not a good idea. In doing so you would 'force' BWA to align reads to the target that may not actually belong there. Note that an aligner will search for a reasonable "best possible match", which is not necessarily the real location.

For example, off-target reads and reads from pseudogenes could get mapped to your real target, obscuring the 'real' results.

ADD REPLY • link 6.4 years ago by WouterDeCoster 47k

1

I know this is not considered good practice. But I was interested in looking at it anyway.

As you can see in my post, I did consider the overfitting of the reads to my targets. This is why I ran BLAT to find every partial match to my targets, extended each match and kept those regions unmasked as well.

I do not intend to use this as a standard practice. I was just curious about the performance gains we would get from what I assumed would be a much faster alignment step.

My question comes from the fact that it isn't faster... in fact it is slower than the alignment against a full (unmasked) assembly.

I wanted to understand why this would be so, as in my mind the index should be much smaller, and alignment faster, for a genome that consists mostly of "N"s.

ADD REPLY • link 6.4 years ago by Duarte Molha ▴ 240

score 3 · Answer 1 · 2018-07-13

3

6.4 years ago

karl.stamm 4.1k

The masking doesn't make the reference genome smaller; it has the same length in base pairs.
What masking does, however, is let your reads possibly align in many more places. Every masked section could be a match to either end of your read. So rather than using BWA's intelligent "seed" technique to limit alignment time, you've forced it to look in many more places for candidate sites.

I have seen this problem with GATK's HaplotypeCaller when looking near the centromere's run of many AAAA. The thing gets stuck in a loop trying millions of possible alignments for every read.

ADD COMMENT • link 6.4 years ago by karl.stamm 4.1k

0

I think I understand your point... but why would BWA try to align any read against a string of Ns? I can understand it trying to align a read against a homopolymer region and having some difficulty... but a homopolymer such as AAAA is a valid sequence against which you can try to find an alignment for a query sequence, whereas a string of Ns isn't.

In fact, hg19 has the PAR regions on chrY purposefully masked so that the aligner places reads from the PAR only on chrX, for this very reason... does it not?

ADD REPLY • link 6.4 years ago by Duarte Molha ▴ 240