Hi everyone,
I have a question regarding BWA-MEM alignment.
I wanted to find out what performance gain I would get from creating an artificial reference in which everything outside my panel's regions of interest is masked.
In short, I ran BLAT on all of my panel's regions of interest and created a masked FASTA file of human hg19, replacing with N any region that did not have some degree of sequence similarity to my regions of interest.
This basically masked around 90% of the human genome.
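To illustrate the masking step, here is a minimal sketch, not my exact script. The function name `mask_outside` and the 0-based, half-open interval convention are assumptions made for this example; in practice the intervals would come from the BLAT hits, one list per chromosome.

```python
def mask_outside(seq, keep):
    """Return seq with every base outside the intervals in `keep` replaced by 'N'.

    `keep` is a list of (start, end) pairs, 0-based and half-open
    (an assumed convention for this sketch).
    """
    masked = ["N"] * len(seq)          # start fully masked
    for start, end in keep:
        start = max(start, 0)          # clamp to the sequence bounds
        end = min(end, len(seq))
        if start < end:
            masked[start:end] = seq[start:end]   # restore the kept bases
    return "".join(masked)


# Toy example: keep only positions 2..4 of a 10-base sequence.
print(mask_outside("ACGTACGTAC", [(2, 5)]))  # NNGTANNNNN
```

Note that the output has the same length as the input: masking changes the bases, not the coordinates.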
I then created a BWA index from that masked FASTA file and aligned my samples (which I had previously aligned against the normal human reference) to see whether there were any performance gains in terms of alignment time.
To my surprise, the alignment against this much smaller reference (same length as the human genome, but with over 90% of bases masked) was slower than the normal alignment against the hg19 assembly.
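For anyone wanting to repeat the comparison, the timing can be sketched as below. This is only an illustration under assumptions: `time_command` is a hypothetical helper, and the file names in the commented-out `bwa mem` calls are placeholders, not my actual files. Both references would need to be indexed with `bwa index` beforehand.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run `cmd` (a list of arguments), discard its stdout, return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


# Portable self-check that does not need bwa installed.
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"no-op command took {elapsed:.3f} s")

# Assumed usage for the actual comparison (placeholder file names):
# t_full = time_command(["bwa", "mem", "-t", "4", "hg19.fa", "sample_R1.fq", "sample_R2.fq"])
# t_mask = time_command(["bwa", "mem", "-t", "4", "hg19_masked.fa", "sample_R1.fq", "sample_R2.fq"])
# print(t_full, t_mask)
```

The SAM output is discarded here so that writing to disk does not affect the timing; use the same thread count and the same input reads for both runs.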
Can someone tell me why this is?
Many thanks
Duarte
Not an answer to your question but:
That's generally not a good idea. Doing so would effectively 'force' bwa to align reads to the target that may not actually belong there. Note that an aligner searches for a reasonable "best possible match", which is not necessarily the real location.
For example, off-target reads and reads from pseudogenes could get mapped to your real target, obscuring the 'real' results.
I know this is not considered good practice, but I was interested in looking at it anyway.
As you can see in my post, I did consider the overfitting of reads to my targets. That is why I ran BLAT to find every partial match to my targets, expanded the length of each match, and kept those regions unmasked as well.
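The expansion of the partial matches can be sketched roughly as follows. This is a simplified illustration rather than my actual code: `pad_and_merge`, the padding value, and the 0-based half-open coordinates are all assumptions for the example.

```python
def pad_and_merge(intervals, pad, length):
    """Extend each (start, end) interval by `pad` bases on both sides,
    clamp to [0, length], and merge intervals that overlap or touch."""
    padded = sorted(
        (max(start - pad, 0), min(end + pad, length)) for start, end in intervals
    )
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy example: three hits on a 1120-base sequence, padded by 50 bases.
hits = [(100, 200), (250, 300), (1000, 1100)]
print(pad_and_merge(hits, pad=50, length=1120))  # [(50, 350), (950, 1120)]
```

The merged intervals are the regions left unmasked; everything else becomes N.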
I do not intend to use this as standard practice. I was just curious about the performance gain we would get from what I assumed would be a much faster alignment step.
My question comes from the fact that it isn't faster... in fact it is slower than the alignment against a full (unmasked) assembly.
I want to understand why this is so, since in my mind the index should be much smaller, and alignment faster, for a genome that consists mostly of Ns.