I am trying to use BWA to align my NGS data to a reference genome. My samples may carry very long deletions in specific regions of the genome. I wonder if BWA or its associated tools can detect such long deletions. Are there any parameters that I have to tune? Or is there a limit to the deletion length BWA can detect?
Thank you for replying to my question. I may need to clarify this a bit. In general, when we align reads to a reference using BWA, it outputs a CIGAR string for each alignment in the result (SAM file). This string indicates any deletions in the aligned read. I am wondering: if my reads are long enough, for example 2 x 300 bp paired-end, will BWA be able to detect and report a deletion of about 100 bp or longer? I don't want to make variant calls or involve other tools, just to understand BWA's capabilities.
A 100 bp deletion is not that long. You can pair BWA with appropriate callers to get variants of that size. So as to your question about "capability": yes, BWA is able to do so. If you are asking whether BWA has a bias against longer deletions, sure it does, because of gap extension penalties.
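To put rough numbers on the gap-penalty bias, here is a back-of-the-envelope sketch. It assumes the BWA-MEM default scoring (match 1 via -A, gap open 6 via -O, gap extension 1 via -E, so a gap of length k costs O + k*E) and a read that otherwise matches perfectly. It ignores the clipping penalty, mismatches, and supplementary alignments, so treat it as an illustration rather than a prediction of what BWA will output. The function names are made up for this example.

```python
# Assumed BWA-MEM default scoring: match=1 (-A), gap open=6 (-O), gap extension=1 (-E).
MATCH = 1
GAP_OPEN = 6
GAP_EXTEND = 1


def gap_penalty(length, gap_open=GAP_OPEN, gap_extend=GAP_EXTEND):
    """Affine penalty for a gap of `length` bases: open + length * extend."""
    return gap_open + length * gap_extend


def gapped_score(read_len, del_len):
    """Score of a perfectly matching read aligned end to end across one deletion."""
    return read_len * MATCH - gap_penalty(del_len)


def clipped_score(read_len, del_offset):
    """Score if only the longer side of the breakpoint aligns and the rest is soft-clipped."""
    return max(del_offset, read_len - del_offset) * MATCH


read_len = 300
breakpoint_offset = 150  # deletion falls in the middle of the read
for del_len in (10, 100, 200):
    print(del_len,
          gapped_score(read_len, del_len),
          clipped_score(read_len, breakpoint_offset))
# 10 284 150
# 100 194 150
# 200 94 150
```

Under these assumptions a 100 bp deletion in the middle of a 300 bp read still scores better as a single gapped alignment (194) than as a clipped one (150), while a 200 bp deletion does not. Separately, the band width option (-w, default 100) limits how long a gap BWA-MEM will look for, so it may need raising for deletions longer than about 100 bp.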
A variant caller infers variants by comparing the alignments to the reference sequence, so you could still detect longer deletions that way.
However, to answer your other question directly, you probably won't see these pop up in the CIGAR string, since a long deletion looks to the aligner much like an intron. To my knowledge, BWA is not optimized for this type of alignment, which is one reason we don't usually see it used for RNA-seq alignments.
Instead, because BWA is a local aligner, when a read spans a long deletion it is likely that only part of the read will align and the rest will be soft-clipped. You may be able to tweak the parameters to capture a longer deletion within the CIGAR string, but there's probably an upper limit to this. I think 100 bp may be doable with 300 bp reads, but I've never looked into this.
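If you want to check which of the two outcomes you actually got, you can inspect the CIGAR column of the SAM file directly. The following is a minimal sketch that parses a CIGAR string and reports long deletions (D operations) and soft-clipped bases (S operations). The two example CIGAR strings and the 50 bp threshold are invented for illustration, not taken from real BWA output.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def summarize(cigar, min_del=50):
    """Report deletions of at least `min_del` bases and the total soft-clipped length."""
    ops = parse_cigar(cigar)
    long_dels = [n for n, op in ops if op == "D" and n >= min_del]
    soft_clipped = sum(n for n, op in ops if op == "S")
    return {"long_deletions": long_dels, "soft_clipped": soft_clipped}


# Hypothetical read aligned straight through a 100 bp deletion.
print(summarize("140M100D160M"))  # {'long_deletions': [100], 'soft_clipped': 0}
# Hypothetical read where only the left part aligned and the rest was clipped.
print(summarize("140M160S"))      # {'long_deletions': [], 'soft_clipped': 160}
```

A read that shows up as the second case often has a supplementary alignment on the other side of the deletion, which is what split-read callers use.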
At least in theory, you could consider STAR, which may map these reads more accurately and report the long gaps it spans as junctions in the CIGAR string.
When you expect long deletions in NGS data, I recommend BBMap; it's specifically designed for large deletions in short reads. The defaults are usually fine, but in this case you may want to add the flag "maxindel=800k" to allow alignment across deletion events of up to 800,000 bp, which will work fine with 300 bp reads. Its accompanying variant caller (CallVariants) is also designed to call indels directly from alignments rather than from inference.
minimap2 may be uniquely qualified to do this, particularly when aligning long reads. Quote from the minimap2 paper: "Now minimap2 v2.22 can more accurately map long reads to highly repetitive regions and align through insertions or deletions up to 100kb by default".
Note: if you are using e.g. paired-end short reads, it's likely you would instead be looking for a longer-than-expected distance between the two reads of a pair as the signal for large deletions.
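As a rough sketch of that paired-end signal: take the template length (TLEN, column 9 of the SAM file) of properly oriented pairs and flag those far above the typical insert size. The cutoff used here (median plus 5 median absolute deviations) and the TLEN values are assumptions chosen for illustration; real structural variant callers use more careful models.

```python
from statistics import median


def flag_long_inserts(tlens, n_mads=5.0):
    """Return (cutoff, outliers) for insert sizes far above the median.

    `tlens` are SAM TLEN values; zeros (unavailable) are ignored and the sign is dropped.
    """
    sizes = [abs(t) for t in tlens if t != 0]
    med = median(sizes)
    # Median absolute deviation: a spread estimate that the outliers themselves barely affect.
    mad = median(abs(s - med) for s in sizes)
    cutoff = med + n_mads * max(mad, 1)
    return cutoff, [s for s in sizes if s > cutoff]


# Invented TLEN values: a ~500 bp library plus two pairs spanning a ~5 kb deletion.
tlens = [480, 510, 495, 505, 520, 490, 500, 5600, 5550]
cutoff, outliers = flag_long_inserts(tlens)
print(cutoff, outliers)  # 580.0 [5600, 5550]
```

Pairs flagged this way only suggest a deletion between the mates; the apparent extra distance approximates the deletion length.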