Is there a way to tell bwa aln not to soft-clip reads when it tries to align them?
We see that many soft-clipped reads result in false-positive variant calls. In the attached IGV snapshot (top track), you see an example of a (PCR-duplicated) read that is mapped with 4 substitutions, 1 deletion and 2 insertions. We think the resulting variants are not real but arise because the read should not have been mapped there (it happens in all 20 samples!).
When we look into the BAM file we see that the read was soft-clipped dramatically (CIGAR 1S18M1D3M1I1M2I11M114S, meaning 115 out of 151 bases were soft-clipped). When I remove all reads whose CIGAR string includes S (soft clipping) from the BAM file, all the wrongly mapped reads disappear (see IGV snapshot, bottom track).
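For reference, the 115-out-of-151 figure can be checked directly from the CIGAR string. This is only a small sketch in plain Python; the function name `cigar_summary` is made up for illustration and is not part of bwa or samtools.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_summary(cigar):
    """Return (soft_clipped_bases, read_length) for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    soft = sum(n for n, op in ops if op == "S")
    # M, I, S, = and X consume bases of the read; D, N, H and P do not.
    read_len = sum(n for n, op in ops if op in "MIS=X")
    return soft, read_len

print(cigar_summary("1S18M1D3M1I1M2I11M114S"))  # (115, 151)
```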
We believe that, at least for our specific study design and questions, we do not want bwa to soft-clip reads when it tries to align them. Is there a way to prevent this? Or is there a way to remove soft-clipped reads from a BAM file (including updating related information, such as the SAM flags of the paired read, to the correct values)?
Exactly what I was looking for, thanks.
Disabling soft clipping is fine for genomic alignment when you have more coverage than you need, but normally you should not disable it. One thing you can do is write a script that filters out reads with a low ratio of aligned bases to total read length. You can set the threshold to 0.75 and see if it works for you. Also, I don't think the soft-clipped region is used for variant calling at all. I assume whatever you are doing is for visualization purposes.
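A rough sketch of such a filter, in plain Python working on SAM text lines, might look like the following. The names `aligned_fraction` and `filter_sam_lines` are hypothetical, and counting only M/=/X operations as "aligned" is an assumption (insertions could arguably be counted too). Note that it drops reads one at a time and does not touch the mate or update any flags, so paired-end bookkeeping would still need separate handling.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def aligned_fraction(cigar):
    """Fraction of read bases in aligned (M, =, X) CIGAR operations."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Assumption: only M, = and X count as aligned bases.
    aligned = sum(n for n, op in ops if op in "M=X")
    # Read length as stored in SEQ: operations that consume the query.
    read_len = sum(n for n, op in ops if op in "MIS=X")
    return aligned / read_len if read_len else 0.0

def filter_sam_lines(lines, min_fraction=0.75):
    """Yield header lines and reads whose aligned fraction is high enough."""
    for line in lines:
        if line.startswith("@"):
            yield line  # keep SAM header lines untouched
            continue
        cigar = line.rstrip("\n").split("\t")[5]  # column 6 is CIGAR
        # Reads without a CIGAR ("*") are passed through unchanged.
        if cigar == "*" or aligned_fraction(cigar) >= min_fraction:
            yield line

# Tiny made-up example: one fully aligned read, one heavily clipped read.
demo = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "r1\t0\tchr1\t100\t60\t151M\t*\t0\t0\t*\t*\n",
    "r2\t0\tchr1\t100\t37\t1S18M1D3M1I1M2I11M114S\t*\t0\t0\t*\t*\n",
]
kept = list(filter_sam_lines(demo))
print([l.split("\t")[0] for l in kept if not l.startswith("@")])  # ['r1']
```

To run this on a real file, the same generator could be fed from `sys.stdin` and written to `sys.stdout`, with `samtools view -h` producing the SAM text on one side and `samtools view -b` converting it back to BAM on the other.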