Our lab has been using a pipeline with bwa aln for consistency with the mappings that were created over the past 10+ years. All of the fastq files that we've processed until now had paired-end reads, and in all cases, bwa aln followed by bwa samp (with the input arguments and parameters listed below) generated mappings with gaps that could be called as indels with GATK.
bwa aln -t 8 Ref.fasta input_1_basefiltered.fastq > input_1.sai
bwa aln -t 8 Ref.fasta input_1_basefiltered.fastq > input_2.sai
bwa sampe -P Ref.fasta input_1.sai input_2.sai > output.sam
However, when running essentially the same commands (apart from the obvious single call of bwa aln rather than two) for an input fastq with single-end reads, i.e.
bwa aln -t 8 Ref.fasta single_end_input_basefiltered.fastq > single_end_input.sai
bwa samse -P Ref.fasta single_end_input_1.sai input.sai > single_end_output.sam
The resulting bam file has "complete" alignments with no gaps, so that indel calling is impossible. Surely this is an artifact of the parameterization.
My two questions are:
- Why would the same (default) parameters of bwa aln and bwa sams generate mappings with gaps for paired-end read fastq inputs but not for single-end reads?
- Because I need to identify indels when comparing genomes, what are my best options to correct this problem? I see from the documentation of bwa aln
https://bio-bwa.sourceforge.net/bwa.shtml
that the default open gap penalty is 11 while the default gap extension penalty is 4 (which are what I used without issue with my paired-end data). What would be some sensible values to try to reduce these penalties without sacrificing alignment quality too drastically?
Any suggestions would be appreciated.
So you are using same old version of
bwa
for 10+ years for all data?A side note. NGS aligners are non-deterministic so the result is not identical unless a seed used or the aligner has an option to generate deterministic results.
Yes, as I stated, we want to avoid introducing differences between recent alignments and data sets that we created years ago that are due to different alignment tools.
I'm aware of the stochastic component in alignment algorithms, but re-running bwa with a different seed doesn't change anything in the patterns that I describe. My question was why bwa aln generated gap-free mappings for single end reads whereas this was never an issue for paired-end reads with the same parameters.