I am using bwasw to align reads from a MiSeq run to a very short reference sequence. For some reason, bwa is soft-clipping (assigning S in the cigar string) the leading and trailing bases of many or most of my reads. (All of my reads seem to get the first base soft-clipped, after that there is a distribution of soft clipping that drops off toward the center of the read).
This is occurring despite the fact that the bases which are getting soft clipped (at least the ones I checked by scanning the sam) are EXACT MATCHES to the reference. They are not consistently surrounded by mismatches (they occur even in reads with an edit distance of 0), nor are they low-quality (prior to assembly, I ran a very stringent quality filter, so every base in every one of my reads is phred 36 or higher).
Possibly pertinent: all of my reads are the same length as my reference +/- 2 nts. (This is a ribozyme population deep sequencing project).
Does anyone know why bwa might be acting this way or how to get around it? Here's my bwa command:
bwa bwasw -M -NM myreference.fasta myfiltered.fastq > myoutput.sam
Alternatively, is there a way to tell samtools to ignore soft-clipping? Or a tool to convert the cigar strings to M/I/D only? The soft-clipped bases are artificially reducing the coverage in my mpileup, and I am worried how they will affect other downstream modules I want to use.
Thanks for any insight!
cross posted at SeqAnswers
It is an interesting problem, perhaps has to do with the unusual circumstance of reads being as long as the reference. Obviously If the read is longer than the reference then soft clipping is ok.
Can you edit your post and add your reference and a few SAM alignments that are soft clipped?
I wonder what the
-M -NM
options are for? I couldn't find them in the bwasw manual segment for my (possibly outdated) BWA 0.7.5... Are they a recent addition, or something undocumented?Might not be your case, but at least novoalign has an upper limit to the aligned read length. Last time I checked it was ~300 bp, and I think beyond that length it was soft-clipping the reads. I'm not entirely sure, though.