I apologize if this question has been posted or answered earlier, but I could not find a reasonable explanation, hence I am posting it here. Can anybody please help me understand how bwa mem decides whether a read is hard clipped or soft clipped? I mean, what are the criteria bwa mem uses to differentiate between the two?
When I started NGS data processing, it took me quite a while to understand this. 'bwa mem' is a local aligner, so it may align only part of a read and clip the rest. Soft clipped alignments (CIGAR operation S) have the full sequence of the read reported in column SEQ, while hard clipped alignments (CIGAR operation H) only have the part which actually aligns to the ref sequence. This is kind of a performance optimization, but with some unpleasant side effects.
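To make the S versus H distinction concrete, here is a minimal Python sketch that reads a CIGAR string and reports how many bases are soft and hard clipped, plus the SEQ length the CIGAR implies. The function names are my own, not part of bwa or samtools, and it assumes a well-formed CIGAR as described in the SAM spec.

```python
import re

# One CIGAR element: a length followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def clipped_bases(cigar):
    """Return (soft, hard) clipped base counts for a CIGAR string."""
    soft = hard = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op == "S":
            soft += int(length)
        elif op == "H":
            hard += int(length)
    return soft, hard

def expected_seq_length(cigar):
    """Length of SEQ implied by the CIGAR.

    M, I, S, = and X consume bases stored in SEQ; H does not,
    because hard clipped bases are dropped from the record.
    """
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MIS=X")

print(clipped_bases("30S70M"), expected_seq_length("30S70M"))
print(clipped_bases("30H70M"), expected_seq_length("30H70M"))
```

For a 100 bp read, the soft clipped record still carries all 100 bases in SEQ, while the hard clipped one carries only the 70 aligned bases.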
Usually the first alignment of a read (the primary one) is soft clipped (dumped into the SAM file with its full sequence), while any further alignments (supplementary ones, flag 0x800) are hard clipped. Often you will reorder the SAM file with 'samtools sort'. After this operation it is no longer guaranteed that the first occurrence of a read has the full sequence.
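If you need the full read sequence from a sorted file, you cannot rely on record order. Below is a rough sketch of one workaround: scan the SAM lines and keep, per read, a record whose CIGAR has no H operation. The function name and the toy records are made up for illustration, and it assumes plain tab-separated SAM text rather than BAM.

```python
def full_length_records(sam_lines):
    """Map (QNAME, mate bits) to the fields of a record without hard clipping."""
    found = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        qname, flag, cigar = fields[0], int(fields[1]), fields[5]
        # 0x40 / 0x80 tell read 1 from read 2 of a pair sharing one QNAME.
        key = (qname, flag & 0xC0)
        if "H" not in cigar and key not in found:
            found[key] = fields
    return found

# Toy coordinate-sorted input: the hard clipped supplementary record
# (flag 2048 = 0x800) now comes before the soft clipped primary one.
sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t2048\tchr1\t100\t60\t30H70M\t*\t0\t0\t" + "A" * 70 + "\t*",
    "r1\t0\tchr1\t5000\t60\t30M70S\t*\t0\t0\t" + "A" * 100 + "\t*",
]
records = full_length_records(sam)
print(len(records[("r1", 0)][9]))
```

Keep in mind that SEQ is stored in reference orientation, so for records with flag 0x10 set you would still have to reverse-complement it to get the read as sequenced.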
Hard-clipping does indeed cause unexpected unpleasantness. I consider it bad practice (a premature optimization, essentially).
Can you disable hard clipping in bwa altogether?
Have you tried bwa mem -Y? It makes bwa mem use soft clipping for supplementary alignments as well.