I am using BWA "mem" (with default settings) to align paired-end Illumina reads (read length 150 bp). In some alignments, I notice the CIGAR field is reported as "149M1S" (with FLAG 163). Why is the last base reported as soft-clipped?
As I understand the scoring scheme, by default a soft clip has a penalty of 5 and a mismatch has a penalty of 4, so reporting the base as a mismatch should give a higher score. Why is it reported as a soft clip instead?
Thank you.
Thank you very much for the reply, but I am still confused. Say I have a read of length 150 bp and the alignment CIGAR is "44S107M". My understanding is: the "44S" happens because there are more than 8 mismatches (assuming no indels) within those 44 bp, so the penalty for reporting a soft clip is smaller than the penalty for reporting "44M" (with those mismatches), and the segment is therefore reported as "44S". If that is not the case, what does cause the soft clip?
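To make the arithmetic in this question concrete, here is a minimal sketch of the rule described under `-L` in the bwa manual: clipping is not applied when the best score reaching the end of the read is larger than the best local score minus the clipping penalty. It assumes the default scores (match 1, mismatch 4, clipping penalty 5), ignores indels, and only compares "clip the whole tail" against "align the whole tail". The function names are made up for illustration; this is not BWA's actual code.

```python
# Assumed bwa mem defaults for -A (match), -B (mismatch), -L (clipping).
MATCH, MISMATCH, CLIP = 1, 4, 5


def tail_score(length, mismatches):
    """Score contributed by an ungapped tail of `length` bases."""
    return (length - mismatches) * MATCH - mismatches * MISMATCH


def prefers_clip(tail_len, mismatches):
    """True if clipping the whole tail wins over aligning through it.

    Simplification: the real aligner may also keep part of the tail,
    depending on where the mismatches fall.
    """
    return tail_score(tail_len, mismatches) <= -CLIP


def min_mismatches_to_clip(tail_len):
    """Smallest mismatch count that makes clipping win, or None."""
    for k in range(tail_len + 1):
        if prefers_clip(tail_len, k):
            return k
    return None


print(tail_score(1, 1))            # -4: one terminal mismatch
print(prefers_clip(1, 1))          # False: -4 is still above -5
print(min_mismatches_to_clip(44))  # 10 under these assumptions
```

Under these assumptions a single mismatching last base would stay aligned, so this simple model alone does not explain "149M1S", and a 44 bp tail would need about 10 mismatches (not 8) before clipping wins on score.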
I don't know about the score; look at the SAM record itself for this `44S107M`. I'm pretty sure you'll find an `SA` tag containing some alternate alignments for the 44S section with a 'better' location (~ `44M107S`).

Yes, I find the `SA` tag. But why is that a `better` location? And back to my original question: `1S` happens at the end of the alignment. Why not directly report "150M"?

Because an alignment (`150M` => `149=1X`) should not end with a mismatch.

I didn't get the point: why should it not end with a mismatch? In the CIGAR field, "M" can be either a match or a mismatch, right? The `NM` tag (edit distance) will record how many mismatches there are.

This is a split read. Maybe it's an inversion, a large deletion, a translocation, etc. The correct way to say it is: it seems that a part of this read starts here and another part matches elsewhere.
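To check the "another part is matching elsewhere" idea on a real record, something like the sketch below can help. It parses an `SA:Z` value (the SAM spec lays each entry out as `rname,pos,strand,CIGAR,mapQ,NM;`) and reports which read bases each alignment covers. The `SA` string here is invented for illustration, and the helper names are my own, not part of any library.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sa(tag_value):
    """Parse the value of an SA:Z tag into a list of dicts."""
    hits = []
    for entry in tag_value.rstrip(";").split(";"):
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        hits.append({
            "rname": rname, "pos": int(pos), "strand": strand,
            "cigar": cigar, "mapq": int(mapq), "nm": int(nm),
        })
    return hits


def aligned_query_span(cigar):
    """Return (start, end) of the read bases covered, 0-based half-open.

    Coordinates follow the orientation of that alignment, so spans from
    opposite strands need flipping before they can be compared.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    lead = ops[0][0] if ops[0][1] in "SH" else 0
    consumed = sum(n for n, op in ops if op in "MI=X")
    return lead, lead + consumed


primary_cigar = "44S107M"              # CIGAR from the thread
sa_value = "chr2,12345,+,44M107S,60,0;"  # hypothetical SA:Z value

print(aligned_query_span(primary_cigar))  # (44, 151)
for hit in parse_sa(sa_value):
    print(hit["rname"], hit["pos"], aligned_query_span(hit["cigar"]))
```

If the supplementary span covers the bases that are clipped in the primary record, as in this made-up example, the read is split across two locations rather than clipped because of mismatches.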