Hi,
I have SAM files that have noise in them - i.e. reads that have long areas of soft clipping in them (especially the edges). I was wondering if there is a smart way/tool to filter all the "bad" reads using CIGAR?
I can get part of the way with awk (for example, to filter out long areas of soft clipping at the start of the read I did: awk '{if($6~/S/) split($6,a,"S"); if(a[1]<12) {print $6}}'), but I realize it is too naive, as the CIGAR can contain a whole array of other operations that would render this line useless (if I had 1H2S40M, for example, it would not work).
Anyone know of a smart way to deal with this?
Thanks in advance,
Have you ever seen soft clipping which is not on the edges of a read? Are you sure you have to remove these reads? Is this long read sequencing data?
See my example: I got a CIGAR with a hard clip followed by a soft clip (1H2S40M), which basically indicates, to me at least, that this is not a very reliable read location. So the clipping is at the edge, but my naive script wouldn't deal with it well. As for the aligner: I am using BWA mem (due to other restrictions), and it seems to let too much noise through.
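For a CIGAR like 1H2S40M, one option is to parse every operation instead of splitting on "S". The following is only a minimal Python sketch, not a tested tool: the names `edge_soft_clips` and `filter_sam` are made up for illustration, and the cutoff of 12 is simply borrowed from the awk attempt above. It relies on the SAM convention that hard clips can only sit outside soft clips, so they are dropped before looking at the ends.

```python
import re

# One CIGAR operation: a length followed by an op code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def edge_soft_clips(cigar):
    """Return (leading, trailing) soft-clip lengths, ignoring hard clips."""
    # Drop H operations so that 1H2S40M is seen as 2S40M.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:  # e.g. "*" for an unmapped read
        return 0, 0
    lead = ops[0][0] if ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return lead, trail

def filter_sam(lines, max_clip=12):
    """Yield header lines and reads whose edge soft clips are both < max_clip."""
    for line in lines:
        if line.startswith("@"):  # keep the SAM header untouched
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:  # not a valid alignment line, skip it
            continue
        lead, trail = edge_soft_clips(fields[5])  # column 6 is the CIGAR
        if lead < max_clip and trail < max_clip:
            yield line
```

It could be used along the lines of `sys.stdout.writelines(filter_sam(sys.stdin))`, with the SAM file piped in on standard input.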
If you have hard clipping, that tends to indicate that you have a supplementary alignment somewhere.
But a hard clipping of 1?
I don't think that actually existed in the file, it's just a made up example.
Regrettably, it is an actual example from my SAM file. Any pointers on how to change my BWA mem options to get rid of these? Thanks,
Smells like a bug to me - are you using an up-to-date version of bwa?
What is the end goal of this? It's normally more efficient to tell your aligner that a certain fraction of the read needs to align for it to be valid.
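If the aligner settings cannot be changed, the same "fraction of the read must align" idea can be approximated after the fact from the CIGAR. This is a rough Python sketch under stated assumptions, not an aligner option: `aligned_fraction` and `is_well_aligned` are hypothetical names, the 0.8 cutoff is an arbitrary placeholder, and "aligned" is taken here to mean read bases in M, =, X or I operations, with S and H counted as clipped.

```python
import re

# One CIGAR operation: a length followed by an op code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def aligned_fraction(cigar):
    """Fraction of the original read that is not soft- or hard-clipped."""
    aligned = clipped = 0
    for n, op in CIGAR_OP.findall(cigar):
        if op in "M=XI":    # read bases that take part in the alignment
            aligned += int(n)
        elif op in "SH":    # read bases clipped off either end
            clipped += int(n)
        # D, N and P consume no read bases, so they are ignored here.
    total = aligned + clipped
    return aligned / total if total else 0.0

def is_well_aligned(cigar, min_fraction=0.8):
    """True if at least min_fraction of the read is aligned (assumed cutoff)."""
    return aligned_fraction(cigar) >= min_fraction
```

With this definition, 1H2S40M counts as 40 aligned bases out of 43, so it would pass a 0.8 cutoff, whereas 20S20M would not.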