Filtering noise out of SAM file using CIGAR
6.0 years ago
blur ▴ 280

Hi,

I have SAM files that have noise in them, i.e. reads with long soft-clipped stretches (especially at the edges). Is there a smart way or tool to filter out all the "bad" reads using the CIGAR string? I can get part of the way with awk (for example, to filter out long soft clipping at the start of the read I did: awk '{if($6~/S/) split($6,a,"S"); if(a[1]<12) {print $6}}' ), but I realize this is too naive: a CIGAR can contain a whole range of operations that render this line useless (with 1H2S40M, for example, it would not work). Does anyone know of a smart way to deal with this?

Thanks in advance,

SAM CIGAR • 2.3k views

Have you ever seen soft clipping which is not on the edges of a read? Are you sure you have to remove these reads? Is this long read sequencing data?

See my example: I got a CIGAR with a hard clip followed by a soft clip, 1H2S40M, which indicates, to me at least, that this is not a very reliable read location. So the clipping is at the edge, but my naive script wouldn't deal with it well. As for the aligner: I am using BWA mem (due to other restrictions), and it seems to let too much noise through.

If you have hard clipping, that tends to indicate that you have a supplementary alignment somewhere.
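
One way to check this on a given record is to look at the FLAG field and the optional tags. The sketch below assumes tab-separated SAM text; the helper name `describe_record` and the example record are made up for illustration. It relies on the supplementary bit (0x800) in FLAG and on the `SA:Z:` tag that aligners typically attach to the pieces of a split read.

```python
SUPPLEMENTARY = 0x800  # FLAG bit marking a supplementary alignment


def describe_record(sam_line):
    """Summarise clipping-related hints for one SAM alignment line.

    Hypothetical helper: assumes the 11 mandatory tab-separated
    fields are present, followed by any optional tags.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    cigar = fields[5]
    return {
        "name": fields[0],
        "supplementary": bool(flag & SUPPLEMENTARY),
        # SA:Z: lists the other alignments of a chimeric (split) read
        "has_sa_tag": any(tag.startswith("SA:Z:") for tag in fields[11:]),
        "hard_clipped": "H" in cigar,
    }


# Made-up record: a supplementary alignment with a leading hard clip
example = "\t".join([
    "read1", "2048", "chr1", "100", "60", "60H40M", "*", "0", "0",
    "A" * 40, "I" * 40, "SA:Z:chr2,500,+,60M40S,60,0;",
])
info = describe_record(example)
```

If a hard-clipped record has neither the supplementary bit nor an `SA` tag, that would be worth a closer look.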

But a hard clipping of 1?

I don't think that actually existed in the file, it's just a made up example.

Regretfully, it is an actual example from my SAM file. Any pointers on how to change my BWA mem options to get rid of these? Thanks,

Smells like a bug to me. Are you using an up-to-date version of bwa?

What is the end goal of this? It's normally more efficient to tell your aligner that a certain fraction of the read needs to align for it to be valid.
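
If the aligner cannot be reconfigured, the same idea can be approximated after the fact. This is only a sketch: the function name `aligned_fraction` is made up, and counting insertions (I) as aligned bases is an assumption you may want to change. It computes which fraction of the original read (including hard-clipped bases) ended up in the aligned part of the CIGAR.

```python
import re

# One CIGAR operation: a length followed by an operation letter
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_fraction(cigar):
    """Return aligned bases / original read length for a CIGAR string.

    Assumption: M, I, = and X count as aligned; S and H count only
    towards the read length. Returns 0.0 for '*' (no CIGAR).
    """
    aligned = 0
    read_len = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "MI=X":
            aligned += n
        if op in "MIS=XH":  # operations that consume read bases (H: clipped away)
            read_len += n
    return aligned / read_len if read_len else 0.0
```

For the CIGAR from the question, `aligned_fraction("1H2S40M")` is 40/43, so a cutoff such as 0.8 (a hypothetical value) would keep that read.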

6.0 years ago
Carambakaracho ★ 3.3k

As Wouter points out above, by specification hard clipping can only be the first and/or last operation. Soft clipping also sits at the ends, with nothing but hard clipping allowed between it and the end of the CIGAR string. And as Devon wrote, you can influence clipping behavior in most aligners.

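To filter reads on CIGAR structure you can use something like this (untested, though). As written, it prints the records whose CIGAR is not simply optional clipping around a single match block; swap `!~` for `=~` to print the matching records instead:

 perl -nwe '@b=split; if ($b[5] !~ /^(\d+H)?(\d+S)?\d+M(\d+S)?(\d+H)?$/) {print $_;}'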
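
To apply an actual length threshold to the soft clips, the clip lengths have to be captured rather than just matched. A minimal sketch of that idea follows; the function names are made up, the threshold of 12 is borrowed from the awk attempt in the question, and records without a CIGAR (`*`) are dropped by assumption.

```python
import re

# Optional hard clip, optional soft clip (captured), one or more body
# operations, optional soft clip (captured), optional hard clip.
CLIP_RE = re.compile(
    r"^(?:\d+H)?(?:(\d+)S)?(?:\d+[MIDNP=X])+(?:(\d+)S)?(?:\d+H)?$"
)

MAX_SOFT_CLIP = 12  # hypothetical cutoff, taken from the question's awk line


def soft_clips(cigar):
    """Return (leading, trailing) soft-clip lengths, or None if unparsable."""
    m = CLIP_RE.match(cigar)
    if m is None:
        return None
    return int(m.group(1) or 0), int(m.group(2) or 0)


def keep_line(line, max_clip=MAX_SOFT_CLIP):
    """True for header lines and for reads with both soft clips below max_clip."""
    if line.startswith("@"):  # SAM header lines pass through unchanged
        return True
    cigar = line.split("\t")[5]
    clips = soft_clips(cigar)
    if clips is None:  # '*' or a CIGAR this sketch does not understand
        return False
    return max(clips) < max_clip


def filter_sam(lines, max_clip=MAX_SOFT_CLIP):
    """Yield only the lines that pass keep_line."""
    return (line for line in lines if keep_line(line, max_clip))
```

To run it as a filter, something like `sys.stdout.writelines(filter_sam(sys.stdin))` in a small script should do, fed with `samtools view -h` output. With these definitions `soft_clips("1H2S40M")` gives `(2, 0)`, so the hard clip in front no longer confuses the check.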

Thanks! I'll try this.
