Question

Add soft clipping to a bam file

0

Entering edit mode

10.4 years ago

Lee Katz ★ 3.2k

Hi, does anyone have a way of adding soft clipping to a bam file? I want to see what happens if I add 5bp of soft clipping to each read, to see if it removes false-positive SNPs. If it works, I want to automate it.

bam sam picard soft-clipping samtools • 4.5k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.4 years ago by Lee Katz ★ 3.2k

Ram · Answer 1 · 2015-04-15

GATK ClipReads: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_readutils_ClipReads.php

This tool provides simple, powerful read clipping capabilities that allow you to remove low quality strings of bases, sections of reads, and reads containing user-provided sequences.

There are three options for clipping (quality, position and sequence), which can be used alone or in combination. In addition, you can also specify a clipping representation, which determines exactly how ClipReads applies clips to the reads (soft clips, writing Q0 base quality scores, etc.). Please note that you MUST specify at least one of the three clipping options, and specifying a clipping representation is not sufficient. If you do not specify a clipping option, the program will run but it will not do anything to your reads.

Ram · Answer 2 · 2015-04-15

0

Entering edit mode

10.4 years ago

Lee Katz ★ 3.2k

Would this work? I just made this ad-hoc one-liner to break down the cigar string and then replace the first and last 5bp with "S" for soft padding, and then built up the string again. I am just not sure if it is the correct way to do it and if it will work in all cases.

samtools view sample1.fastq.gz-reference.sorted.bam | \
  perl -lane 'BEGIN{$soft="S"x5;}
              $str="";
              while($F[5]=~/(\d+)(\D+)/g)
              {
                $str.=$2 x $1;
              }
              $str=$soft . substr($str,5,-5) . $soft;
              $prevNt=substr($str,0,1);
              $cigar="";
              $count=1;
              for($i=1;$i<length($str);$i++)
              {
                $thisNt=substr($str,$i,1);
                if($thisNt eq $prevNt) {$count++}
                else {$cigar.="$count$prevNt"; $count=1; $prevNt=$thisNt;}
              }
              $cigar.="$count$prevNt";
              $F[5]=$cigar;
              print join("\t",@F);'

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.4 years ago by Lee Katz ★ 3.2k

1

Entering edit mode

Simply modifying the cigar string and clipping the read sequence won't be enough. You will have to also shift the position/positions of the reads and its mate. You may have to also edit other tags such as "NM:" tag because the clipped alignment may be harboring a mismatch or an indel. The more the number of alignment related tags the more complicated it would be. Changing cigar string can be easy for simple alignments for example where you don't see insertion or deletion or mismatches in the beginning or end of reads but there may be several alignments for which your code may not work correctly. Downstream variant callers will throw an error if they find an alignment (CIGAR) which is not compatible with values in different alignment related tags in the BAM file. I think you should follow Pierre's advice on using GATK to simple clean such reads. Another approach would be to perform the cleaning at the mpileup level. It would be much easier to clean nucleotides that are present towards the end or beginning of read as they are marked using ^ and $ character in the pileup format. But pileup only marks the first and the last nucleotides for a read. Hope it helps.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.4 years ago by Ashutosh Pandey 12k