Hi everyone,
I have an output bam file from GSNAP. I want to remove all the reads that have hard (H) & soft (S) clipping in the CIGAR string AND the corresponding mate for those reads. Note, I only have properly-paired reads in the starting bam file. For the first part I use the following command:
# remove reads with hard/soft clips info in the CIGAR string
samtools view -h sample.bam | awk '$6 !~ /H|S/{print}' | samtools view -bS - > sample.noclips.bam
However, this only removes the read with Hard/Soft Clipping & not its corresponding mate. Here is an example:
# example read HWI-ST1006:108:D245BACXX:6:1103:17271:82507
# extract this read from sample.bam
samtools view sample.bam | grep 'HWI-ST1006:108:D245BACXX:6:1103:17271:82507'
HWI-ST1006:108:D245BACXX:6:1103:17271:82507 163 chr8 19810830 20 99M = 19810929 896 AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA RG:Z:C00060 MD:Z:99 NH:i:2 HI:i:1 NM:i:0 XW:i:0 XV:i:0 SM:i:20 XQ:i:40 X2:i:35 CT:Z:2F99M0T1R4M698N95M
HWI-ST1006:108:D245BACXX:6:1103:17271:82507 419 chr8 19810830 0 99M = 19811630 896 AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA RG:Z:C00060 MD:Z:99 NH:i:2 HI:i:2 NM:i:0 XW:i:0 XV:i:0 SM:i:0 XQ:i:35 X2:i:35
HWI-ST1006:108:D245BACXX:6:1103:17271:82507 83 chr8 19810929 20 4M698N95M = 19810830 -896 ACTGGCCTCGATCCAGCTGGACCTAACTTTGAGTATGCAGAAGCCCCGAGTCGTCTTTCTCCTGATGATGCAGATTTTGTAGACGTCTTACACACATTC ############################A>>A@A>@3DD@?6DDA;;<'@7<DB?DBBBEDC*<<>>CEAAFDBFE<EA?@<2@BC?D?DDD==DDA?? RG:Z:C00060 MD:Z:99 NH:i:2 HI:i:1 NM:i:0 XW:i:0 XV:i:0 SM:i:20 XQ:i:40 X2:i:35 XS:A:+
HWI-ST1006:108:D245BACXX:6:1103:17271:82507 339 chr8 19811630 0 3S96M = 19810830 -896 ACTGGCCTCGATCCAGCTGGACCTAACTTTGAGTATGCAGAAGCCCCGAGTCGTCTTTCTCCTGATGATGCAGATTTTGTAGACGTCTTACACACATTC ############################A>>A@A>@3DD@?6DDA;;<'@7<DB?DBBBEDC*<<>>CEAAFDBFE<EA?@<2@BC?D?DDD==DDA?? RG:Z:C00060 MD:Z:96 NH:i:2 HI:i:2 NM:i:0 XW:i:0 XV:i:0 SM:i:0 XQ:i:35 X2:i:35
# extract the same read from sample.noclips.bam
samtools view sample.noclips.bam | grep 'HWI-ST1006:108:D245BACXX:6:1103:17271:82507'
HWI-ST1006:108:D245BACXX:6:1103:17271:82507 163 chr8 19810830 20 99M = 19810929 896 AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA RG:Z:C00060 MD:Z:99 NH:i:2 HI:i:1 NM:i:0 XW:i:0 XV:i:0 SM:i:20 XQ:i:40 X2:i:35 CT:Z:2F99M0T1R4M698N95M
HWI-ST1006:108:D245BACXX:6:1103:17271:82507 419 chr8 19810830 0 99M = 19811630 896 AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA RG:Z:C00060 MD:Z:99 NH:i:2 HI:i:2 NM:i:0 XW:i:0 XV:i:0 SM:i:0 XQ:i:35 X2:i:35
HWI-ST1006:108:D245BACXX:6:1103:17271:82507 83 chr8 19810929 20 4M698N95M = 19810830 -896 ACTGGCCTCGATCCAGCTGGACCTAACTTTGAGTATGCAGAAGCCCCGAGTCGTCTTTCTCCTGATGATGCAGATTTTGTAGACGTCTTACACACATTC ############################A>>A@A>@3DD@?6DDA;;<'@7<DB?DBBBEDC*<<>>CEAAFDBFE<EA?@<2@BC?D?DDD==DDA?? RG:Z:C00060 MD:Z:99 NH:i:2 HI:i:1 NM:i:0 XW:i:0 XV:i:0 SM:i:20 XQ:i:40 X2:i:35 XS:A:+
So, how do I remove reads with hard/soft clipping along with its mate?
How important are those secondary alignments for your downstream analysis? If they aren't important you can follow what Pierre suggested. In the other case, when secondary alignments are important for the downstream analysis, it may get little trickier as you want to make sure that you are removing the correct pair and not everything. For example, 83 and 163 will have to be removed together. Similarly 339 and 419.
If you only one secondary alignment for each primary alignment, you can create two separate bam files using "samtools view -F 256" and then process both the files as Pierre suggested and then merge them later on.
Comment edited after Pierre's answer
Ashutosh Pandey I do have more than one secondary alignment for primary alignments & I want to keep them. The only reason I am removing hard/soft clipping is because the downstream analysis (with R package asSeq) does not recognize these clippings. But it also wants reads to be paired. In the above example, 83 & 163 shouldn't be removed but 339 & 419 should be.
it would be easy with pysam. You can recreate a bam file with reads of your interest.
Hello Komal,
I am trying to use the command you have in your original post:
samtools view -h sample.bam | awk '$6 !~ /H|S/{print}' | samtools view -bS - > sample.noclips.bam
I have an alignment file(bam from single end). My goal is to remove the soft/hard clipped alignments from the original bam file. When I used the command you have in your original post I do not get the expected alignment in sample.noclips.bam
Meaning, when I use
samtools view sample.bam | grep read1
I get the resulting alignment. However, when I try to find the same read:
samtools view sample.noclips.bam | grep read1
I do not find it. There are very few reads in my sample.noclips.bam. Am I doing anything wrong here?