Before the problem, I give some context.
I am developing a amplicon NGS bioinformatics pipeline. I keep the primers to do the alignment, then I use samtools-ampliconclip to mask the primers and finally I use Pisces to call the variants.
The problem is that occasionally, reads come with a large soft-clip in the middle of the read and then when both extreme of the reads are soft-cliped, the whole read is a soft-clipped. The CIGAR string for this reads is 150S and the length of the read is 150. Pisces is not happy with this and return an error
System.Exception: RACP2-6poolv4_P5-A_FINAL_SORTED.bam: Error processing chr 'chr7': Failed to process variants for MN01972:49:000H5KYMY:1:11102:26356:2968 ... 150S ---> System.Exception: Failed to process variants for MN01972:49:000H5KYMY:1:11102:26356:2968 ... 150S
at Pisces.Domain.Logic.CandidateVariantFinder.ProcessCigarOps(Read alignment, String refChromosome, Int32 readStartPosition, String chromosomeName)
at Pisces.Logic.SomaticVariantCaller.Execute()
at Pisces.Processing.Logic.BaseGenomeProcessor.ProcessByBam(BamWorkRequest workRequest, String chrName)
--- End of inner exception stack trace ---.
8/19/23 7:21 PM 1 Exception reported:
System.Exception: RACP2-6poolv4_P5-A_FINAL_SORTED.bam: Error processing chr 'chrX': Failed to process variants for MN01972:49:000H5KYMY:1:22101:7666:3361 ... 151S ---> System.Exception: Failed to process variants for MN01972:49:000H5KYMY:1:22101:7666:3361 ... 151S
at Pisces.Domain.Logic.CandidateVariantFinder.ProcessCigarOps(Read alignment, String refChromosome, Int32 readStartPosition, String chromosomeName)
at Pisces.Logic.SomaticVariantCaller.Execute()
at Pisces.Processing.Logic.BaseGenomeProcessor.ProcessByBam(BamWorkRequest workRequest, String chrName)
--- End of inner exception stack trace ---.
8/19/23 7:21 PM 1 Exception reported:
System.Exception: Failed to process variants for MN01972:49:000H5KYMY:1:11102:26356:2968 ... 150S
at Pisces.Processing.Logic.BaseProcessor.Execute(Int32 maxThreads)
at Pisces.Program.ProgramExecution()
at CommandLine.Application.BaseApplication`1.Execute().
Investigating this error, one of the two the problematic reads is this
MN01972:49:000H5KYMY:1:22104:2296:19556 16 chr7 148817421 60 151S * 00 TTCAAATTTACTGCCAATGAAGAACCAAAAACATGTCGGCTCATCACCACTTCACAACACAATCCCAAGCCGTTGAAAAATTGGCCTCAACATTTATTGTGTTTGACAACGTTTAATTGGATCTACAAAACCAAATGTAAGCACTGGTCAA A//6///A/F///=FF/////F//////FF///F///F//F////F/F=////////FF///A///F6F//F/F/FF///FF/F//F/////FA//F/F///FF///A/F////F/F////F///FFFFFFFFFFF6FFF/FFFF/AAFFF RG:Z:230817_MN01972_0049_A000H5KYMY_RACP2-6poolv4_P5-BAS:i:30 XS:i:19
So I am wondering how to manipulate the bam file correctly to mark as QC fail reads with a soft-clip longer than a number for example. Approaches I have though are to descard with Pisces low-QM-reads but reads like this has a high QM. Then, I have look at in samtools-ampliconclip documentation which has some parameters to unmapped reads or filter out reads with no matching bases but this doesn't help.
I could use something similar to this
# remove reads with hard/soft clips info in the CIGAR string
samtools view -h sample.bam | awk '$6 !~ /H|S/{print}' | samtools view -bS - > sample.noclips.bam
# From https://www.biostars.org/p/137461/
I was wondering if there is a tools that does what I need.
ChapGPT says that there is not a tool that is able to do this and recommened me to do this
samtools view -h sample.bam | awk '($1 ~ /^@/) || ($6 !~ /^[0-9]{3,}S/)' | samtools view -bS - > sample.filtered.bam
I would write a tool using pysam in Python and have it parse out the length of the clip. Grepping may work, but then you would lose short clips.