I am developing an NGS Bioinformatics pipeline to analyze somatic amplicon data. After mapping my reads, I soft-clip both ends of my reads because these reagions are the primers and I dont want them to be used for the variant calling. I soft-clip these regions with this simple code
apptainer exec "$samtools" samtools ampliconclip \
--strand \
--soft-clip \
--fail \
-b "$primercoordinatesbed" \
"${Sample_ID}"_mapped_sorted_IRealigned.bam > "${Sample_ID}"_CLIPPED.bam
After analyzing many different samples, I have realised that occasionally, a small number of reads show a soft clip that covers the entire read
MN01972:56:000H5KYM3:1:11102:21686:4925 256 chrX 124025823 0 150S * 0 0 ATATAAATTTCTAAGGAAGGGTTTTAATTCGCGAATTTGACGATCGTTGCATTAACTCGCGAACCCAGCAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAATGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG AFFFFFFFFFFF6FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFFFFF66//FFF/FFAAFFF=A/FFAFFF/FF/FFF////AFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF SA:Z:chr2,32916355,+,104S46M,0,0; RG:Z:230926_MN01972_0056_A000H5KYM3_RACP2-6poolv6-P5-A UQ:i:0 AS:i:30
(I believe this happens because the reads came with a large soft clip in the middle and the two additional soft clips finally completely softcliped the entire reads)
This is very unlikely BUT This returned an error in PISCES (the variant calling I use) and to solve this issue I developed this code to unmapped those reads
apptainer exec "$samtools" samtools view -h "${Sample_ID}"_CLIPPED.bam \
| awk 'BEGIN {OFS="\t"; total=0; changed=0} $1 ~ /^@/ {print; next} {total++} $6 !~ /^[0-9]+S$/ {print} $6 ~ /^[0-9]+S$/ {$2 = 512; $5 = 0; changed++; print} END {print "Total reads: " total > "/dev/stderr"; print "Changed: " changed > "/dev/stderr"}' \
| apptainer exec "$samtools" samtools view -bS - > "${Sample_ID}"_FINAL.bam
This overcome the issue BUT The problem comes here.
I am improving my code and I have added ValidateSamFile before the PISCES to check my BAM file. This is more pedantic than PISCES and I get so far the following error
ERROR::INVALID_CIGAR:Record 1285596, Read name MN01972:56:000H5KYM3:1:11102:21686:4925, No real operator (M|I|D|N) in CIGAR
Is there a proper way to deal with these types of errors? I am wondering if I could use PYSAM instead of that horrible awk command to somehow null this problematic reads so that I can pass the validation checking without adding IGNORE=INVALID_CIGAR
in the ValidateSamFile
I never use PYSAM before. Any advice in any direction will be welcome.
NOTE: To delete those ^[0-9]+S reads is not a solution as my supervisor doesn't like this approach. I need something alternative to this.
Maybe you can trim read ends with fixed length in fastq file before alignment?