Hi
I have some reads aligned with bwa 0.7.15, and I've found some regions with reads ending with soft clipped bases in both ends (example CIGAR string 14S58M28S). I know that soft clipped reads could have a biological meaning, but in this case it seems more like a mapping error to me (only a part of the read has been mapped, while both ends have not). However, it is a primary alignment and the mapping quality seems decent (MAPQ=32), even if it's not exceptionally high.
I have two questions here:
-Am I right with my assumption of these being mapping errors? Is there any other explanation (biological or computational) for this events to happen?
-Is there any tool available for getting rid of double-soft-clipped reads? I could probably grep those lines out from the SAM file, but it will just get too long for applying it to many files.
Thank you all
I don't think this is a good enough reason to second guess the aligner and it sounds quite arbitrary to me. I would leave those reads in.
You also didn't tell us about your application, so that makes it harder to judge.
Well I am looking for structural variations in several genomes (Paired End Illumina) and I am trying to validate my pipeline against parent-child pairs and sequencing replicates (which should in principle harbor little to none SVs, specially the sequencing replicates). It seems that there is some noise (considering as noise all SVs present in only one of the sequencing replicates) associated to this reads. In those cases, specific regions have a bunch of reads with their pairs mapped in different chromosomes (many of them for each region) and with this double-soft-clipping. That was what made me think about this reads, and I could not get any biological explanation for the double soft-clipping by myself. Simple soft-clipping are possible breakpoints for the SVs, but two SVs that close in the genome, pointing to several different chromosomes, didn't seem right for me.