I'm working with Illumina MiSeq reads and I'm having some trouble with variant calling.
I used cutadapt for trimming adapters, bwa for alignment and GATK HaplotypeCaller (-dontUseSoftClippedBases) for variant calling.
I also used vcfx (http://www.castelli-lab.net/apps/apps_vcfx.php) to better check the calling. I used IGV to look at the positions that vcfx marked as interrogated and saw many reads with soft-clipped bases.
I read about soft- and hard-clipped bases and I think I understand the concept, but it's not clear to me WHAT these clipped bases actually are. Part of the read matches the genome (great base and mapping quality), but the soft-clipped parts don't match the genome or the adapters (these bases also have Phred scores >30, so trimming for quality doesn't help).
I did find some sequences like CGTGTCGCTGGTGCGGTCT that show up in many reads. I BLASTed it and it matched bacteria but not PhiX...
If anyone can help me understand what these reads might be it would really help me decide what to do with them!
If your read has soft-clipped bases, the adapter/whatever bases that are marked as soft-clipped are still in the SEQ column.
If your read has hard-clipped bases, then the alignment position and the other fields are the same, but the clipped bases have been removed from SEQ (presumably to save space, or to make things simpler for downstream programs which don't understand soft-clipping). Which bases get clipped is up to whatever program writes the alignment. In your case that is the aligner (bwa), not cutadapt: cutadapt removes adapter sequence from the reads before alignment, while bwa soft-clips the ends of a read that it cannot align to the reference (which is why you can't see them in the genome alignment).
Generally, soft-clipping is superior to hard clipping. Hard clipping is deleting data, which is generally a bad idea.
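To make the soft/hard distinction concrete, here is a minimal Python sketch that splits a read's SEQ into its soft-clipped and aligned parts using the CIGAR string. The function name `clip_summary` and the example reads are made up for illustration, and it assumes a well-formed CIGAR in which `H` appears only at the very ends and `S` sits just inside it.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def clip_summary(cigar, seq):
    """Split SEQ into soft-clipped and aligned parts (hypothetical helper)."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard-clipped bases (H) are counted in the CIGAR but absent from SEQ.
    hard = sum(n for n, op in ops if op == "H")
    core = [(n, op) for n, op in ops if op != "H"]
    # Soft-clipped bases (S) are still present at the ends of SEQ.
    left = core[0][0] if core and core[0][1] == "S" else 0
    right = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    end = len(seq) - right
    return {
        "hard_clipped": hard,
        "left_soft": seq[:left],
        "right_soft": seq[end:],
        "aligned": seq[left:end],
    }

# Soft clip: the 4 clipped bases are still in SEQ.
soft = clip_summary("4S6M", "ACGTTTTTTT")
# Hard clip: the 4 clipped bases are gone from SEQ.
hard = clip_summary("4H6M", "TTTTTT")
```

Collecting the `left_soft` and `right_soft` strings over all reads is one way to pull out recurring clipped sequences for a BLAST search.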
Why don't you do a clean de-novo assembly of your reads? You have mapped your reads to a reference sequence. If you see a pile-up of many clipped reads at some sites, this usually means that your reference sequence differs from the true sequence of your sample at that site. This often happens at sequence repeats, where the number of repeat units may vary substantially even between closely related cellular lineages.
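If you want to find such pile-ups without scrolling through IGV, a rough sketch like the one below tallies where the soft clips start on the reference. The name `clip_sites`, the `min_reads` threshold, and the example records are assumptions for illustration; the input is taken to be (RNAME, POS, CIGAR) tuples already pulled from a SAM file, with 1-based POS.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def clip_sites(records, min_reads=3):
    """Count soft-clip boundaries per reference position (hypothetical helper)."""
    counts = Counter()
    for chrom, pos, cigar in records:
        ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
        if not ops:
            continue
        # Operations that consume reference bases: M, D, N, = and X.
        ref_len = sum(n for n, op in ops if op in "MDN=X")
        if ops[0][1] == "S":
            # Left clip: the boundary is the first aligned base (POS).
            counts[(chrom, pos)] += 1
        if len(ops) > 1 and ops[-1][1] == "S":
            # Right clip: the boundary is the first reference base after the alignment.
            counts[(chrom, pos + ref_len)] += 1
    return {site: n for site, n in counts.items() if n >= min_reads}

records = [
    ("chr1", 100, "20M5S"),
    ("chr1", 105, "15M8S"),
    ("chr1", 110, "10M3S"),
    ("chr1", 200, "25M"),
]
sites = clip_sites(records)
```

Sites where many reads clip at the same coordinate are the candidates for a difference between your sample and the reference, and those are the regions worth checking in an assembly.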