I'm working with shallow coverage genomic sequencing data. I'm trying to pull out unique variants from one of my samples, but from combing over the alignments on IGV, I can see that there are several indels and SNPs being left out from my .vcf with other very low quality/depth SNPs being included.
I can see the deletion in my filtered .bam file (see below), but it's lost once I use bcftools mpileup -- even with very low quality thresholds. I think this deletion is real because it's also a parent line in an F2 screen and so it is present in tens of other samples.
I'm new to bioinformatics and sequencing analysis, but I can see that there are older posts dealing with this from ~5-10 years ago with bcftools version 1.12, but I'm currently using version 1.2.
Am I doing anything obviously wrong? Thanks.
Additional info:
BWA-MEM2 for alignment
samtools view -b -q 20 -F 4 -F 256 -F 512 and markdup -r to generate my .bam file
IIUC, you have a ~7 bp deletion visible in IGV (with a 7D CIGAR string and ~17–20× depth) that disappears after bcftools mpileup. I can't definitively say I have the right answer for you, but it's pretty likely to be one of the following:
Upgrade Bcftools:
This kind of issue with indel handling was common with older versions of bcftools, in particular when dealing with low-coverage data (your data, anticipated to have ~8–10 reads supporting a heterozygous call (but could be fewer)) fitgs that description.
Versions after v1.17 include improved indel models (e.g., experimental --indels-2.0 flag) that better handle low-coverage scenarios.
Pipeline Configuration:
First, it isn't clear to me what kind of family data you have. If you have sequencing results for everyone, I'd strongly recommend going with a tool that can leverage info from your other samples. Whether or not the family data you are describing is usable, I am still not convinced you should be using bcftools here ...
But if you stick with bcftools, ensure that after running bcftools mpileup you follow with an appropriate calling step (e.g., bcftools call with optimal parameters) to avoid discarding true indels.
GATK HaplotypeCaller: Uses local de novo assembly for more sensitive SNP and indel detection. Running in GVCF mode with joint genotyping improves calls.
DeepVariant/DeepTrio: Deep learning–based approaches that convert sequence data into “pileup images,” enhancing both SNP and indel detection.
Manta: Bayesian variant callers tend to do well with complex indels, etc. because of the way they model data. While Manta is focused on structural variants, it is one of the more effective tools for detecting medium/small indels using paired-end and split-read signals.