Missing short indels from vcf
6 days ago
kbranger • 0

Hi all,

I'm working with shallow-coverage genomic sequencing data. I'm trying to pull out unique variants from one of my samples, but from combing over the alignments in IGV, I can see that several indels and SNPs are being left out of my .vcf, while other SNPs with very low quality/depth are being included.

I can see the deletion in my filtered .bam file (see below), but it's lost once I use bcftools mpileup, even with very low quality thresholds. I think this deletion is real because this sample is also a parent line in an F2 screen, and the deletion is present in tens of other samples.

I'm new to bioinformatics and sequencing analysis, but I can see that there are older posts dealing with this from ~5-10 years ago with bcftools version 1.12, but I'm currently using version 1.2.

Am I doing anything obviously wrong? Thanks.

Additional info:

  • BWA-MEM2 for alignment
  • samtools view -b -q 20 -F 4 -F 256 -F 512 and markdup -r to generate my .bam file
  • (stringent) bcftools mpileup -f "ref/path" -q 30 --min-BQ 20 -b "shortlist.txt" -o "testindel.bcf"
  • (loose) bcftools mpileup -f "ref/path" -q 10 --min-BQ 5 -b "shortlist.txt" -o "testindel.bcf"

Snippet of read associated with deletion in .bam file (depth is ~17-20 on either side of deletion):

  • NT_033778.4 18398702 60 18M7D132M = 18398744 MQ:i:60
bcftools shortindels mpileup • 343 views
3 days ago
LauferVA 4.7k

Hey kbranger,

IIUC, you have a ~7 bp deletion visible in IGV (with a 7D CIGAR string and ~17–20× depth) that disappears after bcftools mpileup. I can't definitively say I have the right answer for you, but it's pretty likely to be one of the following:

Upgrade Bcftools:

  • This kind of indel-handling issue was common with older versions of bcftools, particularly with low-coverage data. Your data fit that description: at ~17–20× depth, a heterozygous call would be supported by roughly 8–10 reads, possibly fewer. Versions 1.17 and later include improved indel models (e.g., the experimental --indels-2.0 flag) that better handle low-coverage scenarios.
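Since bcftools version numbers are easy to misread (1.2 is older than 1.12, which is older than 1.17), here is a rough sketch of how one might check whether an installed bcftools is new enough to try the experimental flag. The helper names `version_ge` and `indel_flag_for` are made up for illustration, the 1.17 cutoff is my assumption about when --indels-2.0 first appeared, and `sort -V` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Succeeds when dotted version $1 is >= dotted version $2.
# GNU sort -V compares component-wise, so 1.2 sorts before 1.12.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the extra mpileup flag worth trying for a given bcftools version.
# ASSUMPTION: --indels-2.0 is available from 1.17 onward.
indel_flag_for() {
  if version_ge "$1" "1.17"; then
    echo "--indels-2.0"
  else
    echo ""
  fi
}

if command -v bcftools >/dev/null 2>&1; then
  # The first line of `bcftools --version` looks like "bcftools 1.17".
  installed="$(bcftools --version | head -n 1 | awk '{print $2}')"
  echo "bcftools ${installed}: extra mpileup flag: '$(indel_flag_for "$installed")'"
else
  echo "bcftools not found on PATH"
fi
```

If the printed flag is empty, upgrading is probably the first thing to try before tuning any thresholds.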

Pipeline Configuration:

First, it isn't clear to me what kind of family data you have. If you have sequencing results for everyone, I'd strongly recommend a tool that can leverage information from your other samples. Whether or not the family data you describe are usable, I'm still not convinced bcftools is the right tool here ...

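  • But if you stick with bcftools, make sure you follow bcftools mpileup with a calling step (e.g., bcftools call -m with suitable parameters). mpileup on its own only produces genotype likelihoods, so nothing is actually called until that step runs, and poorly chosen calling parameters can discard true indels.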
  • GATK HaplotypeCaller: Uses local de novo assembly for more sensitive SNP and indel detection. Running in GVCF mode with joint genotyping improves calls.
  • DeepVariant/DeepTrio: Deep learning–based approaches that convert sequence data into “pileup images,” enhancing both SNP and indel detection.
  • Manta: Bayesian variant callers tend to do well with complex indels because of the way they model the data. Manta is focused on structural variants, but it is also one of the more effective tools for detecting medium-sized indels using paired-end and split-read signals, though a deletion as short as 7 bp may fall below the size range it is designed for.
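For the bcftools route in the first bullet, a minimal sketch of the mpileup-then-call pipeline might look like the following. The file names (ref.fa, calls.vcf.gz) and the helper `pipeline_cmd` are placeholders I made up, and the -q 10 / -Q 5 thresholds are simply the "loose" settings from the question, not a recommendation. The script prints the command it would run and only executes it if bcftools and the input files are actually present.

```shell
#!/usr/bin/env bash
# Hypothetical file names: substitute your own reference, BAM list, and output.
REF="${REF:-ref.fa}"
BAMS="${BAMS:-shortlist.txt}"
OUT="${OUT:-calls.vcf.gz}"

# Print the mpileup + call pipeline for a given reference, BAM list, and output.
# -q 10 / -Q 5 mirror the "loose" thresholds from the question.
pipeline_cmd() {
  printf 'bcftools mpileup -Ou -f %s -b %s -q 10 -Q 5 -a AD,DP | bcftools call -mv -Oz -o %s\n' "$1" "$2" "$3"
}

# Always show what would be run.
pipeline_cmd "$REF" "$BAMS" "$OUT"

# Run it only when bcftools and the inputs are actually present.
if command -v bcftools >/dev/null 2>&1 && [ -f "$REF" ] && [ -f "$BAMS" ]; then
  # mpileup emits genotype likelihoods; call -mv turns them into variant calls.
  bcftools mpileup -Ou -f "$REF" -b "$BAMS" -q 10 -Q 5 -a AD,DP \
    | bcftools call -mv -Oz -o "$OUT"
  # Show only the indel records, without the header.
  bcftools view -H -v indels "$OUT"
fi
```

If the deletion at NT_033778.4:18398702 is still absent from the indel records, that would point towards the indel model or version rather than the filtering thresholds.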