Missing short indels from vcf
1
0
Entering edit mode
3 months ago
kbranger • 0

Hi all,

I'm working with shallow coverage genomic sequencing data. I'm trying to pull out unique variants from one of my samples, but from combing over the alignments on IGV, I can see that there are several indels and SNPs being left out from my .vcf with other very low quality/depth SNPs being included.

I can see the deletion in my filtered .bam file (see below), but it's lost once I use bcftools mpileup -- even with very low quality thresholds. I think this deletion is real because it's also a parent line in an F2 screen and so it is present in tens of other samples.

I'm new to bioinformatics and sequencing analysis, but I can see that there are older posts dealing with this from ~5-10 years ago with bcftools version 1.12, but I'm currently using version 1.2.

Am I doing anything obviously wrong? Thanks.

Additional info:

  • BWA-MEM2 for alignment
  • samtools view -b -q 20 -F 4 -F 256 -F 512 and markdup -r to generate my .bam file
  • (stringent) bcftools mpileup -f "ref/path" -q 30 --min-BQ 20 -b "shortlist.txt" -o "testindel.bcf"
  • (loose) bcftools mpileup -f "ref/path" -q 10 --min-BQ 5 -b "shortlist.txt" -o "testindel.bcf"

Snippet of read associated with deletion in .bam file (depth is ~17-20 on either side of deletion):

  • NT_033778.4 18398702 60 18M7D132M = 18398744 MQ:i:60
bcftools shortindels mpileup • 716 views
ADD COMMENT
2
Entering edit mode
3 months ago
LauferVA 4.8k

Hey kbranger ,

IIUC, you have a ~7 bp deletion visible in IGV (with a 7D CIGAR string and ~17–20× depth) that disappears after bcftools mpileup. I can't definitively say I have the right answer for you, but it's pretty likely to be one of the following:

Upgrade Bcftools:

  • This kind of issue with indel handling was common with older versions of bcftools, in particular when dealing with low-coverage data (your data, anticipated to have ~8–10 reads supporting a heterozygous call (but could be fewer)) fitgs that description. Versions after v1.17 include improved indel models (e.g., experimental --indels-2.0 flag) that better handle low-coverage scenarios.

Pipeline Configuration: First, it isn't clear to me what kind of family data you have. If you have sequencing results for everyone, I'd strongly recommend going with a tool that can leverage info from your other samples. Whether or not the family data you are describing is usable, I am still not convinced you should be using bcftools here ...

  • But if you stick with bcftools, ensure that after running bcftools mpileup you follow with an appropriate calling step (e.g., bcftools call with optimal parameters) to avoid discarding true indels.
  • GATK HaplotypeCaller: Uses local de novo assembly for more sensitive SNP and indel detection. Running in GVCF mode with joint genotyping improves calls.
  • DeepVariant/DeepTrio: Deep learning–based approaches that convert sequence data into “pileup images,” enhancing both SNP and indel detection.
  • Manta: Bayesian variant callers tend to do well with complex indels, etc. because of the way they model data. While Manta is focused on structural variants, it is one of the more effective tools for detecting medium/small indels using paired-end and split-read signals.
ADD COMMENT
0
Entering edit mode

Hi,

You were right! While bcftools was sufficient for QTL, I was able to take the parent lines and original lines they were derived from and use GATK4-- which was able to detect the indel specified and others I suspected existed based on the CIGAR string.

Thank you!

ADD REPLY

Login before adding your answer.

Traffic: 1663 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6