Question

Inconsistency in SNP detection pipelines for multi-sample analysis

1

Entering edit mode

12 months ago

George ▴ 10

Hi everyone,

I'm working on a clustering analysis of multiple samples. I have BAM files that went through QC. My current workflow involves:

Variant calling with bcftools (To replicate GVCF behavior and avoid issues with 0/0 genotypes appearing as missing, I don't use the -v option).
Merging the resulting BCF files with bcftools merge
Filtering insertions/deletions (indels)
Splitting multiallelic sites
Clustering

I'm curious about a potential issue with this method. mpileup has a -b option that allows calling variants from a list of BAM files directly, bypassing the merging step (simulating a merged .bam file). However, the results from these two approaches differ:

bcftools merge method: Finds ~100,000 more SNPs, but lacks 12,000 SNPs identified by the mpileup -b method.
mpileup -b method: Finds fewer SNPs but includes 12,000 unique SNPs

While mpileup -b identifies fewer SNPs, its clustering aligns better with my expectations, suggesting higher accuracy.

However, this method comes at a cost: it's significantly slower and limited to single-threaded execution,making it impossible to run for 100+ samples. In contrast, calling variants individually and then merging them allows for parallelization, offering a substantial speed advantage.

Does anyone have insights into why these methods produce different results?

I'm happy to share my code and any additional information to help diagnose the issue.

Out of concern that indel calling might influence the results, I also attempted variant calling with the -I (mpileup) option. However, the merged VCF file was identical to the one generated by the first method without -I, after the filtering of Indels.

vcf bam bcf snps • 671 views

ADD COMMENT • link updated 12 months ago by LChart 4.9k • written 12 months ago by George ▴ 10

score 0 · Answer 1 · 2024-04-17

To replicate GVCF behavior and avoid issues with 0/0 genotypes appearing as missing, I don't use the -v option

I'm not sure what you mean by "replicate GVCF behavior". Why not write out a gvcf using bcftools call --gvcf? The call quality from merging from GVCFs should be roughly the same as from using mpileup.

However, this method comes at a cost: it's significantly slower and limited to single-threaded execution,making it impossible to run for 100+ samples. In contrast, calling variants individually and then merging them allows for parallelization, offering a substantial speed advantage.

Just a quibble: you can always parallelize over regions by specifying -r.