Hi everyone,
I'm working on a clustering analysis of multiple samples. I have BAM files that went through QC. My current workflow involves:
- Variant calling with bcftools (To replicate GVCF behavior and avoid issues with 0/0 genotypes appearing as missing, I don't use the -v option).
- Merging the resulting BCF files with bcftools merge
- Filtering insertions/deletions (indels)
- Splitting multiallelic sites
- Clustering
I'm curious about a potential issue with this method. mpileup has a -b option that allows calling variants from a list of BAM files directly, bypassing the merging step (simulating a merged .bam file). However, the results from these two approaches differ:
bcftools merge
method: Finds ~100,000 more SNPs, but lacks 12,000 SNPs identified by thempileup -b
method.mpileup -b
method: Finds fewer SNPs but includes 12,000 unique SNPs
While mpileup -b
identifies fewer SNPs, its clustering aligns better with my expectations, suggesting higher accuracy.
However, this method comes at a cost: it's significantly slower and limited to single-threaded execution,making it impossible to run for 100+ samples. In contrast, calling variants individually and then merging them allows for parallelization, offering a substantial speed advantage.
Does anyone have insights into why these methods produce different results?
I'm happy to share my code and any additional information to help diagnose the issue.
Out of concern that indel calling might influence the results, I also attempted variant calling with the -I
(mpileup) option. However, the merged VCF file was identical to the one generated by the first method without -I
, after the filtering of Indels.