Hi,
I am using bcftools to perform variant calling on my .bam files. I want to understand exactly which statistical model bcftools uses for variant calling. I went through the bcftools documentation at http://samtools.github.io/bcftools/bcftools.html, but are there any other resources that could help me understand it?
Thank you in advance.
A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data
Abstract
Motivation: Most existing methods for DNA sequence analysis
rely on accurate sequences or genotypes. However, in applications of
the next-generation sequencing (NGS), accurate genotypes may not be
easily obtained (e.g. multi-sample low-coverage sequencing or somatic
mutation discovery). These applications press for the development of
new methods for analyzing sequence data with uncertainty.
Results: We present a statistical framework for calling SNPs,
discovering somatic mutations, inferring population genetical
parameters and performing association tests directly based on
sequencing data without explicit genotyping or linkage-based
imputation. On real data, we demonstrate that our method achieves
comparable accuracy to alternative methods for estimating site allele
count, for inferring allele frequency spectrum and for association
mapping. We also highlight the necessity of using symmetric datasets
for finding somatic mutations and confirm that for discovering rare
events, mismapping is frequently the leading source of errors.
I don't know of a better paper, I'm afraid. I believe that Li 2011 describes the algorithms used by samtools/bcftools for calculating genotype likelihoods and calling variants. There is also a multiallelic calling model (bcftools call -m, long form --multiallelic-caller) implemented in more recent versions of bcftools, which is briefly described here.
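If it helps to see the genotype-likelihood idea in code, here is a minimal sketch of the diploid likelihood as I understand it from Li 2011, assuming independent base errors and treating every non-reference base as the same alternate allele. The function names (genotype_log_likelihoods, phred_scaled) and the probability floor are my own choices for illustration. This is not the bcftools implementation, which as far as I know applies further corrections on top of the simple model.

```python
import math

def genotype_log_likelihoods(bases, quals, ref="A", ploidy=2):
    """Log10 likelihood of the reads for g = 0..ploidy copies of the reference allele.

    Simplified model (assumption): each base is an independent observation with
    error probability eps = 10^(-Q/10), and all non-reference bases count as a
    single alternate allele.
    """
    k = len(bases)
    floor = 1e-300  # hypothetical guard so log10 never sees zero
    result = []
    for g in range(ploidy + 1):
        # The 1/m^k normalisation term, in log10 space.
        total = -k * math.log10(ploidy)
        for base, q in zip(bases, quals):
            eps = 10 ** (-q / 10)
            if base == ref:
                # Reference base: true read of a ref copy, or an error on an alt copy.
                p = (ploidy - g) * eps + g * (1 - eps)
            else:
                # Non-reference base: true read of an alt copy, or an error on a ref copy.
                p = (ploidy - g) * (1 - eps) + g * eps
            total += math.log10(max(p, floor))
        result.append(total)
    return result

def phred_scaled(log_likelihoods):
    """Phred-scale the likelihoods relative to the best genotype (best one becomes 0)."""
    best = max(log_likelihoods)
    return [round(-10 * (x - best)) for x in log_likelihoods]

if __name__ == "__main__":
    # Eight reads at one site, four matching the reference and four not, all Q30.
    ll = genotype_log_likelihoods("AAAACCCC", [30] * 8, ref="A")
    print(phred_scaled(ll))
```

Note that the list is indexed by the number of reference copies g, so index 0 is homozygous alternate and index 2 is homozygous reference. That is the reverse of the 0/0, 0/1, 1/1 order used for the PL field in a VCF.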
I would highly appreciate any suggestions at this point!
You could use pipelines such as VarScan or Strelka.
Hi, I am trying to understand how variants are called and reported by bcftools. What specific statistical model does it use?