Hello,
I am working with whole genome resequencing data from non-reference organisms. I am working with low-to-medium depth data (8X-20X) and as expected, there is a bias towards the reference allele during mapping and genotype calling. I have now encountered several scenarios where this bias overwhelms biological signal and misleads analyses (e.g., incorrect placement of low depth samples on phylogenetic trees, incorrect inference of gene flow between reference population and a low-depth sample). Looking at the individual genotype calls that are driving some of these patterns, I have found many instances where the low depth sample is called as heterozygous despite having no reference alleles at that position. For example, it might have a four alternate alleles and zero reference alleles, but get called as a heterozygote, presumably because the alternate allele has very low frequency and so the genotype caller decides that it is more likely that the low depth sample is a heterozygote that by chance didn't get the reference allele sequenced, than that it is really homozygous for the alternate allele. However, this severely biases my analyses, so I do not want any cases of a heterozygote being inferred if one of the alleles is not actually present in the sequencing data! I would much, much rather have a lot of missing heterozygote calls than have even a few false heterozygote calls. Does anyone know how to disable this type of false inference of heterozygotes, or to reduce the severity of reference bias during genotype calling?
I am using bcftools to call genotypes with a command like this, after mapping to the reference genome with bwa:
bcftools mpileup -B -C 50 -a AD,DP,SP -Ou -f ../reference.fa --threads 10 -b bam.list | bcftools call --samples-file sample_file --ploidy-file ploidy_file --threads 10 -f GQ,GP -m -O z - > allsites.vcf.gz
(where bam.list is a list of bam files for my populations)
I see there is an option no-reference
to not give bcftools a reference genome, but I can't find much documentation about what effect this has. Would this reduce reference bias by not letting bcftools know what is the "reference" vs "alternate" allele?