Dear all,
I am trying to call SNPs across 150 individuals of a non-model species genotyped using a WGS resequencing approach.
In short: I aligned the reads from each sample against the reference using BWA, then used bcftools mpileup to compute the read counts and bcftools call to call genotypes. I did this on each sample separately and allowed consensus genotypes (i.e. identical to the reference) to be called. I then used bcftools merge to create a single VCF file containing all the samples and filtered on missing-data rate.
I now want to apply a quality filter to remove genotypes with low read counts. The problem is that heterozygous genotypes usually have more reads than homozygous ones. As a result, filtering on read count (DP) produces a dataset where it is rare to observe a SNP with all three genotypes, which doesn't make much sense...
Is it normal for heterozygous genotypes to have higher DP than homozygous ones? If not, what could be the cause? If yes, how can I deal with this when filtering the VCF?
Thank you in advance,
OS
Hello,
can you please post some example lines of your vcf, where the differences you mentioned can be seen?
fin swimmer
It's not easy to generalize as there are 150 individuals x 10 M SNPs. Anyway, here I show three examples of SNPs predominantly homozygous-consensus, heterozygous and homozygous-alternative.
Homozygous-consensus:
Heterozygous:
Homozygous-alternative:
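
Since three hand-picked sites say little about 150 individuals x 10 M SNPs, it may help to summarise depth per genotype class over the whole file. The following is only a minimal Python sketch and not part of the pipeline described above: it assumes an uncompressed, tab-delimited merged VCF whose FORMAT column carries both GT and DP, and the function names `classify` and `mean_dp_by_genotype` are made up for illustration.

```python
from collections import defaultdict


def classify(gt):
    """Return 'hom_ref', 'het' or 'hom_alt'; None for missing or non-diploid GT."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    if alleles[0] != alleles[1]:
        return "het"
    return "hom_ref" if alleles[0] == "0" else "hom_alt"


def mean_dp_by_genotype(vcf_lines):
    """Average per-sample DP for each genotype class over all records."""
    totals = defaultdict(lambda: [0, 0])  # class -> [sum of DP, number of genotypes]
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        keys = fields[8].split(":")  # FORMAT column, e.g. GT:PL:DP
        if "GT" not in keys or "DP" not in keys:
            continue  # assumption: sites without per-sample DP are ignored
        gt_i, dp_i = keys.index("GT"), keys.index("DP")
        for sample in fields[9:]:
            values = sample.split(":")
            if len(values) <= max(gt_i, dp_i) or values[dp_i] in (".", ""):
                continue  # missing depth for this sample
            cls = classify(values[gt_i])
            if cls is None:
                continue  # missing genotype
            totals[cls][0] += int(values[dp_i])
            totals[cls][1] += 1
    return {cls: dp_sum / n for cls, (dp_sum, n) in totals.items()}
```

It could be run as `mean_dp_by_genotype(open("merged.vcf"))`, where `merged.vcf` is a placeholder file name; for a compressed file, `gzip.open(path, "rt")` from the standard library yields lines in the same way. A clearly higher mean for `het` than for both homozygous classes across the whole file would confirm that the pattern is systematic rather than an artefact of the chosen sites.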