I'm working with 70 samples of whole genome data (the whole genome of 70 individuals). I'm returning to the field after some time away from my academic training, and I don't have a lot of experience with whole genome data from which to draw comparisons. At this time, I don't have any demographic info about the samples.
I've been provided with BAM files, against which samtools mpileup and bcftools call have been run. Then the BCF files were merged with bcftools merge into one large file of 70 samples.
I'm currently working on validating the SNPs in the BCF files. One interesting curve I've noticed is the number of SNPs plotted against allele frequency (AF).
So this is the AF on the X, and number of SNPs at this AF on Y. The "jump point" is at AF=0.49...so basically, where the SNPs go from hetero- to homozygous.
I would intuitively expect (in a random population) that this curve would be closer to a bell curve...or at least more randomly distributed. And yet this is quite the opposite.
I'm also curious if the Ti/Tv ratio is significant...as there's a spike at the same place, and I wouldn't expect to see the ratio change amongst SNPs at all (should I?)
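For reference, Ti/Tv is the ratio of transitions (A<->G, C<->T) to transversions (all other single-base changes). A minimal sketch of that calculation, assuming the input contains only biallelic SNPs given as (REF, ALT) pairs; the helper name titv_ratio is hypothetical and not part of bcftools:

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snps):
    """Return Ti/Tv for a list of (ref, alt) single-base pairs.

    Assumes every pair is a valid biallelic SNP (ref != alt, both in ACGT).
    Returns None when there are no transversions.
    """
    ti = sum(1 for ref, alt in snps if (ref.upper(), alt.upper()) in TRANSITIONS)
    tv = len(snps) - ti
    return ti / tv if tv else None


# Two transitions (A>G, C>T) and one transversion (A>C).
example = [("A", "G"), ("C", "T"), ("A", "C")]
print(titv_ratio(example))
```

Applying the same function separately to the SNPs in each AF bin would show whether the ratio really changes at the jump point.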
Could you also add axis labels to the plots shown, to help readers?
Did you set missing genotypes to ref (-0) when using bcftools merge?
Hello, yes, the spike at 0.49 is unusual for a multi-sample VCF/BCF, but it could be explained if, when you merged your samples, the AF field was not updated by BCFtools. Did your colleagues split multi-allelic sites and left-align indels prior to merging?
In a single-sample VCF, however, I would fully expect a binary distribution for AF: either 0.5 (heterozygous with AC=1) or 1.0 (homozygous with AC=2). If you then merge this with another sample that does not contain the same variant, then you will get the following:
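As a rough sketch of the arithmetic involved (this is not bcftools output): assuming diploid genotypes, the AF recomputed from the merged genotypes depends on whether the second sample's genotype is left missing (./.) or set to reference (0/0), whereas an AF value that is simply carried over from the single-sample file would stay at 0.5. The function allele_frequency and its missing_to_ref parameter are hypothetical names used only for illustration.

```python
def allele_frequency(genotypes, missing_to_ref=False):
    """Recompute ALT allele frequency from diploid genotypes.

    Each genotype is a pair of alleles: 0 = REF, 1 = ALT, None = missing.
    With missing_to_ref=True, missing alleles are counted as REF
    (the behaviour one would expect after merging with -0).
    Returns None if no alleles are counted.
    """
    ac = 0  # ALT allele count
    an = 0  # total number of counted alleles
    for gt in genotypes:
        for allele in gt:
            if allele is None:
                if missing_to_ref:
                    an += 1  # treat the missing allele as REF
                continue
            an += 1
            if allele == 1:
                ac += 1
    return ac / an if an else None


het_only = [(0, 1)]                     # single-sample VCF, heterozygous
merged = [(0, 1), (None, None)]         # second sample lacks the variant

print(allele_frequency(het_only))                     # AC=1, AN=2
print(allele_frequency(merged))                       # missing stays missing
print(allele_frequency(merged, missing_to_ref=True))  # missing counted as 0/0
```

Under these assumptions, one heterozygous carrier among 70 samples with all other genotypes set to reference would give AC=1 and AN=140, so an AF that still reads 0.5 after merging would point to a value that was not recalculated.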
The phenomenon that you see may also be related to the sample population. You have not mentioned where the samples come from. If it's a community that has not mixed (genetically) with other groups of people for geographical, religious, ethnic, or other reasons, then I would fully expect a skewed AF distribution.