Unusual AF data (I think)
7.1 years ago
rightmirem ▴ 70

I'm working with 70 samples of whole genome data (the whole genome of 70 individuals). I'm returning to the field after some time away from my academic training, and I don't have a lot of experience with whole genome data from which to draw comparisons. At this time, I don't have any demographic info about the samples.

I've been provided with BAM files, on which samtools mpileup and bcftools call have been run. The resulting BCF files were then combined with bcftools merge into one large file of 70 samples.

I'm currently working on validating the SNPs in the BCF files. One interesting curve I've noticed is the number of SNPs plotted against allele frequency (AF).

Graphs

So this is AF on the X axis, and the number of SNPs at that AF on the Y axis. The "jump point" is at AF=0.49...so basically, where the SNPs go from hetero- to homozygous.

I would intuitively expect (in a random population) that this curve would be closer to a bell curve...or at least more randomly distributed. And yet this is quite the opposite.

I'm also curious if the Ti/Tv ratio is significant...as there's a spike at the same place, and I wouldn't expect to see the ratio change amongst SNPs at all (should I?)
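For reference, the Ti/Tv ratio mentioned here is the count of transitions (A<->G, C<->T) divided by the count of transversions (all other single-base substitutions). A minimal sketch of that calculation follows; the function names and the example SNP list are hypothetical and only illustrate the definition, they are not taken from the data in this post.

```python
# A transition stays within one base class (purine or pyrimidine);
# a transversion switches between the two classes.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Return True for A<->G or C<->T substitutions (assumes ref != alt)."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def titv_ratio(snps):
    """Return Ti/Tv for an iterable of (ref, alt) single-base pairs.

    Returns None when there are no transversions, to avoid dividing by zero.
    """
    ti = tv = 0
    for ref, alt in snps:
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None


# Hypothetical SNPs, for illustration only: 4 transitions, 2 transversions.
example = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("T", "G")]
print(titv_ratio(example))
```

Computing this separately for the SNPs in each AF bin would give the per-AF ratio described above.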

SNP genome next-gen • 1.3k views

Could you also add axis labels to the plots shown, to help readers?


Did you set missing genotypes to ref (-0) when using bcftools merge?
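To show why this question matters, here is a rough sketch of how AF changes depending on whether missing genotypes are left as missing or counted as reference. It assumes AN counts only called alleles, as the VCF specification defines it; the helper name `allele_frequency` and the example genotypes are hypothetical.

```python
def allele_frequency(genotypes, missing_to_ref=False):
    """Compute (AC, AN, AF) for the first ALT allele from diploid GT strings.

    genotypes: list of strings such as "0/1", "1|1" or "./.".
    missing_to_ref: if True, count missing alleles (".") as reference ("0"),
    mimicking a merge in which missing genotypes were set to ref.
    """
    ac = an = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                if not missing_to_ref:
                    continue  # missing alleles do not contribute to AN
                allele = "0"
            an += 1
            if allele == "1":
                ac += 1
    af = ac / an if an else None
    return ac, an, af


# Hypothetical site: one heterozygous sample, three samples with no call.
gts = ["0/1", "./.", "./.", "./."]
print(allele_frequency(gts))                       # missing left as missing
print(allele_frequency(gts, missing_to_ref=True))  # missing counted as ref
```

With missing genotypes left as missing, this site keeps AF=0.5 because only the two called alleles count towards AN; counting them as reference drops it to 0.125.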


Hello, yes, the spike at 0.49 is unusual for a multi-sample VCF/BCF, but it could be explained if, when you merged your samples, the AF field was not updated by BCFtools. Did your colleagues split multi-allelic sites and left-align indels prior to merging?

I would, however, fully expect a binary distribution for AF in a single-sample VCF: AF should be either 0.5 (heterozygous, AC=1) or 1.0 (homozygous, AC=2). If you then merge this with another sample that does not contain the same variant, you will get the following:

  • AC=1
  • AN=4 (2 samples * 2 chromosomes)
  • AF=0.25 (AC / AN)
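The arithmetic in the list above can be sketched in a few lines. This is only an illustration: `merge_counts` is a hypothetical helper, and it assumes the sample lacking the variant is counted as homozygous reference (AC=0, AN=2) rather than as missing.

```python
def merge_counts(per_sample):
    """Sum per-sample (AC, AN) pairs and return the merged (AC, AN, AF).

    Assumes at least one called allele overall, so AN is non-zero.
    """
    ac = sum(c for c, _ in per_sample)
    an = sum(n for _, n in per_sample)
    return ac, an, ac / an


het_sample = (1, 2)     # heterozygous carrier: AC=1, AN=2
absent_sample = (0, 2)  # variant absent, genotype taken as 0/0 (assumption)

print(merge_counts([het_sample]))                 # single-sample VCF
print(merge_counts([het_sample, absent_sample]))  # after merging two samples
```

The single-sample call gives AF=0.5, and the two-sample merge gives AC=1, AN=4, AF=0.25. If the AF field is copied over from the single-sample file instead of being recomputed this way, it stays at 0.5.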

The phenomenon that you see may also be related to the sample population. You have not mentioned where the samples come from. If it's a community that has not mixed (genetically) with other groups of people for geographical, religious, ethnic, or other reasons, then I would fully expect a skewed AF distribution.
