Question

Inflated QQ plot GWAS

12 months ago

Pedro • 0

Hi there,

I'm running a GWAS on close to 2000 animals using 500K SNP data in a repeatability model. My phenotypic dataset has 10 observations per animal on average. I'm using a GRM calculated with 20K SNP data to account for population stratification, even though my population is homogeneous (checked with a PCA). I'm not filtering for MAF. When I check the distribution of my p-values with a QQ plot, I get a strange pattern. The observed and expected values match up to -log10(p-value) = 2. Beyond that point the plot looks inflated, as if there were population structure. A Bonferroni correction puts my threshold at -log10(p-value) > 7. An FDR of 5% puts my threshold at -log10(p-value) > 4. Can anyone offer an explanation of why this is happening, or a way to deal with it?

Cheers

[Image: QQ plot]

GWAS • 826 views

updated 12 months ago by LauferVA 4.8k • written 12 months ago by Pedro • 0

You mention homogeneity of an animal population. How large are the haploblocks, on average?

12 months ago by LauferVA 4.8k

score 0 · Answer 1 · 2024-08-05

Seeing "animals" and "homogeneous population" together is a bit of a red flag. I would expect either some kind of cross-breeding experiment for trait/linkage studies, or pedigree-based experiments for breeding studies ("homogeneous population" seems borrowed from a sampling-based study). In crossing/breeding cases, the GRM is controlling for the pedigree structure as a source of stratification, in which case there should also be covariates for parental strains, grazing/housing/plot groups, and other environmental factors. Failing to account for these may inflate the test statistics.

Another possibility is batch effects from the genotyping chips, sample collection, library preparation date, etc. These can be largely mitigated by statistical genotype refinement (typically phasing/imputation), or by including indicators for those covariates. You may see differences in call rates or other array QC metrics that block out in some obvious way.

Another source of inflation is simple LD. Each associated variant will produce additional associations at every variant in high LD with it. It may be better to determine LD blocks, and (when you make the Q-Q plot) choose one variant at random from each block.
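As a rough sketch of that last suggestion (not a definitive implementation), the Python below keeps one randomly chosen p-value per LD block and then computes the expected and observed -log10(p) pairs for a Q-Q plot. It assumes you already have a block label for every variant from whatever LD tool you use; the function names `one_variant_per_block` and `qq_points`, and the toy p-values and block labels, are hypothetical and only for illustration.

```python
import math
import random
from collections import defaultdict


def one_variant_per_block(pvalues, block_ids, seed=0):
    """Keep one p-value, chosen at random, from each LD block.

    block_ids is assumed to hold one LD-block label per variant,
    computed beforehand by some other tool.
    """
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for p, block in zip(pvalues, block_ids):
        blocks[block].append(p)
    # Iterate blocks in sorted order so a fixed seed gives a fixed result.
    return [rng.choice(blocks[block]) for block in sorted(blocks)]


def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs, smallest p first.

    Expected quantiles under the null are (rank - 0.5) / n.
    All p-values must be strictly greater than zero.
    """
    observed = sorted(pvalues)
    n = len(observed)
    return [
        (-math.log10((rank - 0.5) / n), -math.log10(p))
        for rank, p in enumerate(observed, start=1)
    ]


# Toy data: three variants in block "b1" share one strong signal.
pvals = [1e-6, 2e-6, 5e-6, 0.3, 0.7, 0.04]
block_labels = ["b1", "b1", "b1", "b2", "b3", "b4"]

thinned = one_variant_per_block(pvals, block_labels)
points = qq_points(thinned)
```

Plotting `points` (expected on the x-axis, observed on the y-axis) alongside the Q-Q plot from all variants should show how much of the apparent inflation comes from correlated variants being counted repeatedly.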