I ran a GWAS with REGENIE 3.2.5 on more than 250,000 samples, and the returned p-values look highly inflated, with -log10(P) values up to 5000. As a result, over 10,000 variants passed the significance threshold of p < 5e-8, far more than reported in previous studies, so I suspect the values are inflated for some unknown reason. I briefly checked the REGENIE GitHub repository (this, and this), where inflated p-values have been reported, but without a satisfying answer.
I ran the same code on other, smaller groups from the same dataset, and the results were closer to what I expected, with around 200 significant associations.
Below is my pseudocode. Any suggestions or advice would be appreciated. Thank you!
plink2 \
--bfile bfile \
--mac 100 --geno 0.1 --hwe 1e-15 \
--mind 0.1 \
--keep eid.txt \
--write-snplist --write-samples --no-id-header \
--out qc_pass
# Total genotyping rate is 0.969388.
# 784256 variants and 488377 people pass filters and QC.
plink2 \
--bgen chr${chr}.bgen ref-first \
--sample chr${chr}.sample \
--keep eid.txt \
--mind 0.1 --maf 0.01 --mac 100 --geno 0.1 --hwe 1e-5 \
--export bgen-1.2 --out QCed_chr${chr} \
--memory 8000 require
# A total of 9255791 variants from chromosomes 1-22 and 259386 samples remained after filter
regenie \
--step 1 \
--bed ${PLINK_DATA_PREFIX} \
--phenoFile $PHENO_FILE \
--extract qc_pass.snplist \
--bsize 1000 \
--niter 30 \
--threads 16 \
--lowmem \
--lowmem-prefix ${TEMP_DIR}/pred \
--out step1
# Fitting null model
# * bim : [bfile.bim] n_snps = 784256
# -keeping variants specified by --extract
# -number of variants remaining in the analysis = 589385
# -keeping and mean-imputing missing observations (done for each trait)
# -number of phenotyped individuals = 258874
# * number of individuals used in analysis = 258874
regenie \
--step 2 \
--bgen QCed_chr${chr}.bgen \
--sample QCed_chr${chr}.sample \
--ref-first \
--phenoFile $PHENO_FILE \
--chr ${chr} \
--pred $STEP1_PRED_FILE \
--bsize 400 \
--threads 8 \
--gz \
--out step2_chr${chr}
# Association testing mode with multithreading using OpenMP
# * bgen : [QCed_chr1.bgen]
# -summary : bgen file (v1.2 layout, zlib compressed) with 259386 named samples and 715235 variants with 16-bit encoding.
# -keeping variants specified by --extract
# -sample file: QCed_chr1.sample
# -keeping only individuals specified by --keep
# * phenotypes : [phenofile] n_pheno = 10
# -number of phenotyped individuals = 258874
# * number of individuals used in analysis = 258874
# * # threads : [8]
# * block size : [400]
# * # blocks : [1787]
# * approximate memory usage : 2GB
# * using minimum MAC of 5 (variants with lower MAC are ignored)
# * user specified to test only on select chromosomes
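One way to put a number on the suspected inflation is the genomic inflation factor (lambda GC): the median observed chi-square statistic divided by the median expected under the null, which is about 0.455 for 1 degree of freedom. Below is a minimal sketch, assuming the step 2 output is a gzipped, whitespace-delimited table with a header row and a `CHISQ` column of 1-df statistics. The file name in the usage comment is hypothetical. In a sample of this size, a value above 1 can partly reflect polygenicity rather than confounding, so this is a starting point rather than a verdict.

```python
import gzip
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.454936


def read_chisq(path, column="CHISQ"):
    """Read one numeric column from a gzipped, whitespace-delimited results file."""
    values = []
    with gzip.open(path, "rt") as handle:
        header = handle.readline().split()
        idx = header.index(column)  # raises ValueError if the column is absent
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            field = line.split()[idx]
            if field != "NA":
                values.append(float(field))
    return values


def genomic_inflation(chisq_values):
    """Lambda GC: observed median chi-square over the expected null median."""
    return statistics.median(chisq_values) / CHI2_1DF_MEDIAN


# Usage (hypothetical file name, adjust to the actual step 2 output):
# chisq = read_chisq("step2_chr1_PHENO.regenie.gz")
# print(f"lambda GC = {genomic_inflation(chisq):.3f}")
```

Computing this separately for the large group and for one of the smaller groups would show whether the whole distribution is shifted or only the tail is extreme.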
Hi! Just jumping in with a suggestion rather than an answer: could it be that the p-values you're getting are not adjusted for multiple comparisons?
Yet you chose to add an answer rather than a comment. I've moved it to a comment now, please be more mindful in the future.
Thanks, SushiRoll. I was told to use the standard GWAS significance cutoff, which is 5e-8. I know this might be too "nonconservative". I am also confused about whether I should compare the crude p-value or the FDR-corrected "q-value" against the 5e-8 cutoff.
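For what it's worth, the 5e-8 cutoff is itself a multiple-testing correction: roughly 0.05 divided by one million independent common-variant tests. It is therefore meant for the raw p-value, while a q-value is compared against an FDR level such as 0.05. The sketch below illustrates both, under the assumption that the results carry a -log10(P) column. The function names are illustrative and not part of REGENIE. The comparison stays on the log scale because a p-value of 1e-5000 cannot be represented as a double.

```python
import math

# Genome-wide threshold on the -log10 scale: -log10(5e-8) is about 7.30.
GWS_LOG10P = -math.log10(5e-8)


def is_genome_wide_significant(log10p, threshold=GWS_LOG10P):
    """Compare a raw -log10(P) value against the genome-wide threshold."""
    return log10p > threshold


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Raw p-values go against 5e-8; q-values go against an FDR level such as 0.05.
hits = [lp for lp in (5000.0, 7.0, 9.2) if is_genome_wide_significant(lp)]
qvals = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Applying 5e-8 to q-values would stack two corrections and is not how either threshold is normally used.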