Question

Plink removing a lot of samples

6 months ago

Steven ▴ 50

I am running plink2 (step 2) on the UK-Biobank Research Analysis Platform as below:

    exome_file_dir="/Bulk/Exome sequences/Population level exome OQFE variants, PLINK format - final release"
    data_field="ukb23158"
    data_file_dir="/GRCh38"
    for chr in {1..22}; do
        run_plink_wes="plink2 --bfile ${data_field}_c${chr}_b0_v1 --no-pheno --keep diabetes_wes_full.phe --autosome --maf 0.01 --mac 20 --geno 0.1 --hwe 1e-15 --mind 0.1 --write-snplist --write-samples --no-id-header --out WES_c${chr}_snps_qc_pass"
        dx run swiss-army-knife \
            -iin="${exome_file_dir}/${data_field}_c${chr}_b0_v1.bed" \
            -iin="${exome_file_dir}/${data_field}_c${chr}_b0_v1.bim" \
            -iin="${exome_file_dir}/${data_field}_c${chr}_b0_v1.fam" \
            -iin="${data_file_dir}/diabetes_wes_full.phe" \
            -icmd="${run_plink_wes}" \
            --tag="Step2" --instance-type "mem1_ssd1_v2_x16" \
            --destination="${data_file_dir}" --brief --yes
    done

It seems to be removing a large number of my samples: according to the log (partial below), only 25,832 samples remain after --keep. The .phe file I load contains 276,167 sample IDs, which match the .fam file.
Any clues as to why so many samples are being removed at this stage?

I have been following the Diabetes GWAS tutorial with regenie.

    Using up to 16 threads (change this with --threads).
    469835 samples (254489 females, 215074 males, 272 ambiguous; 469835 founders) loaded from ukb23158_c1_b0_v1.fam.
    2687650 variants loaded from ukb23158_c1_b0_v1.bim.
    Note: No phenotype data present.
    --keep: 25832 samples remaining.
    Calculating sample missingness rates...
    0 samples removed due to missing genotype data (--mind).
    25832 samples (13937 females, 11895 males; 25832 founders) remaining after main filters.
    --write-samples: Sample IDs written to WES_c1_snps_qc_pass.id .
    Calculating allele frequencies...
    --geno: 37904 variants removed due to missing genotype data.
    --hwe: 4501 variants removed due to Hardy-Weinberg exact test (founders only).
    2628773 variants removed due to allele frequency threshold(s) (--maf/--max-maf/--mac/--max-mac).
    16472 variants remaining after main filters.
    --write-snplist: Variant IDs written to WES_c1_snps_qc_pass.snplist .
    End time: Mon Feb 17 12:23:43 2025
    set +x
    uploading file: /home/dnanexus/out/out/WES_c1_snps_qc_pass.log -> /WES_c1_snps_qc_pass.log
    uploading file: /home/dnanexus/out/out/WES_c1_snps_qc_pass.id -> /WES_c1_snps_qc_pass.id
    uploading file: /home/dnanexus/out/out/WES_c1_snps_qc_pass.snplist -> /WES_c1_snps_qc_pass.snplist
    END_LOG

PLINK2 plink GWAS • 1.1k views

5 months ago by Steven ▴ 50

Why do you think ALL the .phe file IDs match those in the .fam file?

6 months ago by chrchang523 11k

score 0 · Answer 1 · 2025-02-26

Well, as it turns out, the answer is rather straightforward: the UK Biobank scrambles sample IDs between projects. The liftover file we were using (to go from build 37 to 38) had been produced previously to save on analysis costs, so its sample IDs did not correspond to the ones in the current project. This is why so many samples were being removed from the analysis, though curiously not all of them.
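
For anyone hitting the same thing, a quick sanity check before running --keep is to count how many IDs in the .phe file actually occur in the .fam file. Below is a minimal sketch, not part of the tutorial: the function name count_shared_ids is made up for illustration, and it assumes both files are whitespace-delimited with FID in column 1 and IID in column 2, that the .phe file may start with a header line beginning with FID or #FID, and that both files are available locally (for example inside the same swiss-army-knife job).

```shell
#!/usr/bin/env bash
# Hypothetical helper: report how many FID/IID pairs from a keep/phenotype
# file are also present in a PLINK .fam file.
# Assumptions: whitespace-delimited files, FID in column 1, IID in column 2,
# optional header line in the .phe file starting with "FID" or "#FID".
count_shared_ids() {
    local phe="$1" fam="$2"
    awk '
        # First file (.fam): remember every FID/IID pair.
        NR == FNR { seen[$1 FS $2] = 1; next }
        # Second file (.phe): skip an optional header line.
        FNR == 1 && ($1 == "FID" || $1 == "#FID") { next }
        # Count each remaining row and whether its ID pair is in the .fam.
        {
            key = $1 FS $2
            total++
            if (key in seen) shared++
        }
        END { printf "%d of %d IDs found in .fam\n", shared, total }
    ' "$fam" "$phe"
}
```

Calling it as count_shared_ids diabetes_wes_full.phe ukb23158_c1_b0_v1.fam prints the overlap. If the shared count is far below the number of rows in the .phe file, the IDs most likely come from a different project (application), and --keep will silently drop everything that does not match.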