I am running plink2 (step 2) on the UK-Biobank Research Analysis Platform as below:
exome_file_dir="/Bulk/Exome sequences/Population level exome OQFE variants, PLINK format - final release"
data_field="ukb23158"
data_file_dir="/GRCh38"
for chr in {1..22}; do
  # plink2 command executed inside the Swiss Army Knife job for this chromosome
  run_plink_wes="plink2 --bfile ${data_field}_c${chr}_b0_v1 \
    --no-pheno --keep diabetes_wes_full.phe --autosome \
    --maf 0.01 --mac 20 --geno 0.1 --hwe 1e-15 --mind 0.1 \
    --write-snplist --write-samples --no-id-header \
    --out WES_c${chr}_snps_qc_pass"
  dx run swiss-army-knife \
    -iin="${exome_file_dir}/${data_field}_c${chr}_b0_v1.bed" \
    -iin="${exome_file_dir}/${data_field}_c${chr}_b0_v1.bim" \
    -iin="${exome_file_dir}/${data_field}_c${chr}_b0_v1.fam" \
    -iin="${data_file_dir}/diabetes_wes_full.phe" \
    -icmd="${run_plink_wes}" \
    --tag="Step2" --instance-type "mem1_ssd1_v2_x16" \
    --destination="${data_file_dir}" --brief --yes
done
It seems to be removing a large number of my samples: according to the log (partial, below), only 25,832 samples remain after --keep. The .phe file I load contains 276,167 sample IDs, which match those in the .fam file.
Any clues as to why so many samples are being removed at this stage?
I have been following the Diabetes GWAS tutorial with regenie.
Using up to 16 threads (change this with --threads).
469835 samples (254489 females, 215074 males, 272 ambiguous; 469835 founders) loaded from ukb23158_c1_b0_v1.fam.
2687650 variants loaded from ukb23158_c1_b0_v1.bim.
Note: No phenotype data present.
--keep: 25832 samples remaining.
Calculating sample missingness rates...
0 samples removed due to missing genotype data (--mind).
25832 samples (13937 females, 11895 males; 25832 founders) remaining after main filters.
--write-samples: Sample IDs written to WES_c1_snps_qc_pass.id .
Calculating allele frequencies...
--geno: 37904 variants removed due to missing genotype data.
--hwe: 4501 variants removed due to Hardy-Weinberg exact test (founders only).
2628773 variants removed due to allele frequency threshold(s) (--maf/--max-maf/--mac/--max-mac).
16472 variants remaining after main filters.
--write-snplist: Variant IDs written to WES_c1_snps_qc_pass.snplist .
End time: Mon Feb 17 12:23:43 2025
set +x
uploading file: /home/dnanexus/out/out/WES_c1_snps_qc_pass.log -> /WES_c1_snps_qc_pass.log
uploading file: /home/dnanexus/out/out/WES_c1_snps_qc_pass.id -> /WES_c1_snps_qc_pass.id
uploading file: /home/dnanexus/out/out/WES_c1_snps_qc_pass.snplist -> /WES_c1_snps_qc_pass.snplist
END_LOG
Why do you think ALL the .phe file IDs match those in the .fam file?
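One way to check this directly is to count how many ID pairs in the .phe file actually appear in the .fam file. The snippet below is only a rough diagnostic sketch, not something from the tutorial: it assumes both files are whitespace-delimited with FID in column 1 and IID in column 2, and the function name count_id_overlap is made up for illustration. A header row such as "FID IID ..." in the .phe file will simply not match and so is not counted.

```shell
# Hypothetical helper: count (FID, IID) pairs from a --keep file that also
# occur in a .fam file. Assumes FID and IID are the first two columns of both.
count_id_overlap() {
    keep_file="$1"
    fam_file="$2"
    # First pass (the .fam file): remember every FID/IID pair.
    # Second pass (the keep file): count the pairs seen in the first pass.
    awk 'NR == FNR { fam[$1 " " $2] = 1; next }
         ($1 " " $2) in fam { n++ }
         END { print n + 0 }' "$fam_file" "$keep_file"
}
```

For example, with the .phe and chromosome 1 .fam files available locally, `count_id_overlap diabetes_wes_full.phe ukb23158_c1_b0_v1.fam` prints the number of matching pairs. If that number is close to 25,832 rather than 276,167, the mismatch is in the IDs themselves (for instance FID and IID differing, or a different delimiter) rather than in plink2's other filters.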