Hi everyone, I am new to PLINK and am working on a final-year project in which I am generating a polygenic risk score (PRS) to predict COVID-19 severity and susceptibility.
My base data come from a GWAS summary statistics file from the HGI database, and my target data come from the UK Biobank. I am very unfamiliar with PLINK and with generating polygenic risk scores, and as an undergraduate student I feel like I have been thrown in at the deep end, so any help would be much appreciated!
From my base data I identified around 350,000 SNPs that are associated with COVID outcomes, and I extracted the genotype data for all of these SNPs for individuals in the UK Biobank (UKB). I am now trying to quality control this target data in PLINK 1.9, following this guide: https://choishingwan.github.io/PRS-Tutorial/target/. This is where I hit an error. The command I used is below:
./plink1 \
--bfile B2_SA_tester \
--maf 0.01 \
--hwe 1e-6 \
--geno 0.01 \
--mind 0.01 \
--write-snplist \
--make-just-fam \
--out B2_SA_STEP1QC_DONE
This is the error I got:
vboxuser@ubuntunagain:~/plink2_linux/B2_EUR_SAsians$ ./B2_SA_step1QC
PLINK v1.90b6.26 64-bit (2 Apr 2022) www.cog-genomics.org/plink/1.9/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to B2_SA_STEP1QC_DONE.log.
Options in effect:
--bfile B2_SA_tester
--geno 0.01
--hwe 1e-6
--maf 0.01
--make-just-fam
--mind 0.01
--out B2_SA_STEP1QC_DONE
--write-snplist
18354 MB RAM detected; reserving 9177 MB for main workspace.
358725 variants loaded from .bim file.
7627 people (4109 males, 3518 females) loaded from .fam.
Error: All people removed due to missing genotype data (--mind).
IDs written to B2_SA_STEP1QC_DONE.irem .
How is it possible that all of my samples are missing their genotype data? I checked and the .bed file is there, so I am not sure why PLINK is saying this.
I have a theory: this target data only contains genotypes at around 350,000 SNP locations rather than the complete genotype data for these individuals, so maybe PLINK thinks the data are flawed because only those locations are present? I only extracted genotypes for these 350,000 SNPs because my clumping step left only about 350,000 significant SNPs, which in short told me that, out of all the genotype data available, only these SNPs are linked to COVID. I wanted to save time and space by extracting only those.
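One way to test this theory would be to look at per-sample missingness directly. PLINK 1.9's --missing option writes a .imiss file with one row per sample and an F_MISS column (the fraction of missing calls for that sample). The following is a minimal Python sketch, assuming such a file has already been generated; the function name is hypothetical and the path is whatever --out prefix was used. It counts how many samples have a missing rate above a given --mind threshold:

```python
def count_samples_over_mind(imiss_path, mind=0.01):
    """Return (n_over, n_total): samples whose F_MISS exceeds `mind`, and all samples.

    Assumes a whitespace-delimited PLINK 1.9 .imiss file with a header row
    that contains an F_MISS column.
    """
    n_over = 0
    n_total = 0
    with open(imiss_path) as handle:
        header = handle.readline().split()
        f_miss_col = header.index("F_MISS")
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            n_total += 1
            # --mind removes samples whose missing rate exceeds the threshold
            if float(fields[f_miss_col]) > mind:
                n_over += 1
    return n_over, n_total
```

If this reported that every sample is over 0.01, the --mind error would simply reflect the threshold being stricter than the per-sample missingness in this SNP subset, rather than a problem with the .bed file.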
To troubleshoot, I removed the --mind option, and the log showed that my total genotyping rate was 0.95745.
The --geno filter then removed ~210,000 of the ~350,000 SNPs I had. Is this to be expected? That is quite a lot! Below is the output from the run where I removed --mind and kept --geno.
vboxuser@ubuntunagain:~/plink2_linux/B2_EUR_SAsians$ ./B2_SA_step1QC
PLINK v1.90b6.26 64-bit (2 Apr 2022) www.cog-genomics.org/plink/1.9/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to B2_SA_STEP1QC_DONE.log.
Options in effect:
--bfile B2_SA_tester
--geno 0.01
--hwe 1e-6
--maf 0.01
--make-just-fam
--out B2_SA_STEP1QC_DONE
--write-snplist
18354 MB RAM detected; reserving 9177 MB for main workspace.
358725 variants loaded from .bim file.
7627 people (4109 males, 3518 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 7627 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.95745.
210711 variants removed due to missing genotype data (--geno).
--hwe: 313 variants removed due to Hardy-Weinberg exact test.
75038 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
72663 variants and 7627 people pass filters and QC.
Note: No phenotypes present.
List of variant IDs written to B2_SA_STEP1QC_DONE.snplist .
--make-just-fam to B2_SA_STEP1QC_DONE.fam ... done.
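As a sanity check on the numbers in this log, here is a short Python sketch that tallies the variant filters and converts the total genotyping rate into a mean missing rate. All values are copied from the log above; the variable names are only illustrative.

```python
# Counts copied from the PLINK log above.
variants_loaded = 358725
removed_geno = 210711   # --geno 0.01
removed_hwe = 313       # --hwe 1e-6
removed_maf = 75038     # --maf 0.01

variants_kept = variants_loaded - removed_geno - removed_hwe - removed_maf
print(variants_kept)  # 72663, matching "72663 variants ... pass filters and QC"

# Share of loaded variants dropped by --geno alone.
print(f"{removed_geno / variants_loaded:.1%}")

# Mean missing rate implied by "Total genotyping rate is 0.95745".
total_genotyping_rate = 0.95745
mean_missing_rate = 1 - total_genotyping_rate
mind_threshold = 0.01

print(f"{mean_missing_rate:.5f}")          # 0.04255
print(mean_missing_rate > mind_threshold)  # True
```

The mean missing rate of roughly 4.3% is well above the 0.01 cut-off used for both --geno and --mind. That would be consistent with many SNPs and samples failing these filters, although the mean alone does not show how the missingness is spread across individual samples.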