Question

PLINK --geno and --mind

22 months ago

Ryon • 0

Hi everyone, I am new to PLINK and I am doing a final-year project in which I am generating a polygenic risk score to predict COVID-19 severity and susceptibility.

My base data is a GWAS summary statistics file from the HGI database, and my target data comes from the UK Biobank. I am very unfamiliar with PLINK and with generating polygenic risk scores, and as an undergraduate student I feel like I have been thrown in at the deep end, so any help would be much appreciated!

From my base data I identified around 350,000 SNPs associated with COVID outcomes, and I extracted the genotype data for all of these SNPs for individuals in the UK Biobank (UKB). I am now trying to quality control this target data with PLINK 1.9, following this guide: https://choishingwan.github.io/PRS-Tutorial/target/. This is where I hit an error. The command I used is below:

./plink1 \
    --bfile B2_SA_tester \
    --maf 0.01 \
    --hwe 1e-6 \
    --geno 0.01 \
    --mind 0.01 \
    --write-snplist \
    --make-just-fam \
    --out B2_SA_STEP1QC_DONE

This is the error I got:

vboxuser@ubuntunagain:~/plink2_linux/B2_EUR_SAsians$ ./B2_SA_step1QC 
PLINK v1.90b6.26 64-bit (2 Apr 2022)           www.cog-genomics.org/plink/1.9/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to B2_SA_STEP1QC_DONE.log.
Options in effect:
  --bfile B2_SA_tester
  --geno 0.01
  --hwe 1e-6
  --maf 0.01
  --make-just-fam
  --mind 0.01
  --out B2_SA_STEP1QC_DONE
  --write-snplist

18354 MB RAM detected; reserving 9177 MB for main workspace.
358725 variants loaded from .bim file.
7627 people (4109 males, 3518 females) loaded from .fam.
Error: All people removed due to missing genotype data (--mind).
IDs written to B2_SA_STEP1QC_DONE.irem .

How is it possible that all of my samples are missing their genotype data? I checked and the .bed file is there so I am not sure why it is saying this.

I have a theory: this target data only contains genotypes at ~350,000 SNP locations rather than the complete genotype data for these individuals, so maybe PLINK thinks the data is flawed because it only covers those locations? I only extracted these ~350,000 SNPs because my clumping step left only ~350,000 significant SNPs, which in short told me that, out of all the genotype data available, only these SNPs are linked to COVID, so I wanted to save time and space by extracting just those.

To troubleshoot, I removed the --mind flag, and the log showed that my total genotyping rate was 0.95745.

Then the --geno filter removed ~210,000 of my ~350,000 SNPs. Is this to be expected? That seems like quite a lot! Below is the log from the run with --mind removed and --geno kept.

vboxuser@ubuntunagain:~/plink2_linux/B2_EUR_SAsians$ ./B2_SA_step1QC 
PLINK v1.90b6.26 64-bit (2 Apr 2022)           www.cog-genomics.org/plink/1.9/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to B2_SA_STEP1QC_DONE.log.
Options in effect:
  --bfile B2_SA_tester
  --geno 0.01
  --hwe 1e-6
  --maf 0.01
  --make-just-fam
  --out B2_SA_STEP1QC_DONE
  --write-snplist

18354 MB RAM detected; reserving 9177 MB for main workspace.
358725 variants loaded from .bim file.
7627 people (4109 males, 3518 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 7627 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.95745.
210711 variants removed due to missing genotype data (--geno).
--hwe: 313 variants removed due to Hardy-Weinberg exact test.
75038 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
72663 variants and 7627 people pass filters and QC.
Note: No phenotypes present.
List of variant IDs written to B2_SA_STEP1QC_DONE.snplist .
--make-just-fam to B2_SA_STEP1QC_DONE.fam ... done.

PLINK COVID-19 • 5.6k views

ADD COMMENT • link updated 14 months ago by karthick ▴ 10 • written 22 months ago by Ryon • 0

Please do not paste screenshots of plain text content, it is counterproductive. You can copy and paste the content directly here (using the code formatting option), or use a GitHub Gist if the content volume exceeds the allowed length here.

ADD REPLY • link 22 months ago by Ram 44k

Ram · Answer 1 · 2023-01-13

22 months ago

chrchang523 11k

Try looser --geno and --mind thresholds. 0.1 is a common choice, and should let you keep your good samples and variants when the overall genotyping rate is >95%.
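For reference, the command from the question with only the two missingness thresholds loosened might look like the sketch below. The `run_step1_qc` wrapper and the `PLINK` override variable are illustrative names and not part of PLINK; the binary path and file prefixes are copied from the question and may differ on your machine.

```shell
# Sketch only: same filters as in the question, with --geno and --mind loosened to 0.1.
# PLINK is a hypothetical override for the binary path; it defaults to ./plink1 as in the question.
run_step1_qc() {
    # $1 = --geno threshold, $2 = --mind threshold (both default to 0.1)
    local geno="${1:-0.1}" mind="${2:-0.1}"
    "${PLINK:-./plink1}" \
        --bfile B2_SA_tester \
        --maf 0.01 \
        --hwe 1e-6 \
        --geno "$geno" \
        --mind "$mind" \
        --write-snplist \
        --make-just-fam \
        --out B2_SA_STEP1QC_DONE
}
```

Calling `run_step1_qc` with no arguments uses 0.1 for both thresholds; `run_step1_qc 0.05 0.05` would try something stricter. Note that --maf stays at 0.01.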

ADD COMMENT • link updated 22 months ago by Ram 44k • written 22 months ago by chrchang523 11k

Thank you. If you were to justify this change in a research paper, how would you explain why you picked a threshold of 0.1? Would it just be because it's commonly used?

ADD REPLY • link 22 months ago by Ryon • 0

It's actually the default value for both of these flags, when no parameters are provided. The tighter 0.01 threshold in the workflow you followed is better when your data is clean enough, but obviously you are not in that situation.
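One way to check how far the data is from a given threshold is to look at the missingness reports that PLINK 1.9 writes with `--missing` (`.imiss` per sample, `.lmiss` per variant), where F_MISS is the last column in both files. The helper below is a hypothetical sketch that counts rows above a cutoff; the function name and the `B2_SA_miss` output prefix are assumptions for illustration.

```shell
# Count rows in a PLINK .imiss or .lmiss report whose F_MISS (last column) exceeds a cutoff.
# $1 = report file, $2 = cutoff (for example 0.01 or 0.1). The header row is skipped.
count_above_fmiss() {
    awk -v t="$2" 'NR > 1 && ($NF + 0) > (t + 0) { n++ } END { print n + 0 }' "$1"
}

# Possible usage (commented out because it needs PLINK and the fileset from the question):
#   ./plink1 --bfile B2_SA_tester --missing --out B2_SA_miss
#   count_above_fmiss B2_SA_miss.imiss 0.01   # samples that --mind 0.01 would drop
#   count_above_fmiss B2_SA_miss.lmiss 0.1    # variants above the 0.1 cutoff
```

Comparing the counts at 0.01 and 0.1 gives a quick sense of how much each threshold costs before committing to one.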

ADD REPLY • link 22 months ago by chrchang523 11k

Oh I see, thank you!

Below I have made the recommended changes, but a lot of my SNPs were still removed: around 300,000 of the ~350,000. Is this to be expected, or is that quite a lot? I don't really know how much is considered "normal", or whether there is even a "normal" range I should expect. Does anyone have an idea of roughly what it should be?

vboxuser@ubuntunagain:~/plink2_linux/B2_EUR_SAsians$ ./B2_SA_step1QC 
PLINK v1.90b6.26 64-bit (2 Apr 2022)           www.cog-genomics.org/plink/1.9/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to B2_SA_STEP1QC_DONE.log.
Options in effect:
  --bfile B2_SA_tester
  --geno 0.1
  --hwe 1e-6
  --maf 0.1
  --make-just-fam
  --out B2_SA_STEP1QC_DONE
  --write-snplist

18354 MB RAM detected; reserving 9177 MB for main workspace.
358725 variants loaded from .bim file.
7627 people (4109 males, 3518 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 7627 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.95745.
35750 variants removed due to missing genotype data (--geno).
--hwe: 389 variants removed due to Hardy-Weinberg exact test.
266567 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
56019 variants and 7627 people pass filters and QC.
Note: No phenotypes present.
List of variant IDs written to B2_SA_STEP1QC_DONE.snplist .
--make-just-fam to B2_SA_STEP1QC_DONE.fam ... done.

ADD REPLY • link 22 months ago by Ryon • 0

I recommended changing the --mind argument to 0.1, not the --maf argument.

The --maf 0.01 filter is appropriate for the number of samples you have. As a rough guideline, if you have n observations to work with, each of which has two possibilities, you don't start running into insufficient-sample-size problems until you're dealing with frequencies lower than 1/sqrt(n). And in this case you have up to 7627 * 2 = 15254 allele observations per variant, so any variant with MAF >= 0.01 (as well as slightly lower) is fine under this guideline.
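The arithmetic behind that guideline can be checked quickly. This is only a sketch of the 1/sqrt(n) rule of thumb described above, with n taken as two allele observations per sample; `maf_floor` is an illustrative name.

```shell
# Rule-of-thumb frequency floor: 1/sqrt(n), where n = 2 * samples allele observations.
# $1 = number of samples
maf_floor() {
    awk -v samples="$1" 'BEGIN { printf "%.4f\n", 1 / sqrt(2 * samples) }'
}

maf_floor 7627   # prints 0.0081
```

Since 0.0081 is below 0.01, a --maf 0.01 filter sits comfortably above the floor for 7627 samples.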

ADD REPLY • link 22 months ago by chrchang523 11k

Why are important SNPs removed at the --geno 0.2 filter step? I am running a GWAS in PLINK, but an important Alzheimer's disease SNP (rs429358) is removed at the --geno step. Please explain, @chrchang523.

ADD REPLY • link 14 months ago by karthick ▴ 10