Question

Preparing Target Data

1

3.5 years ago

Jenish ▴ 20

Hi! I have a PGS file with the weights for each of the variants for a specific disease, CAD. The goal is to now calculate PRS scores for individuals in the UK Biobank genetic data. I have the plink bed, bim and fam files from the UK biobank data. What would be the steps to prepare the "Target data" from the uk biobank data?

I understand that using the PGS file as the "base data" requires adding a "fake" p-value column. I have also tried using the bed, bim and fam files from UK Biobank as the target data together with a GWAS file as the base data, and PRSice stops with "Error: All sample has invalid phenotypes!" and "Error: No sample left". What am I missing here?

> PRSice 2.3.5 (2021-04-06) 
https://github.com/choishingwan/PRSice
(C) 2016-2020 Shing Wan (Sam) Choi and Paul F. O'Reilly
GNU General Public License v3
If you use PRSice in any published work, please cite:
Choi SW, O'Reilly PF.
PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.
GigaScience 8, no. 7 (July 1, 2019)
2021-06-02 18:41:16
./bin/PRSice \
    --a1 a1 \
    --a2 a2 \
    --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \
    --base CAD_UKBIOBANK.gz \
    --beta  \
    --binary-target F \
    --bp bp \
    --chr chr \
    --extract PRSice.valid \
    --interval 5e-05 \
    --lower 5e-08 \
    --no-clump  \
    --num-auto 22 \
    --out PRSice \
    --pvalue pval \
    --seed 1999345467 \
    --snp oldID \
    --stat beta \
    --target chr# \
    --thread 4 \
    --upper 0.5

Initializing Genotype file: chr# (bed) 

Start processing CAD_UKBIOBANK 
================================================== 

SNP extraction/exclusion list contains 5 columns, will 
assume first column contains the SNP ID 

Base file: /shared/Jenish/CAD_UKBIOBANK.gz 
GZ file detected. Header of file is: 
uniqid chr bp a1 a2 beta se pval N af oldID info zval 

Reading 100.00%
7947837 variant(s) observed in base file, with: 
1202749 variant(s) excluded based on user input 
6745088 total variant(s) included from base file 

Loading Genotype info from target 
================================================== 

488377 people (223459 male(s), 264780 female(s)) observed 
488377 founder(s) included 

181798 variant(s) not found in previous data 
602458 variant(s) included 

There are a total of 1 phenotype to process 

Processing the 1 th phenotype 

Error: All sample has invalid phenotypes! 
Error: No sample left 

Error: 
Execution halted

PGS PRS PRSice • 2.6k views

ADD COMMENT • link updated 3.5 years ago by Sam ★ 4.8k • written 3.5 years ago by Jenish ▴ 20

score 2 · Accepted Answer · 2021-06-02

2

3.5 years ago

Sam ★ 4.8k

You also need the --no-clump --no-regress --bar-levels 1 --fastscore parameters. These tell PRSice not to perform clumping (I suppose you are using pre-computed effect sizes, which have already accounted for LD) and not to perform the regression, as you don't need to optimize the p-value thresholds.

(I assume you are using PRSice as I recognize the error message)

ADD COMMENT • link 3.5 years ago by Sam ★ 4.8k

0

Yes I am using PRSice.

The log that I added to the post above shows the run where I am using a GWAS summary statistic file as the base data and the UK biobank files (bed, bim and fam) as the target data.

I have also tried using a PGS file with weights and as you suggested, I added in the parameters. I added a pvalue of 1 to each row in the PGS file. And here's the log for it.

Rscript PRSice.R --prsice bin/PRSice --base pgs_test.txt --snp rsID --a1 effect_allele --stat effect_weight --pvalue pval --target chr# --thread 4 --binary-target F --no-clump --bar-levels 1 --fastscore --or --no-regress

PRSice 2.3.5 (2021-04-06)
https://github.com/choishingwan/PRSice
(C) 2016-2020 Shing Wan (Sam) Choi and Paul F. O'Reilly
GNU General Public License v3
If you use PRSice in any published work, please cite:
Choi SW, O'Reilly PF.
PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.
GigaScience 8, no. 7 (July 1, 2019)
2021-06-02 19:22:55
./bin/PRSice \
    --a1 effect_allele \
    --bar-levels 1 \
    --base pgs_test.txt \
    --binary-target F \
    --fastscore \
    --no-clump \
    --no-regress \
    --num-auto 22 \
    --or \
    --out PRSice \
    --pvalue pval \
    --seed 174295784 \
    --snp rsID \
    --stat effect_weight \
    --target chr# \
    --thread 4

Initializing Genotype file: chr# (bed)

Start processing pgs_test

Base file: pgs_test.txt
Header of file is:
rsID effect_allele effect_weight pval

Reading 100.00%
49310 variant(s) observed in base file, with:
49310 NA stat/p-value observed
0 total variant(s) included from base file

Error: No valid variant remaining

Error: Execution halted

ADD REPLY • link 3.5 years ago by Jenish ▴ 20

0

Can you check if your p-value and stats are correct? e.g. not NA? What's the output of head pgs_test.txt?

ADD REPLY • link 3.5 years ago by Sam ★ 4.8k
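
The check suggested above can also be scripted. This is only a minimal sketch, assuming a whitespace-delimited base file with a header row; the helper names `is_number` and `count_bad_rows` and the toy rows are hypothetical, while the column names are taken from the header printed in the log.

```python
import io

def is_number(text):
    """True if `text` parses as a float and is not NaN."""
    try:
        value = float(text)
    except ValueError:
        return False
    return value == value  # NaN is the only float not equal to itself

def count_bad_rows(handle, columns):
    """Count data rows with a missing or non-numeric value in any of `columns`."""
    header = handle.readline().split()
    positions = [header.index(name) for name in columns]
    bad = 0
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if any(p >= len(fields) or not is_number(fields[p]) for p in positions):
            bad += 1
    return bad

# Toy data (made up): the second file carries an extra leading index column,
# so the values no longer sit under the header names.
clean = io.StringIO(
    "rsID effect_allele effect_weight pval\n"
    "rs1 A 0.12 1\n"
    "rs2 G -0.05 1\n"
)
shifted = io.StringIO(
    "rsID effect_allele effect_weight pval\n"
    "0 rs1 A 0.12 1\n"
    "1 rs2 G -0.05 1\n"
)

bad_clean = count_bad_rows(clean, ["effect_weight", "pval"])
bad_shifted = count_bad_rows(shifted, ["effect_weight", "pval"])
print(bad_clean, bad_shifted)
```

On a real file, opening it with `open("pgs_test.txt")` and passing the handle in place of the `io.StringIO` objects should work the same way.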

0

That was a good check. I had modified the PGS file using pandas and forgot to drop the index when writing it to file. It works now. (I should have checked that; the error was self-explanatory.) Thanks for the help!

The resulting PRS scores should be stored in PRSice.all_score file, correct?

ADD REPLY • link 3.5 years ago by Jenish ▴ 20
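
The pandas pitfall described in this reply can be reproduced on a toy table. The sketch below assumes tab-separated output and uses made-up rows; the column names follow the header in the log, and `to_text` is a hypothetical helper. By default `DataFrame.to_csv` writes the row index as an extra first column with an empty header name, so a reader that splits on whitespace would plausibly match each header name to the wrong column. That would be consistent with the "NA stat/p-value" count in the log, although the thread does not confirm the exact mechanism.

```python
import io
import pandas as pd

# Toy stand-in for a PGS scoring file; the rows are made up for illustration.
pgs = pd.DataFrame({
    "rsID": ["rs1", "rs2"],
    "effect_allele": ["A", "G"],
    "effect_weight": [0.12, -0.05],
})
pgs["pval"] = 1  # placeholder p-value, as described earlier in the thread

def to_text(df, keep_index):
    """Serialize `df` as tab-separated text, with or without the row index."""
    buf = io.StringIO()
    df.to_csv(buf, sep="\t", index=keep_index)
    return buf.getvalue()

with_index = to_text(pgs, keep_index=True)      # pandas default behaviour
without_index = to_text(pgs, keep_index=False)  # what the base file needs

print(with_index)
print(without_index)
```

Writing the real file would then be a matter of passing a path instead of the buffer, for example `pgs.to_csv("pgs_test.txt", sep="\t", index=False)`.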

1

That's correct

ADD REPLY • link 3.5 years ago by Sam ★ 4.8k

0

The PRS scores for all the subjects in the UK Biobank are ~0.007. Is that expected? I guess it depends on the PGS catalog file being used, but could there be any other factors that might cause the PRS scores to be similar for all the subjects in the target data?

ADD REPLY • link 3.5 years ago by Jenish ▴ 20

1

No. Most likely the overlap between the base and target is low, so the resulting PRS is made up of only a few SNPs (UK Biobank has a small overlap with HapMap3, which might have led to this problem).

ADD REPLY • link 3.5 years ago by Sam ★ 4.8k
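
One way to test the low-overlap explanation is to count how many variant IDs in the scoring file also appear in the target .bim files. This is a rough sketch, not part of PRSice: the function names and toy rows are hypothetical, and it assumes the standard six-column PLINK .bim layout (chromosome, variant ID, genetic distance, position, allele 1, allele 2) and a scoring file whose ID column is named `rsID` as in the log. It compares IDs only and ignores allele matching.

```python
def bim_ids(lines):
    """Variant IDs from PLINK .bim lines (the ID is the second column)."""
    return {line.split()[1] for line in lines if line.strip()}

def score_ids(lines, id_column="rsID"):
    """Variant IDs from a whitespace-delimited scoring file with a header row."""
    rows = iter(lines)
    header = next(rows).split()
    col = header.index(id_column)
    return {line.split()[col] for line in rows if line.strip()}

def overlap_fraction(score, target):
    """Fraction of scoring-file variants that are present in the target."""
    if not score:
        return 0.0
    return len(score & target) / len(score)

# Toy data (made up) standing in for one .bim file and a scoring file.
toy_bim = [
    "1\trs1\t0\t1000\tA\tG",
    "1\trs9\t0\t2000\tC\tT",
]
toy_score = [
    "rsID effect_allele effect_weight pval",
    "rs1 A 0.12 1",
    "rs2 G -0.05 1",
    "rs3 T 0.30 1",
    "rs4 C 0.01 1",
]

fraction = overlap_fraction(score_ids(toy_score), bim_ids(toy_bim))
print(f"{fraction:.0%} of scoring-file variants found in target")
```

With per-chromosome target files, the ID sets from each .bim file could be combined with a set union before computing the fraction.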

0

That makes sense. Thanks.

Just an additional query about PRSice: using chr# for --target takes chromosomes 1 through 22 only, correct? It does not use the X, Y, XY and MT files?