Hi! I have a PGS file with the per-variant weights for a specific disease, CAD. The goal is now to calculate polygenic risk scores (PRS) for individuals in the UK Biobank genetic data. I have the PLINK bed, bim and fam files from the UK Biobank data. What would be the steps to prepare the "target data" from the UK Biobank data?
I understand that using the PGS file as the "base data" requires adding a "fake" p-value column. I have tried using the bed, bim and fam files from UK Biobank as the target with a GWAS file as the base data, and the run fails with "Error: All sample has invalid phenotypes!" and "Error: No sample left". What am I missing here?
PRSice 2.3.5 (2021-04-06)
https://github.com/choishingwan/PRSice
(C) 2016-2020 Shing Wan (Sam) Choi and Paul F. O'Reilly
GNU General Public License v3
If you use PRSice in any published work, please cite:
Choi SW, O'Reilly PF.
PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.
GigaScience 8, no. 7 (July 1, 2019)
2021-06-02 18:41:16
./bin/PRSice \
--a1 a1 \
--a2 a2 \
--bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \
--base CAD_UKBIOBANK.gz \
--beta \
--binary-target F \
--bp bp \
--chr chr \
--extract PRSice.valid \
--interval 5e-05 \
--lower 5e-08 \
--no-clump \
--num-auto 22 \
--out PRSice \
--pvalue pval \
--seed 1999345467 \
--snp oldID \
--stat beta \
--target chr# \
--thread 4 \
--upper 0.5
Initializing Genotype file: chr# (bed)
Start processing CAD_UKBIOBANK
==================================================
SNP extraction/exclusion list contains 5 columns, will
assume first column contains the SNP ID
Base file: /shared/Jenish/CAD_UKBIOBANK.gz
GZ file detected. Header of file is:
uniqid chr bp a1 a2 beta se pval N af oldID info zval
Reading 100.00%
7947837 variant(s) observed in base file, with:
1202749 variant(s) excluded based on user input
6745088 total variant(s) included from base file
Loading Genotype info from target
==================================================
488377 people (223459 male(s), 264780 female(s)) observed
488377 founder(s) included
181798 variant(s) not found in previous data
602458 variant(s) included
There are a total of 1 phenotype to process
Processing the 1 th phenotype
Error: All sample has invalid phenotypes!
Error: No sample left
Error:
Execution halted
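One thing worth ruling out for the "All sample has invalid phenotypes!" error is the phenotype column (column 6) of the target .fam file. The thread does not confirm the cause, so this is only a hedged sketch: the helper name `count_usable_phenotypes` is made up for illustration, and treating "-9" and "NA" as missing follows the common PLINK convention, which is an assumption about how PRSice reads a quantitative trait.

```python
# Hypothetical helper: count samples whose .fam phenotype (column 6) is numeric
# and not a missing code. Assumption: "-9" and "NA" mean missing (PLINK convention).
MISSING_CODES = {"-9", "NA", "NAN"}


def count_usable_phenotypes(fam_path):
    """Return (usable, total) sample counts for a whitespace-delimited .fam file."""
    usable = 0
    total = 0
    with open(fam_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 6:
                continue  # skip blank or malformed lines
            total += 1
            value = fields[5]
            if value.upper() in MISSING_CODES:
                continue  # explicitly coded as missing
            try:
                float(value)
            except ValueError:
                continue  # non-numeric entry, e.g. a text label
            usable += 1
    return usable, total
```

If `usable` comes back as 0, the .fam file carries no phenotype to regress on. In that case the options would be to supply phenotypes separately or to skip the regression step, as the `--no-regress` run later in this thread does.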
Yes I am using PRSice.
The log that I added to the post above shows the run where I am using a GWAS summary statistic file as the base data and the UK biobank files (bed, bim and fam) as the target data.
I have also tried using a PGS file with weights and, as you suggested, added the parameters. I set the p-value to 1 for every row in the PGS file. Here is the log for that run.
PRSice 2.3.5 (2021-04-06)
https://github.com/choishingwan/PRSice
(C) 2016-2020 Shing Wan (Sam) Choi and Paul F. O'Reilly
GNU General Public License v3
If you use PRSice in any published work, please cite:
Choi SW, O'Reilly PF.
PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.
GigaScience 8, no. 7 (July 1, 2019)
2021-06-02 19:22:55
./bin/PRSice \
--a1 effect_allele \
--bar-levels 1 \
--base pgs_test.txt \
--binary-target F \
--fastscore \
--no-clump \
--no-regress \
--num-auto 22 \
--or \
--out PRSice \
--pvalue pval \
--seed 174295784 \
--snp rsID \
--stat effect_weight \
--target chr# \
--thread 4
Initializing Genotype file: chr# (bed)
Start processing pgs_test
Base file: pgs_test.txt
Header of file is:
rsID effect_allele effect_weight pval
Reading 100.00%
49310 variant(s) observed in base file, with:
49310 NA stat/p-value observed
0 total variant(s) included from base file
Error: No valid variant remaining
Error: Execution halted
Can you check if your p-value and stats are correct, e.g. not NA? What's the output of the following?
head pgs_test.txt
That was a good check. I had modified the PGS file using pandas and forgot to drop the index when writing it to file. It works now. (I should have checked that; the error was self-explanatory.) Thanks for the help!
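For anyone hitting the same problem, here is a minimal sketch of writing the PGS table from pandas without the index. The function name `write_pgs_base` is made up for illustration; the column names are the ones shown in the log header above. By default `DataFrame.to_csv` writes the row index as an extra unnamed first column, which can shift the data relative to the header for whitespace-delimited readers.

```python
import pandas as pd


def write_pgs_base(pgs, path):
    """Add a placeholder p-value column and write a tab-delimited base file.

    `pgs` is expected to hold rsID, effect_allele and effect_weight columns.
    """
    out = pgs.assign(pval=1)  # "fake" p-value so every variant passes the threshold
    # index=False is the important part: it keeps the pandas row index out of the file.
    out.to_csv(path, sep="\t", index=False)


# Example call (the variant ID and weight here are made up):
# write_pgs_base(
#     pd.DataFrame({"rsID": ["rs0000001"], "effect_allele": ["A"], "effect_weight": [0.01]}),
#     "pgs_test.txt",
# )
```

Running `head pgs_test.txt` afterwards should show a header line with exactly four column names and no leading blank field.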
The resulting PRS scores should be stored in PRSice.all_score file, correct?
That's correct
The PRS scores for all the subjects in the UK Biobank are ~0.007. Is that expected? I guess it depends on the PGS Catalog file being used, but could any other factors cause the PRS scores to be similar for all the subjects in the target data?
No. Most likely the overlap between the base and target is low, so the resulting PRS are made up of only a few SNPs (UK Biobank has a small overlap with HapMap3, which might have led to this problem).
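A quick way to check the overlap directly is to compare the variant IDs in the base file with those in the target .bim files. This is a rough sketch with hypothetical helper names (`read_column`, `overlap_report`); it assumes the base file has a header row with the SNP ID in the first column, and that the .bim files are standard PLINK format with the variant ID in column 2. It matches IDs only and does no allele checking.

```python
def read_column(path, col, skip_header=False):
    """Collect the values of one whitespace-delimited column into a set."""
    values = set()
    with open(path) as handle:
        if skip_header:
            next(handle, None)
        for line in handle:
            fields = line.split()
            if len(fields) > col:
                values.add(fields[col])
    return values


def overlap_report(base_path, bim_paths, base_snp_col=0):
    """Return (shared, base_total): how many base SNP IDs appear in the target."""
    base_ids = read_column(base_path, base_snp_col, skip_header=True)
    target_ids = set()
    for bim_path in bim_paths:
        target_ids |= read_column(bim_path, 1)  # column 2 of a .bim file is the variant ID
    return len(base_ids & target_ids), len(base_ids)
```

The "variant(s) included" line in the PRSice log gives a similar count, so a small number there points to the same issue.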
That makes sense. Thanks.
Just an additional query about PRSice: using chr# for --target takes chromosomes 1 through 22 only, correct? It does not use the X, Y, XY and MT files?
Yes, as of now, we haven't implemented support beyond the first 22 chromosomes.
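To see which files a `chr#` target pattern would need for chromosomes 1 through 22, the expansion can be sketched as below. The helper names are made up, and checking for all three of .bed/.bim/.fam per chromosome is an assumption for illustration; this thread does not say which of them PRSice requires for each chromosome.

```python
from pathlib import Path


def expand_target_prefix(pattern, chromosomes=range(1, 23)):
    """Replace '#' in the pattern with each autosome number (1 through 22)."""
    return [pattern.replace("#", str(chrom)) for chrom in chromosomes]


def missing_plink_files(pattern, directory=".", extensions=(".bed", ".bim", ".fam")):
    """List expected PLINK file names that are absent from `directory`."""
    missing = []
    for prefix in expand_target_prefix(pattern):
        for ext in extensions:
            path = Path(directory) / (prefix + ext)
            if not path.exists():
                missing.append(path.name)
    return missing
```

For example, `expand_target_prefix("chr#")` gives the prefixes `chr1` through `chr22`; X, Y, XY and MT files are never part of that list.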