Question

PRS in UK Biobank - no covariate file and no phenotype file

3.2 years ago

Jalil Sharif ▴ 80

Hi there, I am trying to compute a polygenic risk score (PRS) from UK Biobank PLINK data using PRSice-2. However, I have neither a covariate file nor a phenotype file. How do I generate them?

Thanks

UK Biobank PRS • 4.7k views

ADD COMMENT • link 3.1 years ago by Jalil Sharif ▴ 80

score 0 · Answer 1 · 2021-09-14

You should at least have a phenotype of interest for you to work on. If not, then you need to better define what you are trying to do for us to help you.

Depending on your phenotype, you will usually include the PCs, genotyping batch, assessment centre, and maybe sex and age. All of that information should come with your UK Biobank application. I am not sure how your UK Biobank data are organized, so it is difficult for me to give direct advice. A more general guide can be found here: https://choishingwan.gitlab.io/ukb-administration/

score 0 · Answer 2 · 2021-09-15

3.2 years ago

Jalil Sharif ▴ 80

Hi Sam,

I have access to two different target datasets. The first has .bed .bim .fam files. The second has .bed .bim .bgen .bgen.bgi. The latter doesn't have a .fam file, so I can't run the QC on the second dataset. There is no covariate file. The phenotype I am trying to look at is Parkinson's disease.

ADD COMMENT • link 3.2 years ago by Jalil Sharif ▴ 80

If they are not already included in your UKB application, you can update your data basket to include, for example, the ICD10 codes (Data-Field 41202) and retrieve the subjects with Parkinson's disease. Then make the phenotype file yourself.
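A minimal sketch of that step, assuming the phenotype export is a CSV whose ICD10 columns are named `41202-0.0`, `41202-0.1`, and so on, and that Parkinson's disease is identified by ICD10 code G20. The function name, the toy IDs and the 0/1 case coding are illustrative assumptions; check which coding your PRS tool expects.

```python
import csv
import io


def build_phenotype_rows(ukb_csv, icd_prefix="G20", field="41202"):
    """Return (FID, IID, PD) rows; PD is 1 if any ICD10 entry starts with icd_prefix."""
    reader = csv.DictReader(ukb_csv)
    # Collect every array entry of the field, e.g. 41202-0.0, 41202-0.1, ...
    icd_cols = [c for c in reader.fieldnames if c.startswith(field + "-")]
    rows = []
    for rec in reader:
        codes = [rec[c] for c in icd_cols if rec[c]]
        is_case = any(code.startswith(icd_prefix) for code in codes)
        rows.append((rec["eid"], rec["eid"], 1 if is_case else 0))
    return rows


# Toy stand-in for the real export (made-up IDs and codes).
toy_export = io.StringIO(
    "eid,41202-0.0,41202-0.1\n"
    "1000001,G20,I10\n"
    "1000002,E119,\n"
    "1000003,,\n"
)
pheno_rows = build_phenotype_rows(toy_export)

print("FID IID PD")
for fid, iid, status in pheno_rows:
    print(fid, iid, status)
```

Using the participant ID for both FID and IID is a common convention for UK Biobank data, but it is an assumption here and should match the IDs in your .fam or .sample file.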

ADD REPLY • link 3.2 years ago by Mathias ▴ 90

Is there any difference between data-field 41202 and 41270?

ADD REPLY • link 3.2 years ago by Jalil Sharif ▴ 80

Check out the 'notes' section on these fields: 41202 and 41270. You will see that 41202 summarizes the main diagnoses. As Sam said, there are multiple ICD fields; you'll have to browse the portal and work out which ICD field suits your research question.

ADD REPLY • link 3.2 years ago by Mathias ▴ 90

There should be a .sample file for your bgen files, which acts as the .fam file for your bgen.

As for covariates, they should always come with your application if you have access to the genotype data. You just need to extract them from the phenotype file. For example, the PCs have field ID 22009 (40 arrays, one for each PC), genotype batch is field 22000, sex is 31, age is 21003 and assessment centre is 54. There are multiple ICD fields, and you might have to search for them yourself (too lazy to type them all out).
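As a rough sketch, the fields listed above could be pulled out of a wide phenotype CSV like this. The `-0.0` and `-0.N` column suffixes, the output column names and the toy values are assumptions about how the export is laid out, not something the thread confirms.

```python
import csv
import io

# Field IDs quoted in the reply; the "-0.0" instance suffix is an assumed export convention.
BASE_FIELDS = {"Sex": "31-0.0", "Age": "21003-0.0", "Batch": "22000-0.0", "Centre": "54-0.0"}


def extract_covariates(ukb_csv, n_pcs=6):
    """Return (header, rows) with FID, IID, the base covariates and the first n_pcs PCs."""
    reader = csv.DictReader(ukb_csv)
    # Field 22009 has one array entry per PC: 22009-0.1 ... 22009-0.40 (assumed naming).
    pc_fields = {f"PC{i}": f"22009-0.{i}" for i in range(1, n_pcs + 1)}
    wanted = {**BASE_FIELDS, **pc_fields}
    header = ["FID", "IID"] + list(wanted)
    rows = []
    for rec in reader:
        values = [rec[col] or "NA" for col in wanted.values()]  # empty cell -> NA
        rows.append([rec["eid"], rec["eid"]] + values)
    return header, rows


# Toy stand-in for the real export (made-up values).
toy_export = io.StringIO(
    "eid,31-0.0,21003-0.0,22000-0.0,54-0.0,22009-0.1,22009-0.2\n"
    "1000001,1,63,12,11010,-12.1,3.4\n"
    "1000002,0,58,-5,11009,,2.0\n"
)
cov_header, cov_rows = extract_covariates(toy_export, n_pcs=2)

print(" ".join(cov_header))
for row in cov_rows:
    print(" ".join(row))
```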

ADD REPLY • link 3.2 years ago by Sam ★ 4.8k

To clarify, I would stratify the cohort according to age, gender, ethnicity, genotype batch, etc?

To reduce confounding, how would you use the data from the multiple field IDs in a PRSice pipeline?

ADD REPLY • link 3.2 years ago by Jalil Sharif ▴ 80

You would include that information as covariates. For PRSice, that is the --cov parameter. For variables that are coded as factors (e.g. batch and centre), you should provide them through --cov-factor.
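For illustration only, the sketch below assembles a PRSice-2 command line that uses `--cov`, `--cov-col` and `--cov-factor`. Every file name, the covariate column names and the choice of six PCs are hypothetical placeholders; the script only prints the command rather than running it.

```python
import shlex


def prsice_args(base, target, pheno, cov, cov_cols, factor_cols, out):
    """Build an argument list for PRSice-2; all paths are supplied by the caller."""
    return [
        "Rscript", "PRSice.R",
        "--prsice", "PRSice_linux",
        "--base", base,
        "--target", target,
        "--pheno", pheno,
        "--cov", cov,
        "--cov-col", ",".join(cov_cols),        # covariate columns to adjust for
        "--cov-factor", ",".join(factor_cols),  # the subset treated as categorical
        "--binary-target", "T",
        "--out", out,
    ]


pcs = [f"PC{i}" for i in range(1, 7)]  # six PCs is an arbitrary example
args = prsice_args(
    base="pd_gwas.txt",           # hypothetical summary statistics file
    target="ukb_chr1",            # hypothetical target prefix
    pheno="pd.pheno",             # hypothetical phenotype file
    cov="ukb.cov",                # hypothetical covariate file
    cov_cols=["Sex", "Age", "Batch", "Centre"] + pcs,
    factor_cols=["Batch", "Centre"],
    out="pd_prs",
)
print(shlex.join(args))
```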

ADD REPLY • link 3.2 years ago by Sam ★ 4.8k

Thank you, I have the right target data and will also extract the phenotype and covariate data.

While running the first script for QCing the target data, I got the following "error":

7402791 variants loaded from .bim file.
487409 people (0 males, 0 females, 487409 ambiguous) loaded from .fam.
Ambiguous sex IDs written to /path/to/file

Am I missing something, or why is the gender ambiguous for the UK Biobank dataset? Does it have something to do with the fact that each chromosome has its own files?

The code I am running is as follows:

plink \
    --bfile ~/path/to/file/ukb_imp_chr1 \
    --maf 0.01 \
    --hwe 1e-6 \
    --geno 0.01 \
    --mind 0.01 \
    --write-snplist \
    --make-just-fam \
    --out ~/path/to/file/ukb_imp_chr1.QC

ADD REPLY • link 3.2 years ago by Jalil Sharif ▴ 80

UK Biobank does not store the sex information in the .fam file. You will need to extract it from the phenotype database.
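One possible way to do this, sketched under the assumption that field 31 was exported to a CSV with header `eid,31-0.0` and uses the coding 0 = female, 1 = male. The rows are recoded to PLINK's convention (1 = male, 2 = female, 0 = unknown); the function name and toy IDs are illustrative.

```python
import csv
import io

# UK Biobank field 31: 0 = female, 1 = male. PLINK: 1 = male, 2 = female, 0 = unknown.
UKB_TO_PLINK_SEX = {"0": "2", "1": "1"}


def sex_update_rows(sex_csv, column="31-0.0"):
    """Return (FID, IID, sex) rows recoded to PLINK's sex coding."""
    reader = csv.DictReader(sex_csv)
    return [
        (rec["eid"], rec["eid"], UKB_TO_PLINK_SEX.get(rec[column], "0"))
        for rec in reader
    ]


# Toy stand-in for the exported field (made-up IDs).
toy_sex = io.StringIO("eid,31-0.0\n1000001,1\n1000002,0\n1000003,\n")
sex_rows = sex_update_rows(toy_sex)

for row in sex_rows:
    print(" ".join(row))
```

Written to a text file, rows in this FID IID sex layout are what PLINK 1.9's `--update-sex` reads by default.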

ADD REPLY • link 3.2 years ago by Sam ★ 4.8k

score 0 · Answer 3 · 2021-09-24

3.2 years ago

Jalil Sharif ▴ 80

I have access to the phenotype dataset and have extracted fields 31 (gender) and 21003 (age). I want to clarify the headers for the files: e.g. my 31.csv has the header eid and 31-0.0, and the age file has a similar header. Do I have to rename the headers, or can I just pass the files through as --cov 31.csv 21003.csv?

For example:

plink \
    --bfile ~/path/to/file/ukb_imp_chr1 \
    --maf 0.01 \
    --hwe 1e-6 \
    --geno 0.01 \
    --mind 0.01 \
    --write-snplist \
    --make-just-fam \
    --cov ~/31.csv ~/21003.csv \
    --out ~/path/to/file/ukb_imp_chr1.QC

ADD COMMENT • link 3.2 years ago by Jalil Sharif ▴ 80

You need to put them in the same file and then provide that file to PLINK. I'd suggest reading PLINK's manual.
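A small sketch of combining the per-field files into one table keyed on `eid`. The file contents, the output column names `Sex` and `Age`, and the use of `NA` for missing values are assumptions made for the example.

```python
import csv
import io


def merge_by_eid(sources):
    """sources: list of (output_name, file_object, source_column) tuples.

    Returns (header, rows) with FID and IID followed by one column per source.
    """
    merged = {}
    names = []
    for name, handle, column in sources:
        names.append(name)
        for rec in csv.DictReader(handle):
            merged.setdefault(rec["eid"], {})[name] = rec[column]
    header = ["FID", "IID"] + names
    rows = [
        [eid, eid] + [values.get(n) or "NA" for n in names]  # absent or empty -> NA
        for eid, values in merged.items()
    ]
    return header, rows


# Toy stand-ins for 31.csv and 21003.csv (made-up values).
sex_file = io.StringIO("eid,31-0.0\n1000001,1\n1000002,0\n")
age_file = io.StringIO("eid,21003-0.0\n1000001,63\n1000003,70\n")

merged_header, merged_rows = merge_by_eid(
    [("Sex", sex_file, "31-0.0"), ("Age", age_file, "21003-0.0")]
)
print(" ".join(merged_header))
for row in merged_rows:
    print(" ".join(row))
```

Note that in PLINK 1.9 the covariate file is supplied with `--covar`, which takes a single file.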

ADD REPLY • link 3.2 years ago by Sam ★ 4.8k

Thank You.

Okay, I have done so, and I have generated my phenotype file. Regarding the PCs, your tutorial mentions using 6; in this instance, would you still use 6 or all 40?

Additionally, do you need to QC for missingness in the UK Biobank .bed/.bim/.fam files? The files I have have been imputed, so there shouldn't be any missingness?

ADD REPLY • link 3.2 years ago by Jalil Sharif ▴ 80

It all depends on your hypothesis. Sometimes we adjust for 6, sometimes 15, sometimes 40. Determining the best number of PCs to use is part of the analysis.

If it is imputed, then the data should be in .bgen format. Even then, you should still filter by info score, which indicates the quality of imputation.
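A hedged sketch of an info-score filter. It assumes a tab-separated, header-less per-chromosome variant information file with the rsID in the second column and the info score in the eighth; that layout, the 0.8 threshold and the toy records are all assumptions to check against your own files.

```python
def variants_passing_info(mfi_lines, threshold=0.8, id_col=1, info_col=7):
    """Return variant IDs whose info score is at least `threshold`.

    Column indices are zero-based and reflect an assumed file layout.
    """
    keep = []
    for line in mfi_lines:
        parts = line.rstrip("\n").split("\t")
        if float(parts[info_col]) >= threshold:
            keep.append(parts[id_col])
    return keep


# Toy records (made-up variants and scores).
toy_mfi = [
    "1:1000_A_G\trs111\t1000\tA\tG\t0.12\tG\t0.97\n",
    "1:2000_C_T\trs222\t2000\tC\tT\t0.03\tT\t0.41\n",
    "1:3000_G_A\trs333\t3000\tG\tA\t0.25\tA\t0.80\n",
]
kept_ids = variants_passing_info(toy_mfi)
print("\n".join(kept_ids))
```

The resulting ID list, one per line, has the same shape as the file PLINK's `--extract` expects.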

ADD REPLY • link 3.2 years ago by Sam ★ 4.8k

score 0 · Answer 4 · 2021-10-06

I am now at the following step and I ran:

plink2 /path/to/file/ukb_imp_chr1 --fam /rds/general/user/js4120/home/fam_files/ukb_imp_chr.fam --extract /path/to/file/ukb_imp_chr1.QC.snplist --indep-pairwise 200 50 0.25 --out /path/to/file/ukb_imp_chr1.QC

I then ran:

plink2 /path/to/file/ukb_imp_chr1 --fam /rds/general/user/js4120/home/fam_files/ukb_imp_chr.fam --extract /path/to/file/ukb_imp_chr1.QC.snplist --set-all-var-ids @:# --new-id-max-allele-len 1000 --rm-dup retain-mismatch  --indep-pairwise 200 50 0.25 --out /path/to/file/ukb_imp_chr1.QC

In both instances I am getting the following error:

Error: --indep-pairwise requires unique variant IDs. (--set-all-var-ids and/or
--rm-dup may help.)
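The message indicates that at least two loaded variants share an ID. A minimal diagnostic sketch, assuming a standard six-column .bim file (chromosome, variant ID, cM, position, allele 1, allele 2), is to count the IDs directly. The toy records are made up; they also show why a `chr:pos` style ID can still collide when two variants sit at the same position, whereas adding the alleles to the ID separates them.

```python
from collections import Counter


def duplicated_ids(bim_lines, make_id=lambda fields: fields[1]):
    """Return {variant_id: count} for IDs that occur more than once."""
    counts = Counter(
        make_id(line.split()) for line in bim_lines if line.strip()
    )
    return {vid: n for vid, n in counts.items() if n > 1}


# Toy .bim records: two different variants at the same position.
toy_bim = [
    "1 rs111 0 1000 A G",
    "1 rs222 0 1000 A T",
    "1 rs333 0 2000 C T",
]


def chr_pos(fields):
    return f"{fields[0]}:{fields[3]}"


def chr_pos_alleles(fields):
    return f"{fields[0]}:{fields[3]}:{fields[5]}:{fields[4]}"


by_existing_id = duplicated_ids(toy_bim)
by_chr_pos = duplicated_ids(toy_bim, chr_pos)
by_chr_pos_alleles = duplicated_ids(toy_bim, chr_pos_alleles)

print(by_existing_id, by_chr_pos, by_chr_pos_alleles)
```

If the real file shows the second pattern, an ID template that includes the alleles is one option to consider; whether that resolves this particular run is not confirmed by the thread.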