Hi there,
I am trying to generate a PRS from UK Biobank plink data using PRSice-2. However, the issue I am having is that I have neither a covariate file nor a phenotype file. I would like to know how to generate them.
You should at least have a phenotype of interest for you to work on. If not, then you need to better define what you are trying to do for us to help you.
Depending on your phenotype, you will usually include the PCs, genotyping batch, assessment centre, and maybe sex and age. All of that information should come with your UK Biobank application. I am not sure how your UK Biobank data were organized, so it is rather difficult for me to give direct advice. A more general guide can be found here: https://choishingwan.gitlab.io/ukb-administration/
I have access to two different target datasets. The first has .bed, .bim and .fam files. The second has .bed, .bim, .bgen and .bgen.bgi files. The latter doesn't have a .fam file, so I can't run the QC for the second dataset. There is no covariate file. The phenotype I am trying to look at is Parkinson's disease.
If they are not already included in your UKB application, you can update your data basket to include, for example, the ICD10 codes (Data-Field 41202) and retrieve the subjects who have Parkinson's disease. Then make your phenotype file yourself.
Check out the 'notes' section on these fields: 41202 and 41270.
You will see that 41202 summarizes the main diagnosis. As Sam said, there are multiple ICD fields; you'll have to browse the portal and study which ICD field suits your research question.
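To make the "make your phenotype file yourself" step concrete, here is a rough sketch rather than a definitive recipe. It assumes your phenotype export is a CSV with an eid column and columns named like 41202-0.0, 41202-0.1 and so on (the same naming style as the 31-0.0 header mentioned later in the thread), and it uses G20, the ICD-10 code for Parkinson's disease. The toy data, the function name and the PD column name are all made up for illustration; check which case/control coding your PRSice version expects before using the output.

```python
import csv
import io

# Toy stand-in for a UK Biobank phenotype export (assumed layout: one row per
# participant, one column per array index of field 41202). IDs are made up.
toy_csv = """eid,41202-0.0,41202-0.1,41202-0.2
1000011,I10,G20,
1000022,E119,,
1000033,,,
1000044,G20,,
"""

PD_CODE = "G20"  # ICD-10 code for Parkinson's disease


def build_phenotype(handle, field="41202", code=PD_CODE):
    """Return (FID, IID, status) rows; status is 1 if any column of `field` starts with `code`."""
    reader = csv.DictReader(handle)
    icd_cols = [c for c in reader.fieldnames if c.startswith(field + "-")]
    rows = []
    for rec in reader:
        is_case = any((rec[c] or "").startswith(code) for c in icd_cols)
        rows.append((rec["eid"], rec["eid"], 1 if is_case else 0))
    return rows


pheno = build_phenotype(io.StringIO(toy_csv))

# Whitespace-delimited text with FID and IID columns; save it to a file to use it.
pheno_text = "FID IID PD\n" + "\n".join(f"{f} {i} {s}" for f, i, s in pheno)
print(pheno_text)
```

To use 41270 instead, pass field="41270"; which of the two is appropriate depends on your research question, as noted above.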
There should be a .sample file for your bgen files, which acts as the .fam file for your bgen.
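If it helps, this is a small sketch of how you could read the individual IDs out of a .sample file, for example to check that they overlap with the IDs in your phenotype file. It assumes the usual Oxford layout (a header line, a line of column types, then one line per individual); the toy content and the function name are made up.

```python
import io

# Toy .sample content: header line, column-type line, then one line per
# individual. The IDs are made up.
toy_sample = """ID_1 ID_2 missing sex
0 0 0 D
1000011 1000011 0 1
1000022 1000022 0 2
"""


def read_sample_ids(handle):
    """Return (ID_1, ID_2) pairs, skipping the header and column-type lines."""
    lines = [ln.split() for ln in handle if ln.strip()]
    header, _types, *body = lines
    i1, i2 = header.index("ID_1"), header.index("ID_2")
    return [(row[i1], row[i2]) for row in body]


ids = read_sample_ids(io.StringIO(toy_sample))
print(ids)
```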
As for covariates, they should always come with your application if you have access to the genotype data. You just need to extract them from the phenotype file. For example, the PCs are field 22009 (40 array entries, one for each PC), genotype batch is field 22000, sex is 31, age is 21003 and assessment centre is 54. There are multiple ICD fields, and you might have to search for them yourself (too lazy to type them all out).
You would include that information as covariates. For PRSice, that will be the --cov parameter. For things that are coded as factors (e.g. batch and centre), you should provide them through --cov-factor.
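As a rough illustration of pulling those fields into one covariate file, here is a sketch under some assumptions: the export is a CSV with columns named like 31-0.0, 21003-0.0, 22000-0.0, 54-0.0 and 22009-0.1 onwards, and PRSice wants a whitespace-delimited file with FID and IID columns. The toy values, the renamed headers (Sex, Age, Batch, Centre, PC1, ...) and the file names in the argument list are placeholders, and I am assuming --cov-factor takes a comma-separated list of column names; check the PRSice documentation for your version.

```python
import csv
import io

# Toy stand-in for the phenotype export; values are made up. Only two PCs are
# included here to keep the example short.
toy_csv = """eid,31-0.0,21003-0.0,22000-0.0,54-0.0,22009-0.1,22009-0.2
1000011,1,63,12,11010,-12.1,3.4
1000022,0,58,-3,11002,-11.7,4.0
"""

N_PCS = 2  # toy value; how many PCs to adjust for is an analysis decision


def build_covariates(handle, n_pcs):
    """Return (header, rows) with one merged covariate row per participant."""
    reader = csv.DictReader(handle)
    header = ["FID", "IID", "Sex", "Age", "Batch", "Centre"]
    header += [f"PC{i}" for i in range(1, n_pcs + 1)]
    rows = []
    for rec in reader:
        row = [rec["eid"], rec["eid"], rec["31-0.0"], rec["21003-0.0"],
               rec["22000-0.0"], rec["54-0.0"]]
        row += [rec[f"22009-0.{i}"] for i in range(1, n_pcs + 1)]
        rows.append(row)
    return header, rows


header, rows = build_covariates(io.StringIO(toy_csv), N_PCS)
cov_text = "\n".join(" ".join(r) for r in [header] + rows)
print(cov_text)  # save this text as your covariate file

# Covariate-related arguments one might then pass to PRSice.
# "ukb.cov" is a placeholder name for the saved covariate file.
prsice_args = ["--cov", "ukb.cov", "--cov-factor", "Batch,Centre"]
print(" ".join(prsice_args))
```

Keeping everything in a single file with renamed headers avoids having to juggle one CSV per field.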
Thank you, I have the right target data and will also extract the phenotype and covariate data.
While running the first script for QCing the target data, I got the following "error".
7402791 variants loaded from .bim file.
487409 people (0 males, 0 females, 487409 ambiguous) loaded from .fam.
Ambiguous sex IDs written to /path/to/file
Am I missing something, or why is the gender ambiguous for the UK Biobank dataset? Does it have something to do with the fact that each chromosome has its own files?
I have access to the phenotype dataset and have extracted fields 31 (gender) and 21003 (age). I want to clarify the headers for the files: my 31.csv has the header eid and 31-0.0, and the age file has a similar header. Do I have to rename the headers, or can I just pass the files through as --cov 31.csv 21003.csv?
Okay, I have done so, and I generated my phenotype file. For the PCs, your tutorial mentions using 6. In this instance, would you still use 6 or all 40?
Additionally, do you need to QC for missingness in the UK Biobank .bed/.bim/.fam files? The files I have have been imputed, so there shouldn't be any missingness.
It all depends on your hypothesis. Sometimes we adjust for 6, sometimes 15, sometimes 40. Determining the best number of PCs to use is part of your analysis.
If it is imputed, then the data should be in .bgen format. Even then, you should still filter by info score, which indicates the quality of imputation.
Is there any difference between data-field 41202 and 41270?
To clarify, I would stratify the cohort according to age, gender, ethnicity, genotype batch, etc?
To reduce confounding, how would you use the data from the multiple field ID in a PRSice pipeline?
The code I am running is as follows:
UK Biobank does not store the sex information in the .fam file. You will need to extract it from the phenotype database.
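As a sketch of what that extraction could look like: UK Biobank field 31 codes female as 0 and male as 1, while plink codes male as 1, female as 2 and unknown as 0, so the values need recoding. The snippet below assumes a CSV with eid and 31-0.0 columns (as described earlier in the thread) and produces FID/IID/sex lines of the kind plink's --update-sex option reads. The toy IDs and the function name are made up.

```python
import csv
import io

# Toy stand-in for a CSV containing field 31; IDs are made up.
toy_csv = """eid,31-0.0
1000011,1
1000022,0
1000033,
"""

# UK Biobank field 31: 0 = female, 1 = male.
# plink sex codes:     1 = male,   2 = female, 0 = unknown.
UKB_TO_PLINK_SEX = {"1": "1", "0": "2"}


def sex_update_rows(handle):
    """Return (FID, IID, plink_sex) rows; missing or unexpected values become 0."""
    reader = csv.DictReader(handle)
    return [
        (rec["eid"], rec["eid"], UKB_TO_PLINK_SEX.get(rec["31-0.0"] or "", "0"))
        for rec in reader
    ]


sex_rows = sex_update_rows(io.StringIO(toy_csv))
for row in sex_rows:
    print(" ".join(row))  # save these lines to a file for --update-sex
```

Alternatively, if you only need sex as a covariate for the PRS, you can leave the .fam file as it is and supply sex through the covariate file.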