Question

What software to use to assess phenotypic information from the UK Biobank dataset?

5.7 years ago

anamaria ▴ 220

Hello,

I got my data from UK Biobank for 502536 subjects. I would like to determine which subjects have diabetes-related complications, in order to define cases and controls and perform a GWAS on that data.

Right now I can load my data in R:

library(ukbtools)
my_ukb_data <- ukb_df("ukb31212")

and to find ICD10 code names I can use this:

ukb_icd_keyword("diabetes", icd.version = 10)

and I get about 20 codes listed with their explanations. Then, for example, for code E13:

> ukb_icd_prevalence(my_ukb_data, icd.version = 10, icd.diagnosis = "E13")
Error in ukb_icd_prevalence(my_ukb_data, icd.version = 10, icd.diagnosis = "E13") : 
  unused argument (icd.diagnosis = "E13")

Is this an issue with ukbtools, or are there no subjects in my dataset associated with E13? Can you recommend any other software for exploring/assessing diabetic complications in UK Biobank data?

Thanks

ADD COMMENT • link updated 5.7 years ago by ken.hanscombe ▴ 10 • written 5.7 years ago by anamaria ▴ 220

The UKB supplied programs (in particular ukbconv, https://biobank.ctsu.ox.ac.uk/crystal/download.cgi) allow you to decrypt and convert the data to any format you prefer. You are free to use R, Python, STATA or whatever statistical software you are most comfortable with to analyse the data.

I wrote the R package ukbtools https://kenhanscombe.github.io/ukbtools/index.html to remove the upfront data wrangling required to marry the separate pieces of data into a single dataframe and begin analysis. It includes functionality to query disease diagnoses and demographics. It is fully documented here https://kenhanscombe.github.io/ukbtools/reference/index.html

ADD REPLY • link 5.7 years ago by ken.hanscombe ▴ 10

score 2 · Answer 1 · 2019-07-09

5.7 years ago

Kevin Blighe 89k

Hey, you are not using the function correctly. Please review the correct syntax, here: https://www.rdocumentation.org/packages/ukbtools/versions/0.11.3/topics/ukb_icd_prevalence

Kevin

ADD COMMENT • link 5.7 years ago by Kevin Blighe 89k

Hi Kevin,

Thanks, I will try it that way. I was following these instructions: https://cran.r-project.org/web/packages/ukbtools/vignettes/explore-ukb-data.html

BTW, do you know how I would assess which phenotypes in my data are related to the E13 ICD10 code? What would be the command for that?

ADD REPLY • link 5.7 years ago by anamaria ▴ 220

Also, this doesn't give me anything, and the same goes for a few other diabetes codes I tried:

> ukb_icd_prevalence(my_ukb_data, icd.code = "P70")
[1] NaN
> ukb_icd_prevalence(my_ukb_data, icd.code = "P70.2")
[1] NaN
> ukb_icd_prevalence(my_ukb_data, icd.code = "E13")
[1] NaN
> dim(my_ukb_data)
[1] 502536 131

ADD REPLY • link 5.7 years ago by anamaria ▴ 220

Hi Kevin, do you know how to download the ukbxxx.enc file mentioned by ukbtools?

ukb_unpack ukbxxxx.enc key
ukb_conv ukbxxxx.enc_ukb r
ukb_conv ukbxxxx.enc_ukb docs

ADD REPLY • link 4.4 years ago by Shicheng Guo ★ 9.6k

score 1 · Answer 2 · 2019-07-10

5.7 years ago

ken.hanscombe ▴ 10

All users of ukbtools benefit if there is a track record of issues raised. I raised an issue for you with one of your first email requests and it is still open https://github.com/kenhanscombe/ukbtools/issues/20

ukb_icd_keyword("diabetes", icd.version = 10) is working exactly as described in the documentation https://kenhanscombe.github.io/ukbtools/reference/index.html. It returns all ICD descriptions including the search term supplied.

NB. ukb_icd_keyword and ukb_icd_code_meaning query ICD tables supplied as datasets (icd10chapters, icd10codes, icd9chapters, icd9codes) with the package, and described in the documentation https://kenhanscombe.github.io/ukbtools/reference/index.html#section-datasets

A lot of your subsequent issues look like typos and/or incorrect use of the functionality.

ukb_icd_prevalence has no argument icd.diagnosis (which is what the generic R error is telling you). You need to read the documentation more carefully.
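The "unused argument" message can be reproduced with any R function. Below is a minimal sketch using a hypothetical stand-in function (prevalence_stub, which is not part of ukbtools); its parameter names are assumptions that simply mirror the calls quoted in this thread. formals() lists the argument names a function accepts.

```r
# Hypothetical stand-in; NOT the ukbtools implementation.
prevalence_stub <- function(data, icd.code, icd.version = 10) {
  invisible(NULL)
}

# Argument names the function accepts
accepted <- names(formals(prevalence_stub))
print(accepted)

# Passing a name that is not in that list triggers R's generic error
msg <- tryCatch(
  prevalence_stub(data.frame(), icd.diagnosis = "E13"),
  error = function(e) conditionMessage(e)
)
print(msg)
```

With ukbtools loaded, the same check would be names(formals(ukb_icd_prevalence)).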

icd.code = "P70" and icd.code = "E13" work fine for me. icd.code = "P70.2" is not valid: no ICD codes in UKB data include a decimal point. Look at the data. Try icd.code = "P702".

What error are you getting exactly (for the valid codes)? Are you sure you have hospital episode statistics data ("diagnoses") in your UKB data?
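One possible reason for the NaN results, offered as an assumption rather than a diagnosis: if the data frame contains no diagnosis columns, a proportion computed over them has nothing to average. The base R sketch below checks a toy data frame for columns whose names contain "diagnoses". The search term and the idea that missing columns cause the NaN are assumptions; the three columns copy the naming style shown elsewhere in this thread.

```r
# Toy data frame; stands in for the real 502536 x 131 data.
toy <- data.frame(
  eid = c(1000017, 1000025, 1000038),
  sex_f31_0_0 = c("Female", "Female", "Male"),
  year_of_birth_f34_0_0 = c(1938, 1951, 1961)
)

# Find columns whose names mention "diagnoses" (assumed naming convention)
diag_cols <- grep("diagnoses", names(toy), value = TRUE)
has_diagnoses <- length(diag_cols) > 0
print(has_diagnoses)

# A proportion over zero observations is NaN in R
empty_hits <- logical(0)
print(mean(empty_hits))
```

Running the same grep() on names(my_ukb_data) would show whether any such columns were included in the download.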

NB. The argument to icd.code is a regular expression (as described in the documentation). To understand which codes you're requesting the frequency of, you can do a regex search on the supplied icd10codes dataset, e.g., filter(icd10codes, str_detect(code, "E13")). If you want the prevalence of a specific code, e.g. E13.2 With renal complications, it is safest to use an anchored pattern without the decimal point, icd.code = "^E132$".
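To make the regular-expression behaviour concrete, here is a small base R sketch on a hand-typed vector of codes written without decimal points. The vector is illustrative only, not an extract of icd10codes, and grepl() stands in for the str_detect() call.

```r
# Illustrative codes (no decimal points); not taken from the package data.
codes <- c("E13", "E130", "E132", "E139", "P70", "P702", "H360")

# Unanchored pattern: every code containing "E13"
e13_all <- codes[grepl("E13", codes)]

# Anchored pattern: exactly one code
e132_only <- codes[grepl("^E132$", codes)]

# In a regex, "." matches any single character, so this pattern
# needs some character between "P70" and "2"
p70_dot <- codes[grepl("P70.2", codes)]

print(e13_all)
print(e132_only)
print(p70_dot)
```

The unanchored search returns the parent code and all of its subcodes, while the dotted pattern finds nothing in codes stored without a decimal point.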

ADD COMMENT • link 5.7 years ago by ken.hanscombe ▴ 10

Hi Ken,

thank you for those clarifications!

What I am trying to use your code for is this: to relate my selected ICD10 codes (say H360) to the phenotypes and measurements in my data file.

For example, how would I identify/extract the 4235 individuals mentioned on the page below and define them as my cases?

http://biobank.ctsu.ox.ac.uk/showcase/field.cgi?id=6148

Also I tried using:

> filter(icd10codes, str_detect(code, "E13"))
Error in stri_detect_regex(string, pattern, opts_regex = opts(pattern)) : 
  object 'code' not found
In addition: Warning messages:
1: In data.matrix(data) : NAs introduced by coercion
2: In data.matrix(data) : NAs introduced by coercion

What should be given there instead of "code"?
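In that call, code is meant as a column name of the icd10codes table, and filter and str_detect are the dplyr and stringr functions. A guess, not a confirmed diagnosis: if dplyr is not attached, filter resolves to stats::filter, which would be consistent with both the coercion warnings and the "object 'code' not found" error. The base R sketch below performs the same kind of lookup on a hypothetical stand-in table; the rows and the meaning column are assumptions for illustration.

```r
# Hypothetical stand-in for a code lookup table; rows are illustrative.
icd_lookup <- data.frame(
  code = c("E132", "E139", "H360", "P702"),
  meaning = c("label for E132", "label for E139",
              "label for H360", "label for P702"),
  stringsAsFactors = FALSE
)

# Base R counterpart of filter(tbl, str_detect(code, "E13")):
# `code` is evaluated as a column of icd_lookup
e13_rows <- subset(icd_lookup, grepl("E13", code))
print(e13_rows)

# Same selection with explicit column indexing
e13_rows2 <- icd_lookup[grepl("E13", icd_lookup$code), ]
```

With dplyr and stringr installed, writing dplyr::filter(icd10codes, stringr::str_detect(code, "E13")) avoids the name clash.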

files which I have available are these:

> list.files()
 [1] "archive.tar.gz"                "encoding.ukb"                 
 [3] "fields.ukb"                    "HESDataDic.xlsx"              
 [5] "HESTables.xlsx"                "HospitalEpisodeStatistics.pdf"
 [7] "k44316.key"                    "ukb31212.csv"                 
 [9] "ukb31212.enc"                  "ukb31212.enc_ukb"             
[11] "ukb31212.html"                 "ukb31212.log"                 
[13] "ukb31212.r"                    "ukb31212.tab"                 
[15] "ukbconv"                       "ukbgene"                      
[17] "ukbmd5"                        "ukbunpack"

I downloaded HES file from here: https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=2000

Please let me know whether those are the HES files you are referring to.

Thank you for your help! Ana

ADD REPLY • link 5.7 years ago by anamaria ▴ 220

Also, is there a workaround for this error that I am getting:

> my_ukb_data[1:3,1:3]
      eid sex_f31_0_0 year_of_birth_f34_0_0
1 1000017      Female                  1938
2 1000025      Female                  1951
3 1000038        Male                  1961

> ukb_icd_diagnosis(my_ukb_data, id = "1000017", icd.version = 10)
Error: Column 1 must be named.
Use .name_repair to specify repair.
Call `rlang::last_error()` to see a backtrace

ADD REPLY • link 5.7 years ago by anamaria ▴ 220

Also, this command always gives me NaN, and I tried multiple codes:

> ukb_icd_prevalence(my_ukb_data, icd.code = "H360")
[1] NaN

ADD REPLY • link 5.7 years ago by anamaria ▴ 220

Hi Ken,

Can you please explain what you meant by:

filter(icd10codes, str_detect(code, "E13"))

what is code in this example?

What I want to do is to extract from my dataset cases which comply with these 2 definitions:

Data-Field 41270: E10.3,E11.3,E14.3,H360 + Data-Field 6148: Diabetes related eye disease (in questionnaire they answered Yes)

How do I do that using your code?

Thanks Ana

ADD REPLY • link 5.7 years ago by anamaria ▴ 220
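A minimal base R sketch of the two-part case definition asked about above, run on toy data. Everything here is an assumption for illustration: the column names (icd10_f41270_0_0, eye_problems_f6148_0_0), the toy rows, the codes written without decimal points, and reading "+" as "both conditions must hold" (swap & for | if either one should suffice). Real exports typically spread each field over several array columns, so the single-column check would need to be repeated across them.

```r
# Toy data; column names and values are hypothetical.
toy <- data.frame(
  eid = c(1, 2, 3, 4),
  icd10_f41270_0_0 = c("E113", "H360", "I10", NA),
  eye_problems_f6148_0_0 = c("Diabetes related eye disease", "Cataract",
                             "Diabetes related eye disease", NA),
  stringsAsFactors = FALSE
)

target_codes <- c("E103", "E113", "E143", "H360")

# %in% returns FALSE (not NA) for missing values
has_code <- toy$icd10_f41270_0_0 %in% target_codes
self_report <- toy$eye_problems_f6148_0_0 %in% "Diabetes related eye disease"

# Case = matching diagnosis code AND matching questionnaire answer
toy$case <- has_code & self_report
case_ids <- toy$eid[toy$case]
print(case_ids)
```

The resulting logical column can then serve as the case/control indicator, once the control group has been defined separately.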