A set of tools to create a UK Biobank <http://www.ukbiobank.ac.uk/> dataset from a UKB fileset (.tab, .r, .html), visualize primary demographic data for a sample subset, query ICD diagnoses, retrieve genetic metadata, and read and write standard file formats for genetic analyses.
After downloading and decrypting your UK Biobank (UKB) data with the supplied UKB programs, you have multiple files that need to be brought together to give you a dataset to explore. The data file has column names that are edited field-codes from the UKB data showcase.
ukbtools makes it easy to collapse the multiple UKB files into a single dataset for analysis, in the process giving meaningful names to the variables. The package also includes functionality to retrieve ICD diagnoses, explore a sample subset in the context of the UKB sample, and collect genetic metadata.
# Install from CRAN
install.packages("ukbtools")
# Install latest development version
devtools::install_github("kenhanscombe/ukbtools", build_vignettes = TRUE, dependencies = TRUE)
Note: This package is in beta. It is feature complete but may contain unknown bugs. If anything does not work, first re-install the package with devtools::install_github("kenhanscombe/ukbtools", build_vignettes = TRUE, dependencies = TRUE, force = TRUE) to get the latest development version. If it is still not working, let me know and I'll fix it.
Download§ then decrypt your data and create a "UKB fileset" (.tab, .r, .html):
ukb_unpack ukbxxxx.enc key
ukb_conv ukbxxxx.enc_ukb r
ukb_conv ukbxxxx.enc_ukb docs
ukb_unpack decrypts your downloaded ukbxxxx.enc file, outputting a ukbxxxx.enc_ukb file. ukb_conv with the r flag converts the decrypted data to a tab-delimited file ukbxxxx.tab and an R script ukbxxxx.r that reads the tab file. The docs flag creates an html file containing a field-code-to-description table (among others).
§ Full details of the data download and decrypt process are given in the Using UK Biobank Data documentation.
ukb_df() takes two arguments, the stem of your fileset and the path, and returns a dataframe with usable column names. This will take a few minutes. The rate-limiting step is reading and parsing the code in the UKB-generated .r file, not ukb_df per se.
library(ukbtools)
my_ukb_data <- ukb_df("ukbxxxx")
You can also specify the path to your fileset if it is not in the current directory. For example, if your fileset is in a subdirectory of the working directory called data:
my_ukb_data <- ukb_df("ukbxxxx", path = "/full/path/to/my/data/")
Note: You can move the three files in your fileset after creating them with ukb_conv, but they should be kept together.
ukb_df() automatically updates the read call in the R source file to point to the correct directory (the current directory by default, or a directory specified by the path argument).
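Because the three files must stay together, it can be useful to confirm the fileset is complete before calling ukb_df(). The following is a minimal base R sketch under that assumption; check_ukb_fileset is a hypothetical helper written for illustration and is not part of ukbtools.

```r
# Hypothetical helper (not part of ukbtools): report which files of a
# UKB fileset are present for a given stem and directory.
check_ukb_fileset <- function(stem, path = ".") {
  extensions <- c("tab", "r", "html")
  files <- file.path(path, paste(stem, extensions, sep = "."))
  present <- file.exists(files)
  names(present) <- basename(files)
  present
}

# Demonstration with a temporary directory and empty placeholder files;
# the .html file is deliberately left out.
demo_dir <- tempfile("ukb_fileset_")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("ukbxxxx.tab", "ukbxxxx.r")))

status <- check_ukb_fileset("ukbxxxx", path = demo_dir)
print(status)

if (!all(status)) {
  message("Missing from fileset: ",
          paste(names(status)[!status], collapse = ", "))
}
```

With a real fileset, the same call would take your own stem and directory, for example check_ukb_fileset("ukbxxxx", path = "/full/path/to/my/data/").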
Other tools in the package are described in the vignette "Explore UK Biobank Data":
vignette("explore-ukb-data", package = "ukbtools")
For a list of all functions:
help(package = "ukbtools")
ukb_icd_freq_by: corrected order by levels of reference.var in the optional plot. (Order in the default dataframe returned was correct.)
ukb_df: corrected the tab file path update in the R source file. Specifically, made the regular expression more specific (one case was reported of the regular expression matching a word elsewhere in the source file). Also, replaced utils::read.delim with readr::read_tsv for a faster read, with a progress bar.
ukb_icd_freq_by returns frequency for one or more ICD diagnoses by levels of a reference variable and includes an optional plot
ukb_df_full_join (a thin wrapper around dplyr::full_join) recursively called on a list of UKB datasets
ukb_df_duplicated_names to identify duplicated names within a dataset. The variable prefix (constructed from its description), index, and array should make the column name unique. However, typos in UKB documentation that give two variables the same description without incrementing the index/array have been observed. You will want to identify these and update them appropriately. We expect such duplicates to be rare.
ukb_icd_diagnosis now takes one or more individual ids and returns a dataframe with a potential message noting ids with no diagnoses
ukb_icd_keyword accepts a character vector of one or more "keywords" and returns all ICD descriptions including any of the keywords
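The ideas behind ukb_df_full_join and ukb_df_duplicated_names in the list above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the toy data frames, the shared id column eid, the column name bmi_0_0, and the helpers join_all and duplicated_names are all hypothetical, and base merge(all = TRUE) stands in for dplyr::full_join so the sketch needs no extra packages.

```r
# Toy stand-ins for UKB datasets; "eid" as the shared participant id
# column is an assumption of this sketch.
ukb_a <- data.frame(eid = c(1, 2, 3), sex = c("F", "M", "F"))
ukb_b <- data.frame(eid = c(2, 3, 4), bmi = c(24.1, 30.5, 27.8))
ukb_c <- data.frame(eid = c(1, 4), smoker = c(TRUE, FALSE))

# Fold a full (outer) join over the list of datasets.
# merge(all = TRUE) keeps every id that appears in any dataset.
join_all <- function(datasets, by = "eid") {
  Reduce(function(x, y) merge(x, y, by = by, all = TRUE), datasets)
}

combined <- join_all(list(ukb_a, ukb_b, ukb_c))
print(combined)

# Hypothetical helper: column names that occur more than once.
duplicated_names <- function(df) {
  nms <- names(df)
  unique(nms[duplicated(nms)])
}

# check.names = FALSE keeps the repeated name instead of renaming it.
with_dup <- data.frame(eid = 1:2,
                       bmi_0_0 = c(24.1, 30.5),
                       bmi_0_0 = c(24.3, 30.1),
                       check.names = FALSE)
print(duplicated_names(with_dup))
```

Ids missing from one dataset get NA in that dataset's columns, which is the behaviour a full join is meant to provide. Checking for duplicated names after combining datasets is a cheap way to catch columns that would otherwise be ambiguous in later analyses.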