A set of tools to create a UK Biobank <http://www.ukbiobank.ac.uk/> dataset from a UKB fileset (.tab, .r, .html), visualize primary demographic data for a sample subset, query ICD diagnoses, retrieve genetic metadata, and read and write standard file formats for genetic analyses.
After downloading and decrypting your UK Biobank (UKB) data with the supplied UKB programs, you have multiple files that need to be brought together to give you a dataset to explore. The data file has column names that are edited field-codes from the UKB data showcase. ukbtools makes it easy to collapse the multiple UKB files into a single dataset for analysis, in the process giving meaningful names to the variables. The package also includes functionality to retrieve ICD diagnoses, explore a sample subset in the context of the UKB sample, and collect genetic metadata.
# Install from CRAN
install.packages("ukbtools")

# Install latest development version
devtools::install_github("kenhanscombe/ukbtools", dependencies = TRUE)
Download§ then decrypt your data and create a "UKB fileset" (.tab, .r, .html):
ukb_unpack ukbxxxx.enc key
ukb_conv ukbxxxx.enc_ukb r
ukb_conv ukbxxxx.enc_ukb docs
ukb_unpack decrypts your downloaded ukbxxxx.enc file, outputting a ukbxxxx.enc_ukb file. ukb_conv with the r flag converts the decrypted data to a tab-delimited file ukbxxxx.tab and an R script ukbxxxx.r that reads the tab file. The docs flag creates an html file containing a field-code-to-description table (among others).
§ Full details of the data download and decrypt process are given in the Using UK Biobank Data documentation.
ukb_df() takes two arguments, the stem of your fileset and the path, and returns a dataframe with usable column names. This will take a few minutes. The rate-limiting step is reading and parsing the code in the UKB-generated .r file, not ukb_df per se.
library(ukbtools)
my_ukb_data <- ukb_df("ukbxxxx")
You can also specify the path to your fileset if it is not in the current directory. For example, if your fileset is in a directory called data:
my_ukb_data <- ukb_df("ukbxxxx", path = "/full/path/to/my/data")
Note: You can move the three files in your fileset after creating them with ukb_conv, but they should be kept together. ukb_df() automatically updates the read call in the R source file to point to the correct directory (the current directory by default, or a directory specified by the path argument).
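Because the three files need to stay together, it can be useful to confirm they are all present before calling ukb_df(). The following is a minimal base-R sketch; check_ukb_fileset is a hypothetical helper written for illustration and is not part of ukbtools.

```r
# Hypothetical helper (not part of ukbtools): report which of the three
# fileset files (.tab, .r, .html) exist in a directory.
check_ukb_fileset <- function(fileset, path = ".") {
  files <- file.path(path, paste0(fileset, c(".tab", ".r", ".html")))
  present <- file.exists(files)
  names(present) <- basename(files)
  present
}

# Possible usage: only read the data if the fileset is complete.
# status <- check_ukb_fileset("ukbxxxx", path = "/full/path/to/my/data")
# if (all(status)) {
#   my_ukb_data <- ukbtools::ukb_df("ukbxxxx", path = "/full/path/to/my/data")
# }
```

The helper returns a named logical vector, so any missing file can be identified by name.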
All tools are described on the ukbtools webpage and in the package vignette "Explore UK Biobank Data":
vignette("explore-ukb-data", package = "ukbtools")
For a list of all functions:
help(package = "ukbtools")
Added example UKB data ukbXXXX.tab, ukbXXXX.r, ukbXXXX.html to test the 'read' and 'summarise' functionality. See the section "An example fileset" in the vignette for details.
freq.plot = TRUE plots a barplot for categorical reference variables, and plots diagnosis frequencies at the midpoint of each group for quantitative reference variables.
The ukbtools webpage has been rebuilt with pkgdown and includes the vignette under the Articles tab.
data.table::fread for faster read. Also includes an n_threads argument passed to data.table::fread, which may make the read faster. Column names now include the field code to ensure names are unique (UK Biobank sometimes uses the same description for more than one variable).
ukb_defunct explains why these have become defunct and where to get UK Biobank genetic (meta)data.
ukb_gen_sqc_names supplies column names for the separately downloaded sample QC file;
ukb_gen_rel_count does the same as before (a count of levels of relatedness or a plot) but with separately downloaded relatedness data;
ukb_gen_related_with_data returns the subset of relatedness data in which both IDs have data on a phenotype of interest;
ukb_gen_samples_to_remove returns a list of individuals to exclude in order to remove relatedness (one possible solution to a maximal subset problem).
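To give a feel for the "maximal subset" problem mentioned above, here is a small base-R sketch of one common greedy heuristic: repeatedly drop the individual who appears in the most related pairs until no pairs remain. This is an illustration under stated assumptions, not the algorithm ukb_gen_samples_to_remove actually uses; the function name greedy_remove_related, the column names ID1 and ID2, and the toy data are all made up for the example.

```r
# Illustrative greedy heuristic (an assumption, NOT the ukbtools implementation).
# pairs: data frame with character columns ID1 and ID2, one row per related pair.
greedy_remove_related <- function(pairs) {
  removed <- character(0)
  while (nrow(pairs) > 0) {
    # Count how many remaining pairs each individual is involved in
    counts <- table(c(pairs$ID1, pairs$ID2))
    worst <- names(counts)[which.max(counts)]
    removed <- c(removed, worst)
    # Drop every pair that involved the removed individual
    pairs <- pairs[pairs$ID1 != worst & pairs$ID2 != worst, , drop = FALSE]
  }
  removed
}

# Toy relatedness data: A-B, A-C and C-D are related pairs
toy_pairs <- data.frame(ID1 = c("A", "A", "C"),
                        ID2 = c("B", "C", "D"),
                        stringsAsFactors = FALSE)
to_remove <- greedy_remove_related(toy_pairs)
```

A greedy pass like this is not guaranteed to retain the largest possible unrelated subset, which is why the result is best described as one possible solution.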
ukb_icd_freq_by: corrected order by levels of reference.var in the optional plot. (Order in the default dataframe returned was correct.)
ukb_df: corrected the tab file path update in the R source file. Specifically, made the regular expression more specific (one case was reported of the regular expression matching a word elsewhere in the source file). Also, replaced utils::read.delim with readr::read_tsv for faster read, with progress bar.
ukb_icd_freq_by returns frequency for one or more ICD diagnoses by levels of a reference variable and includes an optional plot
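The idea of a diagnosis frequency by levels of a reference variable can be sketched in base R as follows. This is only a conceptual illustration with made-up data; the column names sex and has_dx are hypothetical, and the code does not reproduce the interface or output format of ukb_icd_freq_by.

```r
# Toy data (hypothetical): a categorical reference variable and a logical
# flag indicating whether each individual has the diagnosis of interest.
toy <- data.frame(
  sex    = c("F", "F", "M", "M", "M"),
  has_dx = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Proportion of individuals with the diagnosis within each level of sex
dx_freq <- tapply(toy$has_dx, toy$sex, mean)
```

For a quantitative reference variable, the values could first be binned (for example with cut()) and the same calculation applied per bin.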
ukb_df_full_join (a thin wrapper around dplyr::full_join) recursively called on a list of UKB datasets
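The general pattern of applying a full join across a list of datasets can be sketched in base R, where merge(all = TRUE) plays the role of a full join and Reduce folds it over the list. This is an analogy rather than the package's code: full_join_all is a hypothetical name, the key column "eid" is an assumption, and the toy data frames are invented.

```r
# Sketch of a full join folded over a list of data frames (base R only).
# merge(..., all = TRUE) keeps rows from both sides, like a full join.
full_join_all <- function(dfs, by = "eid") {
  Reduce(function(x, y) merge(x, y, by = by, all = TRUE), dfs)
}

# Three toy datasets sharing an id column
a  <- data.frame(eid = c(1, 2), sex = c("F", "M"))
b  <- data.frame(eid = c(2, 3), age = c(54, 61))
c3 <- data.frame(eid = c(1, 3), bmi = c(22.5, 27.1))

joined <- full_join_all(list(a, b, c3))
```

Individuals missing from one of the inputs are kept, with NA in the columns that dataset would have contributed.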
ukb_df_duplicated_names to identify duplicated names within a dataset. The variable prefix (constructed from its description), index, and array should make the column name unique. However, typos in UKB documentation that give two variables the same description but do not increment the index/array have been observed. You will want to identify these and update them appropriately. We expect the occurrence of such duplicates will be rare.
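Checking a dataset for duplicated column names can also be done by hand. The snippet below is a minimal base-R sketch; find_duplicated_names is a hypothetical helper, and the column names are invented for the example.

```r
# Hypothetical helper: return each column name that occurs more than once.
find_duplicated_names <- function(df) {
  nms <- names(df)
  unique(nms[duplicated(nms)])
}

# Toy data frame with a deliberately duplicated column name
toy_df <- data.frame(1:2, 3:4, 5:6)
names(toy_df) <- c("eid", "height_0_0", "height_0_0")

dups <- find_duplicated_names(toy_df)
```

Once found, duplicates could be renamed individually, or disambiguated in bulk with make.unique(names(toy_df)).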
ukb_icd_diagnosis now takes one or more individual ids and returns a dataframe with a potential message noting ids with no diagnoses
ukb_icd_keyword accepts a character vector of one or more "keywords" and returns all ICD descriptions including any of the keywords
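Matching descriptions against any of several keywords can be illustrated with a short base-R sketch. It assumes nothing about how ukb_icd_keyword is implemented; icd_keyword_search is a hypothetical function and the descriptions are a made-up toy vector rather than the ICD tables shipped with the package.

```r
# Hypothetical sketch: keep descriptions that contain any of the keywords.
# Keywords are combined into one regular expression with "|" (alternation),
# so regex metacharacters in a keyword are interpreted as such.
icd_keyword_search <- function(descriptions, keywords, ignore.case = TRUE) {
  pattern <- paste(keywords, collapse = "|")
  descriptions[grepl(pattern, descriptions, ignore.case = ignore.case)]
}

toy_icd <- c("Type 2 diabetes mellitus",
             "Essential (primary) hypertension",
             "Asthma, unspecified")

hits <- icd_keyword_search(toy_icd, c("diabetes", "asthma"))
```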