A Suite of Packages for Analysis of Big Genomic Data

An umbrella package providing a phenotype/genotype data structure and scalable and efficient computational methods for large genomic datasets in combination with several other packages: 'BEDMatrix', 'LinkedMatrix', and 'symDMatrix'.


CRAN_Status_Badge Rdoc Travis-CI Build Status AppVeyor Build Status Coverage status

BGData is an R package that provides scalable and efficient computational methods for large genomic datasets, e.g., genome-wide association studies (GWAS) or genomic relationship matrices (G matrices). It also contains a data structure called BGData that holds genotypes in the @geno slot, phenotypes in the @pheno slot, and additional information in the @map slot.

Modern genomic datasets are big (large n), high-dimensional (large p), and multi-layered. The challenges that need to be addressed are memory requirements and computational demands. Our goal is to develop software that will enable researchers to carry out analyses with big genomic data within the R environment.

We have identified several approaches to tackle those challenges within R:

  • File-backed matrices: The data is stored in on the hard drive and users can read in smaller chunks when they are needed.
  • Linked arrays: For very large datasets a single file-backed array may not be enough or convenient. A linked array is an array whose content is distributed over multiple file-backed nodes.
  • Multiple dispatch: Methods are presented to users so that they can treat these arrays pretty much as if they were RAM arrays.
  • Multi-level parallelism: Exploit multi-core and multi-node computing.
  • Inputs: Users can create these arrays from standard formats (e.g., PLINK .bed).

The BGData package is an umbrella package that comprises several packages: BEDMatrix, LinkedMatrix, and symDMatrix.

Examples

Load the BGData package:

library(BGData)

Inspecting the example dataset

The inst/extdata folder contains example files that were generated from the 250k SNP and phenotype data in Atwell et al. (2010). Only the first 300 SNPs of chromosome 1, 2, and 3 were included to keep the size of the example dataset small enough for CRAN. PLINK was used to convert the data to .bed and .raw files. FT10 has been chosen as a phenotype and is provided as an alternate phenotype file. The file is intentionally shuffled to demonstrate that the additional phenotypes are put in the same order as the rest of the phenotypes.

path <- system.file("extdata", package = "BGData")
list.files(path)
#>  [1] "chr1.bed"  "chr1.bim"  "chr1.fam"  "chr1.raw"  "chr2.bed"  "chr2.bim"
#>  [7] "chr2.fam"  "chr2.raw"  "chr3.bed"  "chr3.bim"  "chr3.fam"  "chr3.raw"
#> [13] "pheno.txt"

Loading example dataset

Loading individual PLINK .bed files

Load the .bed file for chromosome 1 (chr1.bed) using the BEDMatrix package:

chr1 <- BEDMatrix(paste0(path, "/chr1.bed"))
#> Extracting number of individuals and rownames from .fam file...
#> Extracting number of markers and colnames from .bim file...

BEDMatrix objects behave similarly to regular matrices:

dim(chr1)
#> [1] 199 300
rownames(chr1)[1:10]
#> [1] "5837_5837" "6008_6008" "6009_6009" "6016_6016" "6040_6040" "6042_6042"
#> [7] "6043_6043" "6046_6046" "6064_6064" "6074_6074"
colnames(chr1)[1:10]
#> [1] "snp1_T"  "snp2_G"  "snp3_A"  "snp4_T"  "snp5_G"  "snp6_T"  "snp7_C"
#> [8] "snp8_C"  "snp9_C"  "snp10_G"
chr1["6008_6008", "snp5_G"]
#> [1] 0

Linking multiple BEDMatrix objects together

Load the other two .bed files:

chr2 <- BEDMatrix(paste0(path, "/chr2.bed"))
#> Extracting number of individuals and rownames from .fam file...
#> Extracting number of markers and colnames from .bim file...
chr3 <- BEDMatrix(paste0(path, "/chr3.bed"))
#> Extracting number of individuals and rownames from .fam file...
#> Extracting number of markers and colnames from .bim file...

Combine the BEDMatrix objects by columns using the LinkedMatrix to avoid the inconvenience of having three separate matrices:

wg <- ColumnLinkedMatrix(chr1, chr2, chr3)

Just like BEDMatrix objects, LinkedMatrix objects also behave similarly to regular matrices:

dim(wg)
#> [1] 199 900
rownames(wg)[1:10]
#> [1] "5837_5837" "6008_6008" "6009_6009" "6016_6016" "6040_6040" "6042_6042"
#> [7] "6043_6043" "6046_6046" "6064_6064" "6074_6074"
colnames(wg)[1:10]
#> [1] "snp1_T"  "snp2_G"  "snp3_A"  "snp4_T"  "snp5_G"  "snp6_T"  "snp7_C"
#> [8] "snp8_C"  "snp9_C"  "snp10_G"
wg["6008_6008", "snp5_G"]
#> [1] 0

Creating a BGData object

BGData objects can be created from individual BEDMatrix objects or a collection of BEDMatrix objects as a LinkedMatrix object using the as.BGData() function. This will read the .fam and .bim file that comes with the .bed files. The alternatePhenotypeFile parameter points to the file that contains the FT10 phenotype:

bg <- as.BGData(wg, alternatePhenotypeFile = paste0(path, "/pheno.txt"))
#> Extracting phenotypes from .fam file, assuming that the .fam file of the first BEDMatrix instance is representative of all the other nodes...
#> Extracting map from .bim files...
#> Merging alternate phenotype file...

The bg object will have the LinkedMatrix object in the @geno slot, the .fam file augmented by the FT10 phenotype in the @pheno slot, and the .bim file in the @map slot.

str(bg)
#> Formal class 'BGData' [package "BGData"] with 3 slots
#>   [email protected] geno :Formal class 'ColumnLinkedMatrix' [package "LinkedMatrix"] with 1 slot
#>   .. .. [email protected] .Data:List of 3
#>   .. .. .. ..$ :BEDMatrix: 199 x 300 [/home/agrueneberg/.pkgs/R/BGData/extdata/chr1.bed]
#>   .. .. .. ..$ :BEDMatrix: 199 x 300 [/home/agrueneberg/.pkgs/R/BGData/extdata/chr2.bed]
#>   .. .. .. ..$ :BEDMatrix: 199 x 300 [/home/agrueneberg/.pkgs/R/BGData/extdata/chr3.bed]
#>   [email protected] pheno:'data.frame':       199 obs. of  7 variables:
#>   .. ..$ FID      : int [1:199] 5837 6008 6009 6016 6040 6042 6043 6046 6064 6074 ...
#>   .. ..$ IID      : int [1:199] 5837 6008 6009 6016 6040 6042 6043 6046 6064 6074 ...
#>   .. ..$ PAT      : int [1:199] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. ..$ MAT      : int [1:199] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. ..$ SEX      : int [1:199] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. ..$ PHENOTYPE: int [1:199] -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 ...
#>   .. ..$ FT10     : num [1:199] 57 60 98 75 71 56 90 93 96 91 ...
#>   [email protected] map  :'data.frame':       900 obs. of  6 variables:
#>   .. ..$ chromosome        : int [1:900] 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..$ snp_id            : chr [1:900] "snp1" "snp2" "snp3" "snp4" ...
#>   .. ..$ genetic_distance  : int [1:900] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. ..$ base_pair_position: int [1:900] 657 3102 4648 4880 5975 6063 6449 6514 6603 6768 ...
#>   .. ..$ allele_1          : chr [1:900] "T" "G" "A" "T" ...
#>   .. ..$ allele_2          : chr [1:900] "C" "A" "C" "C" ...

Saving a BGData object

A BGData object can be saved like any other R object using the save function:

save(bg, file = "BGData.RData")

Loading a BGData object

The genotypes in a BGData object can be of various types, some of which need to be initialized in a particular way. The load.BGData takes care of reloading a saved BGData object properly:

load.BGData("BGData.RData")
#> Loaded objects: bg

Summarizing data

Use chunkedApply to count missing values (among others):

countNAs <- chunkedApply(X = [email protected], MARGIN = 2, FUN = function(x) sum(is.na(x)))

Use the summarize function to calculate minor allele frequencies and frequency of missing values:

Running GWASes with different regression methods

A data structure for genomic data is useful when defining methods that act on both phenotype and genotype information. We have implemented a GWAS function that supports various regression methods. The formula takes phenotypes from @pheno and inserts one marker at a time.

gwas <- GWAS(formula = FT10 ~ 1, data = bg)

Generating the G Matrix

G <- getG([email protected])

Installation

Install the stable version from CRAN:

install.packages("BGData")

Alternatively, install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("QuantGen/BGData")

Contribute

Documentation

Further documentation can be found on RDocumentation.

News

BGData 2.1.0

  • Add chunkedMap() function.
  • Improve error handling in chunkedMap() and chunkedApply().
  • summarize(): Improve performance.
  • GWAS(): Improve performance of rayOLS method.
  • GWAS(): Fix bug when computing p-values for methods other than rayOLS, lsfit, or SKAT when i is used to subset samples.
  • GWAS(): Fix wrong results in lsfit method when covariates with missing values are used.
  • as.BGData(): Fix bug loading .fam and .bim files when path contains the word bed.

BGData 2.0.0

Breaking Changes

  • Rename bufferSize to chunkSize.
  • Remove nTasks parameter from chunkedApply() and methods based on it.
  • Remove crossprods function.

Other Changes

  • Change chunking strategy to improve parallelism: instead of loading a subset of chunkSize in the main process, load a subset of chunkSize in the each fork. That way nTasks is not necessary anymore and the same code can be used for one core and multiple cores.
  • Add findRelated() function for use with matrices and symDMatrix objects.
  • Add orderedMerge() function that allows for phenotypes to be easily merged into a BGData object.
  • Performance improvements in getG() function: use single shared memory matrix to collect results.
  • Performance improvements in rayOLS method in GWAS() function.
  • getG_symDMatrix(): Support version 2 of symDMatrix package.
  • getG_symDMatrix(): Add chunkSize parameter.
  • getG_symDMatrix(): Add minVar parameter.
  • as.BGData(): Use rownames of BEDMatrix object as rownames for pheno, and colnames of BEDMatrix object as rownames for map.
  • Include process ID in verbose output if nCores > 1.

Bug Fixes

  • getG_symDMatrix(): Fix scaling error when scale = FALSE.
  • getG_symDMatrix(): Compute block indices correctly for out-of-order, non-sequential indices.
  • getG_symDMatrix(): Do not include centers and scales in attributes anymore because the influence of j and minVar is difficult to retain.

BGData 1.0.0

Initial release.

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("BGData")

2.1.0 by Alexander Grueneberg, 2 months ago


https://github.com/QuantGen/BGData


Report a bug at https://github.com/QuantGen/BGData/issues


Browse source code at https://github.com/cran/BGData


Authors: Gustavo de los Campos [aut] , Alexander Grueneberg [aut, cre] , Paulino Perez [ctb] , Ana Vazquez [ctb]


Documentation:   PDF Manual  


MIT + file LICENSE license


Imports methods, parallel, crochet, bigmemory, synchronicity, ff, bit

Depends on BEDMatrix, LinkedMatrix, symDMatrix

Suggests data.table, lme4, SKAT, testthat, covr


Suggested by LinkedMatrix, symDMatrix.


See at CRAN