Extends lasso and elastic-net model fitting to ultrahigh-dimensional, multi-gigabyte data sets that cannot be loaded into memory. It is much more memory- and computation-efficient than existing lasso-fitting packages such as `glmnet` and `ncvreg`, thus allowing for very powerful big data analysis even on an ordinary laptop.
`biglasso` extends lasso and elastic-net linear and logistic regression models to ultrahigh-dimensional, multi-gigabyte data sets that cannot be loaded into memory. It utilizes memory-mapped files to store the massive data on disk, reading pieces into memory only when necessary during model fitting. Moreover, newly proposed feature-screening rules are implemented to accelerate model fitting. As a result, this package is much more memory- and computation-efficient and highly scalable compared to existing lasso-fitting packages such as `glmnet` and `ncvreg`. Benchmarking experiments using both simulated and real data sets show that `biglasso` is not only 1.5x to 4x faster than existing packages, but also at least 2x more memory-efficient. More importantly, to the best of our knowledge, `biglasso` is the first R package that enables users to fit lasso models with data sets larger than available RAM, thus allowing for powerful big data analysis on an ordinary laptop.
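A minimal sketch of this out-of-core workflow, assuming the `biglasso` and `bigmemory` packages are installed (file names here are hypothetical placeholders):

```r
library(bigmemory)
library(biglasso)

## One-time conversion of the raw data into a file-backed big.matrix;
## afterwards the design matrix lives on disk and is memory-mapped,
## not loaded into RAM.
X <- setupX("X.raw.txt", sep = ",")   # writes backing/descriptor files to disk
y <- readRDS("y.rds")                 # the response vector fits in memory

## Fit the lasso path; data is paged into RAM only as needed.
fit <- biglasso(X, y, family = "gaussian")
plot(fit)
```

In later sessions the same backing files can be reattached with `X <- attach.big.matrix(...)` on the descriptor file, so the expensive conversion happens only once.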
`biglasso` is at least 2x more memory-efficient than existing packages.

Benchmark setup: packages compared are `biglasso` (with `screen = "SSR-BEDPP"`), `ncvreg` (3.7-0), `glmnet`, and `picasso`. The experiment solves the lasso along the entire path of 100 `lambda` values equally spaced on the scale of `lambda / lambda_max` from 0.1 to 1, varying the number of observations `n` and the number of features `p`; over 20 replications, the mean (SE) computing time (in seconds) is reported.
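Such a path can be requested directly through the package's arguments (a sketch; `X` and `y` are hypothetical data objects, and `lambda.min` is specified as a fraction of `lambda_max`):

```r
library(biglasso)

## X: file-backed big.matrix; y: numeric response (both hypothetical here).
fit <- biglasso(X, y,
                family     = "gaussian",
                screen     = "SSR-BEDPP",  # screening rule used in the benchmarks
                nlambda    = 100,          # 100 lambda values along the path
                lambda.min = 0.1)          # smallest lambda = 0.1 * lambda_max
```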
The response is simulated as `y = X * beta + 0.1 * eps`, where the entries of `eps` are i.i.d. random noise.
In all settings, `biglasso` (1 core) is uniformly 2x faster than `ncvreg` and 2.5x faster than `picasso`. Moreover, the computing time of `biglasso` can be further halved via parallel computation on 4 cores.
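The 4-core timing corresponds to the package's `ncores` argument, which enables OpenMP parallelism within a single fit (a sketch with hypothetical `X` and `y`; assumes the package was built with OpenMP support):

```r
library(biglasso)

## X: file-backed big.matrix; y: numeric response (hypothetical objects).
fit <- biglasso(X, y, family = "gaussian", ncores = 4)  # parallel fit on 4 cores
```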
To demonstrate that `biglasso` is much more memory-efficient, we simulated a 1,000 x 100,000 feature matrix; the raw data is 0.75 GB. We used Syrupy to measure the memory used in RAM (i.e., the resident set size, RSS) every 1 second during lasso model fitting by each of the packages.
The maximum RSS (in GB) used by a single fit and by 10-fold cross-validation is reported in the table below. In the single-fit case, `biglasso` consumes 0.84 GB of RAM, 50% of that used by `glmnet` and 22% of that used by `picasso`. Note that the memory consumed by `glmnet`, `ncvreg`, and `picasso` is, respectively, 2.2x, 2.1x, and 5.1x larger than the size of the raw data. More strikingly, unlike the other packages, `biglasso` requires no additional memory to perform cross-validation. For serial 10-fold cross-validation, `biglasso` requires just 27% of the memory used by `glmnet` and 23% of that used by `ncvreg`, making it 3.6x and 4.3x more memory-efficient than these two packages, respectively.
* Note that the memory savings offered by `biglasso` would be even more significant if cross-validation were conducted in parallel. However, measuring memory usage across parallel processes is not straightforward and is not implemented in Syrupy.
* Cross-validation is not implemented in `picasso` at this point.
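Cross-validation in `biglasso` is performed with `cv.biglasso`, which reuses the same memory-mapped design matrix across folds rather than copying the data (a sketch with hypothetical `X` and `y`):

```r
library(biglasso)

## X: file-backed big.matrix; y: numeric response (hypothetical objects).
cvfit <- cv.biglasso(X, y, family = "gaussian",
                     nfolds = 10,   # serial 10-fold CV as in the table
                     ncores = 1)
plot(cvfit)
cvfit$lambda.min   # lambda value minimizing the cross-validation error
```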
The performance of the packages is also tested using diverse real data sets:
The following table summarizes the mean (SE) computing time (in seconds) for solving the lasso along the entire path of 100 `lambda` values equally spaced on the scale of `lambda / lambda_max` from 0.1 to 1, over 20 replications.
| Package    | (data set 1) | (data set 2) | (data set 3) | (data set 4) |
|------------|--------------|--------------|--------------|--------------|
| `picasso`  | 1.50 (0.01)  | 6.86 (0.06)  | 34.00 (0.47) | 44.24 (0.46) |
| `ncvreg`   | 1.14 (0.02)  | 5.60 (0.06)  | 31.55 (0.18) | 32.78 (0.10) |
| `glmnet`   | 1.02 (0.01)  | 5.63 (0.05)  | 23.23 (0.19) | 33.38 (0.08) |
| `biglasso` | 0.54 (0.01)  | 1.48 (0.10)  | 17.17 (0.11) | 14.35 (1.29) |
To demonstrate the out-of-core computing capability of `biglasso`, we analyzed a 31 GB real data set from a large-scale genome-wide association study. The dimensions of the design matrix are `n = 2,898` and `p = 1,339,511`. Note that the data is nearly 2x larger than the installed 16 GB of RAM.
Since the other three packages cannot handle this data-larger-than-RAM case, we compare the performance of the screening rules `SSR` and `SSR-BEDPP`, both implemented in our package `biglasso`. In addition, two settings of `lambda_min` are considered: (1) `lam_min = 0.1 * lam_max`; and (2) `lam_min = 0.5 * lam_max`, since in practice there is typically less interest in small values of `lambda` for very high-dimensional data such as this. Again, the entire solution path of 100 `lambda` values is computed. The table below summarizes the overall computing time (in minutes) of screening rule `SSR` (which is what the other three packages use) and our new rule `SSR-BEDPP`. (Only 1 trial was conducted.)
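A sketch of the corresponding calls for this data-larger-than-RAM analysis (file name, separator, and the response object `y` are hypothetical; `screen` and `lambda.min` follow the settings described above):

```r
library(biglasso)

## One-time conversion of the 31 GB raw file to a memory-mapped backing file.
X <- setupX("gwas.raw.txt", sep = "\t")

## Case (2): stop the path at lam_min = 0.5 * lam_max.
fit <- biglasso(X, y, family = "gaussian",
                screen     = "SSR-BEDPP",
                nlambda    = 100,
                lambda.min = 0.5)
```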
** added `dfmax` option. (Thank you, Florian Privé!)
1.2-6 (12/15/2016)
** added a newly proposed screening rule, SSR-Slores, for lasso-penalized logistic regression.
** added SSR-BEDPP for elastic-net-penalized linear regression.

1.2-1 (10/18/2016)
** implemented another new screening rule, SSR-BEDPP, which also combines the hybrid strong rule with a safe rule (BEDPP).
** implemented the EDPP rule with an active-set cycling strategy for linear regression.

1.1-1 (8/31/2016)
** Three sparse screening rules (SSR, EDPP, SSR-Dome) were implemented. Our newly proposed HSR-Dome combines HSR and the Dome test for feature screening, leading to even better performance compared to `glmnet`.
** OpenMP parallel computing was added to speed up single model fitting.
** Both the exact Newton and the majorization-minimization (MM) algorithms for logistic regression were implemented. The latter can be faster, especially in data-larger-than-RAM cases.
** Source code was rewritten in pure C++.

1.0-1 (3/1/2016)
** Package ready for CRAN submission.