Subsemble is a general subset ensemble prediction method, which can be used for small, moderate, or large datasets. Subsemble partitions the full dataset into subsets of observations, fits a specified underlying algorithm on each subset, and uses a unique form of V-fold cross-validation to output a prediction function that combines the subset-specific fits. An oracle result provides a theoretical performance guarantee for Subsemble.
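The partition-fit-combine scheme described above can be sketched in a few lines of base R. This is a simplified illustration of the idea, not the package's implementation: it uses a single learner (`lm`) where the package supports a full library of wrappers, and all object names are local to the sketch.

```r
# Simplified sketch of the Subsemble scheme with one learner (lm) and J subsets.
set.seed(1)
n <- 200
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- 0.5 * x$x1 - 0.3 * x$x2 + rnorm(n, sd = 0.1)

J <- 3                                                # number of subsets
V <- 5                                                # cross-validation folds
subset_id <- sample(rep(seq_len(J), length.out = n))  # partition rows into J subsets
fold_id   <- sample(rep(seq_len(V), length.out = n))  # V folds, crossed with the subsets

# Level-one data: column j holds cross-validated predictions from subset j's fits.
Z <- matrix(NA_real_, nrow = n, ncol = J)
for (j in seq_len(J)) {
  for (v in seq_len(V)) {
    train <- which(subset_id == j & fold_id != v)     # subset j minus fold v
    fit   <- lm(y ~ x1 + x2, data = cbind(x[train, ], y = y[train]))
    test  <- which(fold_id == v)                      # all fold-v rows get predictions
    Z[test, j] <- predict(fit, newdata = x[test, ])
  }
}

# Metalearner: combine the J subset-specific prediction columns.
meta <- lm(y ~ ., data = data.frame(Z, y = y))
```

Because every observation's fold is held out of the subset-level fits that predict it, the level-one matrix `Z` contains honest cross-validated predictions for the metalearner to combine.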
Notice to users: The interface (function arguments/values) may be subject to change prior to version 1.0.0.
"SL.gbm"is part of the library. This is because the
"SL.gbm"function uses multicore parallelization by default and R gets upset when you have two levels of multicore parallelism. This may be fixed in a future release. This is the error:
Error in checkForRemoteErrors(val) : 7 nodes produced errors; first error: cannot open the connection
- Known issue: errors can occur with factor variables for some wrappers, such as `SL.glm`. This occurs in the internal CV step, when the validation set has new levels that are not in the training set. These can be fixed by updating the wrapper functions.
- To do: Add support for the `method` function type from the `SuperLearner::SuperLearner` function as an option for the metalearning step.
- To do: Add `cvRisk` to the subsemble output.
- To do: Currently, the `learner` and `metalearner` arguments are not validated. Validate them using the `.check.SL.library` function from the SuperLearner package, or write a validation function from scratch.
- To do: Add a `README.md` file with an overview and examples.
- To do: Implement the Supervised Regression Tree Subsemble (enabled via `subControl$supervised = "SRT"` or something similar).
- To do: Maybe remove the ability to pass an integer to the `subsets` argument and reserve that argument for specific subset (row index) lists only. It might make more sense to just set the number of subsets via a control list.
- To do: Fix prediction for models trained with `genControl$saveFits = FALSE`. Currently, an unclear error message comes up: `No applicable method for 'predict' applied to an object of class "NULL"`, since the saved fits are `NULL`.
- To do: Consider removing the `seed` argument.
- To do: Modify the `subsemble` internals to accept wrappers using both the `function(Y, X, newX)` notation (from the SuperLearner package) and the `function(x, y, newx)` notation, translating between the two as needed.
- To do: Allow `learner` to be passed as a list instead of a vector.
- Renamed an internal function to `.fitWrapper` to avoid naming conflicts with `parSapply` functions. This was a bug introduced by the previous argument name change from `X, Y, newX` to `x, y, newx`.
- Changed the `X, Y, newX` arguments of the `subsemble` function to `x, y, newx` in order to conform to more common ML algorithm conventions in R. Additionally, the order of the first two arguments was changed to `(x, y)` to match common convention. Lastly, the `newdata` argument in the `predict.subsemble` function was changed to `newx`: the `predict` function expects the new data.frame/matrix to have the same structure as the design matrix, `x`, hence it is more descriptive to name this argument `newx`.
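With the renamed interface, a fit-and-predict call looks roughly like this. This is an illustrative sketch, not a runnable example: `x`, `y`, and `newx` stand in for the user's own data, and the learner choices are placeholders.

```r
# Hypothetical usage of the renamed arguments (x, y, newx); 'x', 'y', and
# 'newx' are placeholders for the user's own data objects.
fit  <- subsemble(x = x, y = y,
                  learner = c("SL.randomForest", "SL.glm"),
                  metalearner = "SL.glm")
pred <- predict(fit, newx = newx)  # newx: same structure as the design matrix x
```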
- Added a `runtime` element to the `subsemble` output, which records the execution times of various steps in the algorithm.
- Converted the `control.R` functions to internal functions.
learner = c("SL.randomForest","SL.randomForest","SL.glm").
- Renamed an internal function to `.make_Z_l`, since it operates on a single learner only.
- Renamed `L` in multiple places in the internal code.
- Defaulted to `stratifyCV = FALSE` in both the CV control and `subControl` when the outcome is not binary.
- Updated the `parallel` argument of the `predict.subsemble` function.
- Modified the `predict.subsemble` code to force individual learner predictions into a vector; the `SL.earth` wrapper created errors because it was returning a 1-column matrix instead of a vector. Also modified the function to assign column names to the `subpred` data.frame after inserting the predictions instead of before.
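The coercion at the heart of this fix can be illustrated in base R. This is a generic sketch, not the package's code; the object names are invented for the example.

```r
# Some wrappers return a 1-column matrix rather than a plain vector of
# predictions; as.vector() normalizes the shape before further assembly.
p_matrix <- matrix(c(0.1, 0.9, 0.4), ncol = 1)  # what such a wrapper returns
p_vector <- as.vector(p_matrix)                 # forced into a plain vector

is.matrix(p_matrix)  # TRUE
is.matrix(p_vector)  # FALSE: now a numeric vector of length 3
```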
- Added a `genControl` argument to the `subsemble` function to allow the user to not have to save all the model fits for a one-off type analysis. This also required adding a `gen_control` function to the `control.R` script and modifying some internal functions.
- Changed the `parallel` argument in the `subsemble` function from a logical argument to a string equal to `"seq"` (the default) or `"multicore"`.
- In the `subsemble` function, changed the name of an internal object to `subCVsets`, which matches the output list name.
- Renamed the `subsemble.predict` and `sublearner.predict` output elements of the `subsemble` function to `pred` and `subpred`, respectively. These were previously named to match the `SuperLearner` output conventions, but now they are nicer and easier to type.
- Renamed an output object of the `subsemble` function to `subfits`, because it made more sense for the name to match `metafit`. This also required updating other internals. Removed the `.predFun` function, since it was not being used for anything.
- Renamed a variable to `m` to be consistent with the notation in the technical paper.
- Updated the `subsemble` function documentation. (Might need to update again when the `multiType` option is added.)
- Fixed a bug in `.subFun` that happens when `row.names(X)` is not `1:N`. Fixed by simply re-assigning `row.names(X)` as `1:N`. This resolved: `Error in FUN(1:3[[1L]], ...) : names not identical`
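The workaround amounts to normalizing the row names up front. A generic sketch (the data here is invented for illustration):

```r
# Re-assign row names as 1:N so downstream split/apply steps see the
# identical, automatic row names they expect.
X <- data.frame(a = 1:3, b = 4:6, row.names = c("obs7", "obs8", "obs9"))
row.names(X) <- seq_len(nrow(X))

row.names(X)  # "1" "2" "3"
```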
- Added the `learn_control` function and a `learnControl` argument to the `subsemble` function. Also renamed a control function to `learn_control` and implemented the `"crossprod"` type of library expansion.
- Updated `subControl` and added a `supervision = NULL` parameter, which is a placeholder for the Supervised Regression Tree Subsemble, not implemented yet.
- Removed an unused control function (`subsemble.CV.control`), since it's not being used.
- Fixed a bug where the `seed` was not being set. Also added `set.seed(seed)` inside functions with inherent randomness (like training functions) that were being applied in parallel, to ensure that the results would be the same regardless of whether the operations were run in parallel or not.
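The seeding idea can be demonstrated with the base `parallel` package. This is a sketch, not the package's code: the function and seed offset are invented for illustration, and `mc.cores > 1` requires a Unix-like OS.

```r
library(parallel)

# Seed inside the function applied to each task, so results do not depend on
# whether tasks run sequentially or in parallel, or on worker scheduling.
draw <- function(i, seed) {
  set.seed(seed + i)  # per-task seed fixes the RNG stream for task i
  rnorm(1)
}

seq_res <- vapply(1:4, draw, numeric(1), seed = 42)
par_res <- unlist(mclapply(1:4, draw, seed = 42, mc.cores = 2))
identical(seq_res, par_res)  # TRUE
```

Because each task's random stream is determined only by `seed + i`, the sequential and parallel runs produce identical results.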
- Renamed a variable to `xmat` to avoid conflict with an internal name.
- Modified the `.makeZ` internal function in order to more easily switch between `sapply` and `parSapply`, and adjusted the `parSapply` output in order to match the output of `sapply`. The mismatch was causing the following error: `"Error in cvRes[j, ] (from #145) : incorrect number of dimensions"`
- Removed `cv_control`, since it's not being used.
- Moved the `id` argument in the `subsemble` function to later on in the argument list, since it's not a commonly used argument.
metalearner = "SL.knn":
Error in as.matrix(train) : argument "X" is missing, with no default.
predict(object, newdata)was not working for
- Fixed the `learner` default argument, which was listed incorrectly.
- Added row names (taken from the `X` row names) to the `Z` data.frame, a list element of the output.
- Removed explicit loading of the SuperLearner package inside the `subsemble` package, since it is already loaded as a dependency.
- Previously used the `SuperLearner.CV.control` function in the SuperLearner package. Replaced it with a modified version of the function called `subsemble.CV.control`. This updated version of the `SuperLearner.CV.control` function overrides the `validRows` argument.
- Allowed the `learner` argument to be a character vector with a length that is a divisor of J (the number of subsets).
- Generalized the `metalearner` to be any general learner available in the SuperLearner candidate learner API, such as `"SL.randomForest"`. Previously, the metalearner was just a glm.
- Previously, each `learner` had to be a string matching `"SL\\.[0-9a-zA_Z]+"`. Now the `learner` and `metalearner` arguments are validated with a dedicated check.
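A minimal regex check along these lines can be written with `grepl`. This is an illustrative sketch only: the helper name is hypothetical, and the pattern is a cleaned-up version of the one quoted above.

```r
# Hypothetical sketch of regex-based validation of SL wrapper names.
is_valid_wrapper <- function(s) grepl("^SL\\.[0-9a-zA-Z]+", s)

is_valid_wrapper("SL.glm")           # TRUE
is_valid_wrapper("SL.randomForest")  # TRUE
is_valid_wrapper("glm")              # FALSE
```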
- Fixed a bug in the `subsemble` function that caused errors when learners returned unlabeled predictions (missing row index name), as well as when they returned a 1-column data.frame instead of a vector for the pred object. The former occurred with `learner = c("SL.bart")` and the latter occurred with `learner = c("SL.glmnet")`.