An extensible framework to create and preprocess design matrices. Recipes consist of one or more data manipulation and analysis "steps". Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets. The resulting design matrices can then be used as inputs into statistical or machine learning models.
Edwin Thoen suggested adding validation checks for certain data characteristics. This fed into the existing notion of expanding recipes beyond steps (see the non-step steps project). A new set of operations, called checks, can now be used. These should throw an informative error when the check conditions are not met and return the existing data otherwise.
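The contract described for checks (error when the condition fails, pass the data through otherwise) can be sketched in base R. This is only an illustration of the behavior: `check_no_missing` is a hypothetical helper, not a function from recipes, and the package's own checks may be implemented differently.

```r
# Hypothetical helper, not part of recipes: a check either stops with an
# informative error or returns its input unchanged.
check_no_missing <- function(data, columns) {
  # One logical per column: does it contain any NA?
  has_na <- vapply(data[columns], anyNA, logical(1))
  if (any(has_na)) {
    stop(
      "Missing values found in: ",
      paste(names(has_na)[has_na], collapse = ", "),
      call. = FALSE
    )
  }
  data
}

complete <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))
incomplete <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

# Passes: the data come back untouched.
checked <- check_no_missing(complete, c("x", "y"))

# Fails: capture the error message instead of halting the script.
msg <- tryCatch(
  check_no_missing(incomplete, c("x", "y")),
  error = function(e) conditionMessage(e)
)
```

Returning the data unchanged on success is what lets a check sit in the same sequence as ordinary steps.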
Steps now have a skip option that will not apply preprocessing when bake is used. See the article on skipping steps for more information.
check_missing will validate that none of the specified variables contain missing data.
step_num2factor can be used to convert numeric data (especially integers) to factors.
step_novel adds a new factor level to nominal variables that will be used when new data contain a level that did not exist when the recipe was prepared.
step_profile can be used to generate design matrix grids for prediction profile plots of additive models where one variable is varied over a grid and all of the others are fixed at a single value.
step_upsample can be used to change the number of rows in the data based on the frequency distributions of a factor variable in the training set. By default, this operation is only applied to the training set; bake ignores this operation.
step_spatialsign now has the option of removing missing data prior to computing the norm.
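The idea behind step_novel in the list above can be sketched in base R: levels seen at training time are kept, and any previously unseen value is mapped to one extra level. The helper name `add_novel_level` and the label "new" are assumptions made for this illustration; this is not the package's implementation.

```r
# Hypothetical helper illustrating the idea; recipes' own code may differ.
add_novel_level <- function(x, training_levels, novel = "new") {
  x <- as.character(x)
  # Anything not seen during training (and not missing) becomes the novel level.
  x[!is.na(x) & !(x %in% training_levels)] <- novel
  factor(x, levels = c(training_levels, novel))
}

# Levels are learned from the training data only.
train <- factor(c("a", "b", "a"))
train_levels <- levels(train)

# "c" never appeared in training; NA is left as missing.
new_data <- c("a", "c", "b", NA)
encoded <- add_novel_level(new_data, train_levels)
```

Fixing the level set at training time keeps the columns of any downstream dummy-variable encoding stable when new data arrive.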
bake was changed from
prep is now defaulted to
step_dummy was fixed so that the correct binary variables are generated regardless of the levels or values of the incoming factor. Also, step_dummy now requires factor inputs.
step_dummy also has a new default naming function that works better for factors. However, there is now an extra argument (ordinal) to the functions that can be passed to
step_interact now allows for selectors (e.g. starts_with("prefix")) to be used in the interaction formula.
dplyr::one_of was added to the list of selectors.
step_bs adds B-spline basis functions.
step_unorder converts ordered factors to unordered factors.
step_count counts the number of instances that a pattern exists in a string.
step_factor2string can be used to move between encodings.
step_lowerimpute is for numeric data where the values cannot be measured below a specific value. For these cases, random uniform values are used for the truncated values.
tidy methods were added for recipes and many (but not all) steps.
In bake.recipe, the argument newdata is now without a default.
juice can now save the final processed data set in sparse format. Note that, as the steps are processed, a non-sparse data frame is used to store the results.
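The imputation described for step_lowerimpute in the list above can be sketched in base R. The sketch assumes non-negative data with a positive measurement floor, and assumes the uniform draws fall between zero and that floor; the entry itself does not state the range. `impute_below_floor` is a hypothetical helper, not the package function.

```r
# Hypothetical helper: values recorded at (or below) the measurement floor
# are replaced by uniform draws between 0 and the floor (an assumed range).
impute_below_floor <- function(x, floor_value) {
  truncated <- !is.na(x) & x <= floor_value
  x[truncated] <- runif(sum(truncated), min = 0, max = floor_value)
  x
}

set.seed(1)  # make the random draws reproducible
# An assay that cannot report values below 0.5.
assay <- c(0.5, 0.5, 2.3, 4.1, 0.5)
imputed <- impute_below_floor(assay, floor_value = 0.5)
```

Values above the floor are left as measured; only the truncated entries are replaced.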
First CRAN release.
prepper issue #59
step_lincomb removes variables involved in linear combinations to resolve them.
step_regex applies a regular expression to a character or factor vector to create dummy variables.
step_interact does a better job of respecting missing values in the data set.
recipe objects were changed so that pipes can be used to create the recipe with a formula.
role argument in favor of a general set of selectors. If no selector is used, all the predictors are returned.