An extensible framework to create and preprocess design matrices. Recipes consist of one or more data manipulation and analysis "steps". Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets. The resulting design matrices can then be used as inputs into statistical or machine learning models.
recipes 0.1.2
Edwin Thoen suggested adding validation checks for certain data characteristics. This fed into the existing notion of expanding recipes beyond steps (see the non-step steps project). A new set of operations, called checks, can now be used. These should throw an informative error when the check conditions are not met and return the existing data otherwise.
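As a rough sketch of how a check behaves, the snippet below uses check_missing (one of the operations listed for this release) on made-up toy data. The data frames and object names are hypothetical. The data are passed to bake as the second positional argument, on the assumption that the name of that argument may differ between package versions.

```r
library(recipes)

# Toy data, made up for this illustration.
train   <- data.frame(x = c(1, 2, 3), y = c(2.1, 3.9, 6.2))
new_ok  <- data.frame(x = c(4, 5),    y = c(8.1, 9.8))
new_bad <- data.frame(x = c(4, NA),   y = c(8.1, 9.8))

# Add a check that the predictors contain no missing values.
rec <- recipe(y ~ x, data = train)
rec <- check_missing(rec, all_predictors())

prepped <- prep(rec, training = train)

# Check condition met: the data are returned.
ok <- bake(prepped, new_ok)

# Check condition violated: an error is signalled, captured here for inspection.
failed <- tryCatch(bake(prepped, new_bad), error = function(e) e)
```

Wrapping the failing call in tryCatch is only for demonstration; in a real pipeline the error would normally be allowed to stop execution.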
Steps now have a skip option that, when enabled, prevents the step's preprocessing from being applied when bake is used. See the article on skipping steps for more information.
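A minimal sketch of the skip option, assuming a log transformation of the outcome that should affect the training set but not new data. The toy data and object names are hypothetical, retain = TRUE is set explicitly so that juice can return the processed training set, and the new data are passed to bake positionally in case the argument name varies by version.

```r
library(recipes)

# Toy data, made up for this illustration.
train <- data.frame(x = c(1, 2, 3, 4), y = c(10, 100, 1000, 10000))
new   <- data.frame(x = c(5, 6),       y = c(10, 100))

# Log-transform the outcome, but mark the step to be skipped by bake.
rec <- recipe(y ~ x, data = train)
rec <- step_log(rec, y, base = 10, skip = TRUE)

prepped <- prep(rec, training = train, retain = TRUE)

# Processed training set: the log step has been applied to y.
train_y <- juice(prepped)$y

# New data: the skipped step is not applied, so y keeps its original scale.
new_y <- bake(prepped, new)$y
```

Comparing train_y with new_y shows the asymmetry: only the retained training data carry the transformed outcome.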
check_missing will validate that none of the specified variables contain missing data.
step_num2factor can be used to convert numeric data (especially integers) to factors.
step_novel adds a new factor level to nominal variables that will be used when new data contain a level that did not exist when the recipe was prepared.
step_profile can be used to generate design matrix grids for prediction profile plots of additive models, where one variable is varied over a grid and all of the others are fixed at a single value.
step_downsample and step_upsample can be used to change the number of rows in the data based on the frequency distribution of a factor variable in the training set. By default, this operation is only applied to the training set; bake ignores this operation.
step_spatialsign now has the option of removing missing data prior to computing the norm.

recipes 0.1.1
The default selector for bake was changed from all_predictors() to everything().
The verbose option for prep now defaults to FALSE.
A bug in step_dummy was fixed so that the correct binary variables are generated regardless of the levels or values of the incoming factor. Also, step_dummy now requires factor inputs.
step_dummy also has a new default naming function that works better for factors. However, the functions that can be passed to step_dummy now take an extra argument (ordinal).
step_interact now allows selectors (e.g. all_predictors() or starts_with("prefix")) to be used in the interaction formula.
step_YeoJohnson gained an na.rm option.
dplyr::one_of was added to the list of selectors.
step_bs adds B-spline basis functions.
step_unorder converts ordered factors to unordered factors.
step_count counts the number of instances that a pattern exists in a string.
step_string2factor and step_factor2string can be used to move between encodings.
step_lowerimpute is for numeric data where the values cannot be measured below a specific value. For these cases, random uniform values are used for the truncated values.
A step for removing zero-variance variables was added (step_zv).
tidy methods were added for recipes and many (but not all) steps.
In bake.recipe, the argument newdata no longer has a default.
bake and juice can now save the final processed data set in sparse format. Note that, as the steps are processed, a non-sparse data frame is used to store the results.

recipes 0.1.0
First CRAN release.
Renamed prepare to prep per issue #59.

recipes 0.0.1.9003
learn has become prepare and process has become bake.

recipes 0.0.1.9002
New steps:
step_lincomb removes variables involved in linear combinations to resolve them.
A step for converting binary variables to factors was added (step_bin2factor).
step_regex applies a regular expression to a character or factor vector to create dummy variables.
Other changes:
step_dummy and step_interact do a better job of respecting missing values in the data set.

recipes 0.0.1.9001
The class system for recipe objects was changed so that pipes can be used to create the recipe with a formula.
process.recipe lost the role argument in favor of a general set of selectors. If no selector is used, all the predictors are returned.
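To illustrate the piped style of recipe creation mentioned above, here is a small sketch that starts from a formula and adds steps with the pipe. It assumes that %>% is available once recipes is attached, and it uses the built-in mtcars data with step_center and step_scale purely as examples; whether those exact steps were present in this early development version is an assumption.

```r
library(recipes)

# Start the recipe from a formula, then pipe it through two steps.
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

# Nothing has been estimated yet; the recipe only records the planned steps.
n_steps <- length(rec$steps)
```

Because each step function takes the recipe as its first argument and returns the updated recipe, the same specification could also be written as nested or sequential function calls without the pipe.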