step_filter(), step_slice(), step_sample(), and step_naomit() had their defaults for skip changed to TRUE. In the vast majority of applications, these steps should not be applied to the test or assessment sets.
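A minimal sketch of what the skip default means in practice, assuming a small made-up data frame (train, x, and y are illustrative names, not from the package). The row-removing step runs when the recipe is prepped, but bake() leaves new data alone:

```r
library(recipes)

# Toy training data with one missing predictor value (made up for illustration)
train <- data.frame(x = c(1, 2, NA, 4), y = c(2, 4, 6, 8))

rec <- recipe(y ~ x, data = train) %>%
  step_naomit(x, skip = TRUE)  # written out explicitly; it is now the default

prepped <- prep(rec, training = train)

# The step is applied to the training set during prep()
n_train <- nrow(juice(prepped))

# The step is skipped when new data are baked, so every row is kept
n_new <- nrow(bake(prepped, new_data = train))
```

Under these assumptions, n_train counts only the complete rows while n_new matches the number of rows passed to bake().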
tidyr version 1.0.0 or later is now required.
step_pls() was changed so that it uses the Bioconductor mixOmics package. Objects created with previous versions of recipes can still use juice() and bake(). With the current version, categorical outcomes can be used, but multivariate models are no longer supported. Also, the new method allows for sparse results.
As suggested by @StefanBRas, step_ica() now defaults to the C engine (#518).
Avoided partial matching on seq() arguments in internal functions.
Improved error messaging, for example when a user tries to prep() a tunable recipe.
step_upsample() and step_downsample() are soft deprecated in recipes as they are now available in the themis package. They will be removed in the next version.
step_zv() now handles NA values so that variables with zero variance plus missing values are removed.
The selectors all_of() and any_of() can now be used in step selections (#477).
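As a rough illustration of these selectors, the sketch below keeps column names in a character vector and passes them to steps. It uses the built-in mtcars data; the choice of columns and the name "not_a_column" are arbitrary and only for demonstration:

```r
library(recipes)

# Column names held in a character vector (arbitrary choice for illustration)
vars <- c("disp", "hp")

rec <- recipe(mpg ~ ., data = mtcars) %>%
  # all_of() requires every listed name to exist
  step_center(all_of(vars)) %>%
  # any_of() ignores names that are not present in the data
  step_scale(any_of(c("wt", "not_a_column")))

prepped <- prep(rec, training = mtcars)
processed <- juice(prepped)
```

all_of() is the stricter choice when a missing column should be treated as an error; any_of() suits optional columns.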
The tune package can now use recipes with check operations (but also requires tune >= 0.1.0.9000).
The tidy method for step_pca() now has an option for returning the variance statistics for each component.
Although recipes does not directly depend on dials, it has several S3 methods for generics in dials. Version 0.0.5 of dials added stricter validation for these methods, so changes were required for recipes.
step_cut() enables you to create a factor from a numeric based on provided breaks (contributed by Edwin Thoen).
Renamed yj_trans() to yj_transform() to avoid conflicts.
Added flexible naming options for new columns created by step_depth() and step_classdist() (#262).
Small changes for base R’s stringsAsFactors change.
Delayed S3 method registration for tune::tunable() methods that live in recipes will now work correctly on R >= 4.0.0 (#439, tidymodels/tune#146).
step_relevel() was added.
recipes 0.1.8
The imputation steps no longer change the data type being imputed. Previously, if the data were integer, the data would be changed to numeric (for some step types). The change is breaking since the underlying data of imputed values are now saved as a list instead of a vector (for some step types).
The data sets were moved to the new modeldata package.
step_num2factor() was rewritten due to a bug that ignored the user-supplied levels (#425). The transform argument is now required to be a function, and levels must now be supplied.
Using a minus in the formula to recipe() is no longer allowed (it didn’t remove variables anyway). step_rm() or update_role() can be used instead.
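A short sketch of the two alternatives, using the built-in mtcars data. The columns chosen and the role name "id" are arbitrary and only for illustration:

```r
library(recipes)

# Option 1: drop the column with a step instead of writing `mpg ~ . - cyl`
rec_rm <- recipe(mpg ~ ., data = mtcars) %>%
  step_rm(cyl)
removed <- juice(prep(rec_rm, training = mtcars))

# Option 2: keep the column in the data but stop treating it as a predictor
# ("id" is an arbitrary role name chosen for this example)
rec_role <- recipe(mpg ~ ., data = mtcars) %>%
  update_role(cyl, new_role = "id")
roles <- summary(rec_role)
cyl_role <- roles$role[roles$variable == "cyl"]
```

step_rm() removes the column from the processed data, whereas update_role() keeps it available while excluding it from selectors such as all_predictors().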
When using a selector that returns no columns, juice() and bake() will now return a tibble with as many rows as the original template data or the new_data respectively. This is more consistent with how selectors work in dplyr (#411).
Code was added to explicitly register tunable methods when recipes is loaded. This is required because of changes occurring in R 4.0.
check_class() checks if a variable is of the designated class. Class is either learned from the train set or provided in the check. (contributed by Edwin Thoen)
step_normalize() and step_scale() gained a factor argument with values of 1 or 2 that can scale the standard deviations used to transform the data. (#380)
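To make the effect of the factor argument concrete, here is a small sketch with simulated data (the data frame and seed are made up for illustration). With factor = 2 the column is divided by two standard deviations, so the scaled column should end up with a standard deviation of one half:

```r
library(recipes)

set.seed(1)
dat <- data.frame(x = rnorm(50, mean = 10, sd = 3), y = rnorm(50))

rec <- recipe(y ~ x, data = dat) %>%
  step_scale(x, factor = 2)  # divide by two standard deviations

scaled <- juice(prep(rec, training = dat))
scaled_sd <- sd(scaled$x)
```

Dividing by two standard deviations is sometimes preferred so that coefficients for continuous predictors are on a scale comparable to those for binary predictors.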
bake() now produces a tibble with columns in the same order as juice() (#365).
recipes 0.1.7
Release driven by changes in tidyr (v 1.0.0).
format_selector()’s wdth argument has been renamed to width (#250).
step_mutate_at(), step_rename(), and step_rename_at() were added.
The use of varying() will be deprecated in favor of an upcoming function tune(). No changes are needed in this version, but subsequent versions will work with tune().
format_ch_vec() and format_selector() are now exported (#250).
check_new_values breaks bake if a variable contains values that were not observed in the train set (contributed by Edwin Thoen).
When no outcomes are in the recipe, using juice(object, all_outcomes()) and bake(object, new_data, all_outcomes()) will return a tibble with zero rows and zero columns (instead of failing) (#298). This will also occur when the selectors select no columns.
As alternatives to step_kpca(), two separate steps were added called step_kpca_rbf() and step_kpca_poly(). The use of step_kpca() will print a deprecation message that it will be going away.
step_nzv() and step_poly() had arguments promoted out of their options slot. options can be used in the short term but is deprecated.
step_downsample() will replace the ratio argument with under_ratio and step_upsample() will replace it with over_ratio. ratio still works (for now) but issues a deprecation message.
step_discretize() has arguments moved out of options too; the main arguments are now num_breaks (instead of cuts) and min_unique. Again, deprecation messages are issued with the old argument structure.
Models using the dimRed package (step_kpca(), step_isomap(), and step_nnmf()) would silently fail if the projection method failed. An error is issued now.
Methods were added for a future generic called tunable(). This outlines which parameters in a step can/could be tuned.
recipes 0.1.6
Release driven by changes in rlang.
Since 2018, a warning has been issued when the wrong argument was used in bake(recipe, newdata). The deprecation period is over and new_data is officially required.
Previously, if step_other() did not collapse any levels, it would still add an “other” level to the factor. This would lump new factor levels into “other” when data were baked (as step_novel() does). This no longer occurs since it was inconsistent with ?step_other, which said that “If no pooling is done the data are unmodified”.
step_normalize() centers and scales the data (if you are, like Max, too lazy to use two separate steps).
step_unknown() will convert missing data in categorical columns to “unknown” and update factor levels.
If the threshold argument of step_other is greater than one, then it specifies the minimum sample size before the levels of the factor are collapsed into the “other” category (#289).
step_knnimpute() can now pass two options to the underlying knn code, including the number of threads (#323).
Because of changes by CRAN, step_nnmf() only works on versions of R >= 3.6.0 (a consequence of dependency issues).
step_dummy() and step_other() are now tolerant to cases where that step’s selectors do not capture any columns. In this case, no modifications to the data are made. (#290, #348)
step_dummy() can now retain the original columns that are used to make the dummy variables. (#328)
step_other()’s print method only reports the variables with collapsed levels (as opposed to any column that was tested to see if it needed collapsing). (#338)
step_pca(), step_kpca(), step_ica(), step_nnmf(), step_pls(), and step_isomap() now accept zero components. In this case, the original data are returned.
recipes 0.1.5
Small release driven by changes in sample() in the current r-devel.
A new vignette discussing roles has been added.
To provide infrastructure for finalizing varying parameters, an update() method for recipe steps has been added. This allows users to alter information in steps that have not yet been trained.
step_interact will no longer fail if an interaction uses a column that has been previously filtered from the data. A warning is issued when this happens and no interaction terms will be created.
step_corr was made more fault tolerant for cases where the data contain a zero-variance column or columns with missing values.
Set the embedded environment to NULL in prep.step_dummy to reduce the file size of serialized recipe class objects when using saveRDS.
The tidy method for step_dummy now returns the original variable and the levels of the future dummy variables.
NA roles of existing columns (#296).
recipes 0.1.4
Several argument names were changed to be consistent with other tidymodels packages (e.g. dials) and the general tidyverse naming conventions.
K in step_knnimpute was changed to neighbors. step_isomap had the number of neighbors promoted to a main argument called neighbors.
step_pca, step_pls, step_kpca, and step_ica now use num_comp instead of num; step_isomap uses num_terms instead of num.
step_bagimpute moved nbagg out of the options and into a main argument trees.
step_bs and step_ns had degrees of freedom promoted to a main argument with name deg_free. Also, step_bs had degree promoted to a main argument.
step_BoxCox and step_YeoJohnson had nunique changed to num_unique.
bake, juice and other functions had newdata changed to new_data. For this version only, using newdata will only result in a warning.
na.rm changed to na_rm.
prep and a few steps had stringsAsFactors changed to strings_as_factors.
add_role() can now only add new additional roles. To alter existing roles, use update_role(). This change also allows for the possibility of having multiple roles/types for one variable (#221).
All steps gain an id field that will be used in the future to reference other steps.
The retain option to prep now defaults to TRUE. If verbose = TRUE, the approximate size of the data set is printed (#207).
step_integer converts data to ordered integers similar to LabelEncoder (#123 and #185).
step_geodist can be used to calculate the distance between geocodes and a single reference location.
step_arrange, step_filter, step_mutate, step_sample, and step_slice implement their dplyr analogs.
step_nnmf computes the non-negative matrix factorization for data.
The rsample function prepper was moved to recipes (issue).
step_string2factor will now accept factors and leave them as-is.
step_knnimpute now excludes missing data in the variable to be imputed from the nearest-neighbor calculation. Including them would have left some missing data unimputed (i.e. returned another missing value).
step_dummy now produces a warning (instead of failing) when non-factor columns are selected. Only factor columns are used; no conversion is done for character data (issue #186).
dummy_names gained a separator argument (issue #183).
step_downsample and step_upsample now have seed arguments for more control over randomness.
broom is no longer used to get the tidy generic. These are now contained in the generics package.
recipes 0.1.3
check_range breaks bake if the variable range in new data is outside the range that was learned from the train set (contributed by Edwin Thoen).
step_lag can lag variables in the data set (contributed by Alex Hayes).
step_naomit removes rows with missing data for specific columns (contributed by Alex Hayes).
step_rollimpute can be used to impute data in a sequence or series by estimating their values within a moving window.
step_pls can conduct supervised feature extraction for predictors.
step_log gained an offset argument.
step_log gained a signed argument (contributed by Edwin Thoen).
The internal functions sel2char and printer have been exported to enable other packages to contain steps.
When training new steps after some steps have been previously trained, the retain = TRUE option should be set on previous invocations of prep.
For step_dummy:
one_hot = TRUE option. Thanks to Davis Vaughan.
contrast option was removed. The step uses the global option for contrasts.
step_other will now convert novel levels of the factor to the “other” level.
step_bin2factor now has an option to choose how the values are translated to the levels (contributed by Michael Levy).
bake and juice can now export basic data frames.
The okc data were updated with two additional columns.
issue 125 that prevented several steps from working with dplyr grouped data frames. (contributed by Jeffrey Arnold)
issue 127 where options to step_discretize were not being passed to discretize.
recipes 0.1.2
Edwin Thoen suggested adding validation checks for certain data characteristics. This fed into the existing notion of expanding recipes beyond steps (see the non-step steps project). A new set of operations, called checks, can now be used. These should throw an informative error when the check conditions are not met and return the existing data otherwise.
Steps now have a skip option that will not apply preprocessing when bake is used. See the article on skipping steps for more information.
check_missing will validate that none of the specified variables contain missing data.
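A hedged sketch of how a check behaves, assuming two small made-up data frames (train and new_obs are illustrative names). The check passes the data through untouched when the condition holds and raises an error when it does not:

```r
library(recipes)

# Made-up data: the training set is complete, the new data has a missing value
train <- data.frame(x = c(1, 2, 3), y = c(1, 0, 1))
new_obs <- data.frame(x = c(4, NA), y = c(0, 1))

rec <- recipe(y ~ x, data = train) %>%
  check_missing(x)

prepped <- prep(rec, training = train)

# Condition met: the data are returned as they were
ok <- bake(prepped, new_data = train)

# Condition violated: bake() stops with an error, captured here with try()
result <- try(bake(prepped, new_data = new_obs), silent = TRUE)
failed <- inherits(result, "try-error")
```

In a pipeline, the error would normally be left uncaught so that bad input stops processing early; try() is used here only to keep the example running.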
detect_step can be used to check if a recipe contains a particular preprocessing operation.
step_num2factor can be used to convert numeric data (especially integers) to factors.
step_novel adds a new factor level to nominal variables that will be used when new data contain a level that did not exist when the recipe was prepared.
step_profile can be used to generate design matrix grids for prediction profile plots of additive models where one variable is varied over a grid and all of the others are fixed at a single value.
step_downsample and step_upsample can be used to change the number of rows in the data based on the frequency distributions of a factor variable in the training set. By default, this operation is only applied to the training set; bake ignores this operation.
step_naomit drops rows when specified columns contain NA, similar to tidyr::drop_na.
step_lag allows for the creation of lagged predictor columns.
now has the option of removing missing data prior to computing the norm.recipes
0.1.1bake
was changed from all_predictors()
to everything()
.verbose
option for prep
is now defaulted to FALSE
step_dummy was fixed so that the correct binary variables are generated regardless of the levels or values of the incoming factor. Also, step_dummy now requires factor inputs.
step_dummy also has a new default naming function that works better for factors. However, there is now an extra argument (ordinal) to the functions that can be passed to step_dummy.
step_interact now allows for selectors (e.g. all_predictors() or starts_with("prefix")) to be used in the interaction formula.
step_YeoJohnson gained an na.rm option.
dplyr::one_of was added to the list of selectors.
step_bs adds B-spline basis functions.
step_unorder converts ordered factors to unordered factors.
step_count counts the number of instances that a pattern exists in a string.
step_string2factor and step_factor2string can be used to move between encodings.
step_lowerimpute is for numeric data where the values cannot be measured below a specific value. For these cases, random uniform values are used for the truncated values.
step_zv).
tidy methods were added for recipes and many (but not all) steps.
In bake.recipe, the argument newdata is now without a default.
bake and juice can now save the final processed data set in sparse format. Note that, as the steps are processed, a non-sparse data frame is used to store the results.
recipes 0.1.0
First CRAN release.
Changed prepare to prep per issue #59.
recipes 0.0.1.9003
learn has become prepare and process has become bake.
recipes 0.0.1.9002
step_lincomb removes variables involved in linear combinations to resolve them.
step_bin2factor)
step_regex applies a regular expression to a character or factor vector to create dummy variables.
step_dummy and step_interact do a better job of respecting missing values in the data set.
recipes 0.0.1.9001
recipe objects was changed so that pipes can be used to create the recipe with a formula.
process.recipe lost the role argument in favor of a general set of selectors. If no selector is used, all the predictors are returned.