Ordering of Steps
Recipes place no constraints on the order in which steps are added. However, there are some general suggestions that you should consider:
- If using a Box-Cox transformation, don’t center the data first or do any operations that might make the data non-positive. Alternatively, use the Yeo-Johnson transformation so you don’t have to worry about this.
- Recipes do not automatically create dummy variables (unlike most formula methods). If you want to center, scale, or do any other operations on all of the predictors, run step_dummy first so that numeric columns are in the data set instead of factors.
- As noted in the help file for step_interact, you should make dummy variables before creating the interactions.
- If you are lumping infrequently occurring categories together with step_other, call step_other before step_dummy.
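The suggestions above can be sketched in a single recipe. This is only an illustration under stated assumptions: the data frame toy, its columns (y, x, city), and the 5% threshold are made up for the example and are not part of the recipes documentation.

```r
library(recipes)

# Hypothetical toy data: a strictly positive numeric column and a factor
# with two rare levels ("c" and "d")
set.seed(1)
toy <- data.frame(
  y    = rnorm(100),
  x    = runif(100, min = 1, max = 10),
  city = factor(c(rep("a", 60), rep("b", 37), rep("c", 2), "d"))
)

rec <- recipe(y ~ ., data = toy) |>
  # Box-Cox first, while x is still strictly positive (before any centering)
  step_BoxCox(x) |>
  # Lump levels seen in under 5% of rows into "other" before making dummies
  step_other(city, threshold = 0.05) |>
  # Convert the factor to numeric indicator columns
  step_dummy(city) |>
  # Interactions are built from the dummy columns, not the original factor
  step_interact(~ starts_with("city_"):x) |>
  # Center and scale once every predictor is numeric
  step_normalize(all_numeric_predictors())

baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

Moving step_normalize ahead of step_BoxCox would center x around zero, so roughly half of its values would be negative by the time the Box-Cox step tried to use them.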
While your project’s needs may vary, here is a suggested order of potential steps that should work for most problems:
- Impute
- Individual transformations for skewness and other issues
- Discretize (if needed and if you have no other choice)
- Create dummy variables
- Create interactions
- Normalization steps (center, scale, range, etc.)
- Multivariate transformation (e.g., PCA, spatial sign)
Again, your mileage may vary for your particular problem.
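As a rough sketch of the suggested order, the recipe below chains one step from most of the stages listed (discretization is skipped). The data frame dat and its columns are hypothetical, and the particular steps chosen (median imputation, Yeo-Johnson, two principal components) are just one plausible selection, not a recommendation from the recipes documentation.

```r
library(recipes)

# Hypothetical data: a skewed predictor, a predictor with missing values,
# and a three-level factor
set.seed(2)
n <- 200
dat <- data.frame(
  y  = rnorm(n),
  x1 = rexp(n),
  x2 = rnorm(n, mean = 5),
  g  = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
dat$x2[c(3, 17)] <- NA

rec <- recipe(y ~ ., data = dat) |>
  step_impute_median(all_numeric_predictors()) |>      # 1. impute
  step_YeoJohnson(all_numeric_predictors()) |>         # 2. fix skewness
  step_dummy(all_nominal_predictors()) |>              # 3. dummy variables
  step_interact(~ x1:x2) |>                            # 4. interactions
  step_normalize(all_numeric_predictors()) |>          # 5. center and scale
  step_pca(all_numeric_predictors(), num_comp = 2)     # 6. multivariate

out <- bake(prep(rec), new_data = NULL)
names(out)
```

Normalizing immediately before step_pca matters because PCA is sensitive to the scale of its inputs, which is why the normalization stage sits just ahead of the multivariate one.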