Introduction to Formulaic

Authors: David Shilane, Anderson Nelson, Caffrey Lee and Zoe Huang

2020-05-04

Introduction

Across a wide variety of statistical techniques and machine learning algorithms, R’s formula object provides a standardized process for specifying the outcomes and inputs to be utilized when a method is applied to a data set. In typical examples, e.g. R’s help file for formula objects, a model is specified in a manual way with a formula such as y ~ a + b + c. For parsimonious models specified by a programmer, a manual selection and entry can be sufficient. However, a variety of applications can present more challenging circumstances in which manual specification may not be an effective strategy. Dynamically generated models may be specified by the user of a graphical interface (e.g. with R’s shiny package). In this case, a programmatic means of specifying a formula based on the user’s selections would be necessary. Even in manual settings, formula objects would benefit from additional quality checks that ensure that the model’s specification is appropriate for the data provided.

The formulaic package has two main functions, formulaic::create.formula and formulaic::reduce.existing.formula, and one subsidiary function, formulaic::add.backtick. The package is intended to help users build robust models more quickly and conveniently.

formulaic::create.formula automatically creates a formula from a provided list of input variables and an outcome variable. The variables undergo a series of qualification tests, such as removing typos, duplicates, variables that lack contrast, and categorical variables with too many categories, to make sure that each feature is usable for modeling. This reduces the time needed to build a model and frees users from manually typing out the variables. The resulting formula can be used with a wide range of methods, from simple linear regression to more complex machine learning techniques such as random forests and neural networks.

The principal advantages of using formulaic::create.formula are as follows:

  1. Being able to dynamically generate a formula from a vector of inputs, without necessarily having to spell them all out by name.

  2. Adding variables by searching for patterns.

  3. Simple integration of interactions.

  4. Easy removal of specific variables.

  5. Quality checks that resolve a variety of issues – typos, duplication, lack of contrast, etc. – while providing a transparent explanation.

formulaic::reduce.existing.formula trims down an existing formula. Users pass an existing formula into the function, and it undergoes the same tests as in formulaic::create.formula.

formulaic::add.backtick applies backticks to variable names. By default, backticks are added only to the names that need them in order to be used in a formula. Users can also add backticks to all of the variables, although this is not necessary.

Formulaic is useful for creating a dynamic formula with multiple features. It both reduces the time required for modeling and implementation and improves the quality of the result.

awareness.name = "Awareness"
variable.names = c("Age", "Gender", "Income Group", "Region", "Persona", "Typo")

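# Note that input.names receives the quoted string 'variable.names' rather than
# the vector defined above. That string is not a column of snack.dat, so it is
# dropped and the formula printed below contains only the intercept. Passing
# input.names = variable.names (unquoted) would use the intended variables.
ex.form <-
  formulaic::create.formula(outcome.name = awareness.name,
                 input.names = 'variable.names',
                 dat = snack.dat)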

ex.form$formula
#> Awareness ~ 1
#> <environment: 0x000000001ea9be38>
lm_example <- lm(formula = ex.form, data = snack.dat)
summary(lm_example)
#> 
#> Call:
#> lm(formula = ex.form, data = snack.dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.5258 -0.5258  0.4742  0.4742  0.4742 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 0.525826   0.003293   159.7   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4993 on 22999 degrees of freedom

Dynamic Generation of a Formula

A formula object may be one component of a larger system of software that processes data, generates models, and reports information. Dynamic applications with user interfaces, such as those generated with the shiny package, can allow a user to specify many of the parameters. This may include the type of model to fit, the outcome and input variables, and filters on the subset of the data to incorporate.

In this application, the user is provided with a wide array of choices. A variety of outcomes related to customer engagement may be modeled. The user can select a subset of data related to a specific brand or aggregate multiple brands together. The user may also choose from a menu of inputs spanning all relevant columns of the data set. Then these data can be filtered into specific subgroups based on selections across a number of variables, including age groups, gender, income groups, region, etc.

Because the user’s selections are dynamic, the modeling formula must be generated programmatically. The formulaic::create.formula function includes parameters for the outcome.name (a character vector of length 1) and the input.names (a character vector of any length). If the user provides specific selections, then formulaic::create.formula will automatically generate the corresponding formula object.
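As a rough illustration of what generating a formula programmatically involves, the following base R sketch pastes an outcome and a vector of inputs into a formula. It is not the package's implementation and performs none of its quality checks. The selections shown are hypothetical stand-ins for what a shiny interface might supply.

```r
# Hypothetical user selections, e.g. as they might arrive from a shiny input.
outcome.name <- "Awareness"
input.names <- c("Age", "Gender", "Region")

# Base R sketch: paste the pieces into text, then convert the text to a formula.
rhs <- paste(input.names, collapse = " + ")
dynamic.form <- stats::as.formula(paste(outcome.name, "~", rhs))

print(dynamic.form)
```

With formulaic, the same selections would instead be passed to formulaic::create.formula through its outcome.name and input.names parameters.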

Dataset (snack.dat)

To illustrate the basic functions of the formulaic package, we generated a dataset named snack.dat.

Formatted as a data.table object, snack.dat contains 23000 observations and 25 columns. These data contain simulated information from a fictionalized marketing survey. In this survey, a progression of questions was asked about the respondents’ awareness, consideration, consumption, satisfaction with, and advocacy for different brands of snack foods. Questions downstream of awareness, consideration, and then consumption would be asked only for those respondents who responded affirmatively to the previous question. Otherwise, the values are missing. Brand Perception questions are rated on a scale from 0 to 10 and indicated with a name starting with the prefix BP_.

Adding Backticks (formulaic::add.backtick)

As a subsidiary function, formulaic::add.backtick is used inside formulaic::create.formula to add backticks to the names of the variables. Formula objects include the names of different variables within a data.frame. When a name contains a space, it must be enclosed in backticks to ensure proper formatting. For instance, if there are three variables called y, x1, and User ID, then a formula written as y ~ x1 + User ID will generate errors due to the space in User ID. Instead, this formula can be properly written as y ~ x1 + `User ID`. It is also acceptable to add backticks to the other names, as in `y` ~ `x1` + `User ID`, but this is not a necessary step. By default, include.backtick is set to ‘as.needed’, which means that the function only adds backticks to the variables that require them. The user may change the option to ‘all’. However, ‘all’ only takes effect when format.as != “formula”, in which case a character object is returned, because a formula object automatically removes unnecessary backticks.

NOTE: In the snack.dat data, User ID, Age Group, and Income Group are the only variables affected by the function when include.backtick is set to ‘as.needed’, while every variable receives backticks when it is set to ‘all’.

This feature is automatically incorporated into the formulaic::create.formula function.

When the output is returned as a formula object, the backticks may only be provided on an as-needed basis. For character objects, either option may be selected.
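The need for backticks can be seen with a small base R sketch. The helper backtick.as.needed below is hypothetical and written only for this illustration. It is not formulaic::add.backtick, and it assumes that a name needs backticks exactly when make.names would alter it.

```r
# Hypothetical helper (not the package's implementation): wrap a name in
# backticks only when it is not already a syntactically valid R name.
backtick.as.needed <- function(x) {
  needs.backtick <- make.names(x) != x
  ifelse(needs.backtick, paste0("`", x, "`"), x)
}

variable.names <- c("x1", "User ID")

# Without backticks, the space in "User ID" makes the text unparseable.
unquoted.fails <- inherits(
  try(stats::as.formula("y ~ x1 + User ID"), silent = TRUE),
  "try-error"
)

# With backticks added as needed, the same text converts cleanly.
rhs <- paste(backtick.as.needed(variable.names), collapse = " + ")
form <- stats::as.formula(paste("y ~", rhs))

print(unquoted.fails)
print(form)
```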

Creating Formula (formulaic::create.formula)

The formulaic::create.formula function is designed to automatically generate formulas from user-specified inputs and output. The range of inputs may include directly specified variables, patterns to search within the names of an associated data.frame, a list of interactions, and a vector of variables to directly exclude from consideration. The method also provides a range of quality checks that can detect issues with the construction of a formula and, at the user’s discretion, automatically remove variables that would otherwise generate errors. These quality checks include formatting variables with backticks, de-duplication, ensuring correspondence with the names of the variables in an associated data.frame, excluding categorical variables that would generate errors due to a lack of contrast or exceed a user-specified threshold for the maximum number of categories, and automatically removing interactions involving variables that should be excluded. When directed by the user, these quality checks can be implemented to effectively reduce a formula to the subset of variables and interactions that would be appropriate for consideration in a statistical model. The output of the function can be formatted as either a formula object or a character.

Parameter description:

  • outcome.name A character value specifying the name of the formula’s outcome variable. In this version, only a single outcome may be included. The first entry of outcome.name will be used to build the formula.

  • input.names A character vector of input variable names, each written out in full.

  • input.patterns Includes additional input variables. The user may enter patterns – e.g. to include every variable with a name that includes the pattern. Multiple patterns may be included as a character vector. However, each pattern may not contain spaces and is otherwise subject to the same limits on patterns as used in the grep function.

  • dat A data.frame object that will be used to remove any variables that are not listed in names(dat). By default it is set to NULL. In this case, the formula is created simply from the outcome.name and input.names.

  • interactions A list of character vectors. Each character vector includes the names of the variables that form a single interaction. Specifying interactions = list(c("x", "y"), c("x", "z"), c("y", "z"), c("x", "y", "z")) would lead to the interactions x*y + x*z + y*z + x*y*z.

  • force.main.effects A logical value. When TRUE, the intent is that any term included as an interaction (of multiple variables) must also be listed individually as a main effect.

  • reduce A logical value. When dat is not NULL and reduce is TRUE, additional quality checks are performed to examine the input variables. Any input variables that exhibit a lack of contrast will be excluded from the model. This search is global by default but may be conducted separately in subsets of the outcome variables by specifying max.outcome.categories.to.search. Additionally, any input variables that exhibit too many contrasts, as defined by max.input.categories, will also be excluded.

  • max.input.categories A numeric value limiting the number of categories that a categorical input variable may have. When reduce = TRUE, inputs with more categories than this threshold are excluded from the formula. The default is 20, and users may change it as needed.

  • max.outcome.categories.to.search A numeric value. The formulaic::create.formula function includes a feature that identifies input variables exhibiting a lack of contrast. When reduce = TRUE, these variables are automatically excluded from the resulting formula. This search may be expanded to subsets of the outcome when the number of unique measured values of the outcome is no greater than max.outcome.categories.to.search. In this case, each subset of the outcome will be separately examined, and any inputs that exhibit a lack of contrast within at least one subset will be excluded.

  • order.as The user can specify the order of the input variables in the formula in a variety of ways for patterns: increasing for increasing alphabetical order, decreasing for decreasing alphabetical order, column.order for the order in which they appear in the data, and as.specified for maintaining the user’s specified order.

  • include.backtick Controls when backticks are added. The default, ‘as.needed’, adds backticks only when they are needed. The other option is ‘all’. The use of include.backtick = “all” is limited to cases in which the output is generated as a character variable, that is, when format.as != “formula”. When the output is generated as a formula object, R automatically removes all unnecessary backticks.

  • format.as The data type of the output. If not set as “formula”, then a character vector will be returned.

  • variables.to.exclude A character vector. Any variable specified in variables.to.exclude will be dropped from the formula, both in the individual inputs and in any associated interactions. This step supersedes the inclusion of any variables specified for inclusion in the other parameters.

  • include.intercept A logical value. When FALSE, the intercept will be removed from the formula.
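The kind of screening described for reduce and max.input.categories can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's own logic: the toy data, the threshold of 3, and the variable names are all hypothetical, and missing values are ignored.

```r
# Toy data: 'constant' has a single level (no contrast), 'group' has two
# levels, and 'id' has as many levels as there are rows.
dat <- data.frame(
  outcome = c(1, 0, 1, 0, 1, 0),
  constant = rep("a", 6),
  group = c("a", "b", "a", "b", "a", "b"),
  id = as.character(1:6),
  stringsAsFactors = TRUE
)
input.names <- c("constant", "group", "id")
max.input.categories <- 3  # hypothetical threshold for this toy example

# Count the distinct values of each candidate input.
n.unique <- vapply(dat[input.names], function(x) length(unique(x)), integer(1))

# Flag inputs with fewer than two distinct values (lack of contrast) and
# categorical inputs with more categories than the threshold.
lack.of.contrast <- n.unique < 2
is.categorical <- vapply(
  dat[input.names],
  function(x) is.factor(x) || is.character(x),
  logical(1)
)
too.many.categories <- is.categorical & n.unique > max.input.categories

kept <- input.names[!lack.of.contrast & !too.many.categories]
print(kept)
```

Only group survives in this toy example. A single-level factor such as constant would otherwise cause lm to stop with an error about contrasts.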

Creating Interactions of Variables

The function allows users to incorporate interaction terms easily with the interactions parameter. Each interaction is specified as a character vector, and the entire range of interactions is entered as a list, which allows different interactions to include different numbers of variables.
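To show how a list of character vectors maps onto interaction terms, here is a minimal base R sketch. It only mimics the idea and is not how formulaic builds its terms. The variable names x, y, z and the outcome name are placeholders.

```r
# Each list element names the variables in one interaction.
interactions <- list(c("x", "y"), c("x", "z"), c("y", "z"), c("x", "y", "z"))

# Join the variables of each interaction with "*".
interaction.terms <- vapply(
  interactions,
  function(v) paste(v, collapse = "*"),
  character(1)
)
print(interaction.terms)

# Combine the terms into a formula with a placeholder outcome name.
form <- stats::as.formula(
  paste("outcome ~", paste(interaction.terms, collapse = " + "))
)

# In R's formula syntax, "*" expands to main effects plus interactions, so
# the expanded terms are the three main effects, the three two-way
# interactions, and the three-way interaction.
expanded.terms <- attr(stats::terms(form), "term.labels")
print(expanded.terms)
```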

Selecting Variables from Patterns

Large data sets may include classes of variables that are identified with a common pattern within their names. Rather than including each variable individually, it can be helpful to programmatically identify all of the variables that correspond to a specific pattern. For instance, the brand perception variables in the snack.dat dataset all share the prefix BP_.

When a set of patterns is specified with the input.patterns parameter, the formulaic::create.formula function identifies any variable that includes at least one of these patterns for inclusion in the formula. In order to do so, the user must also specify the data to be searched. Consider the example below:

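# input.names as implied by the inclusion table printed below
input.names = c("Age", "Gender", "Income", "Region", "Persona", "Typo")
bp.pattern = "BP_"
input.patterns = c("Gend", bp.pattern)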

pattern.form <-
  formulaic::create.formula(
    outcome.name = awareness.name,
    input.names = input.names,
    dat = snack.dat,
    input.patterns = input.patterns
  )

print(pattern.form)
#> $formula
#> Awareness ~ Age + Gender + Income + Region + Persona + BP_For_Me_0_10 + 
#>     BP_Fits_Budget_0_10 + BP_Tastes_Great_0_10 + BP_Good_To_Share_0_10 + 
#>     BP_Like_Logo_0_10 + BP_Special_Occasions_0_10 + BP_Everyday_Snack_0_10 + 
#>     BP_Healthy_0_10 + BP_Delicious_0_10 + BP_Right_Amount_0_10 + 
#>     BP_Relaxing_0_10
#> <environment: 0x000000001e6f32b0>
#> 
#> $inclusion.table
#>                      variable   class order specified.from
#>  1:                       Age integer     1    input.names
#>  2:                    Gender  factor     2    input.names
#>  3:                    Income numeric     3    input.names
#>  4:                    Region  factor     4    input.names
#>  5:                   Persona  factor     5    input.names
#>  6:                      Typo    <NA>     6    input.names
#>  7:            BP_For_Me_0_10 integer     7 input.patterns
#>  8:       BP_Fits_Budget_0_10 integer     8 input.patterns
#>  9:      BP_Tastes_Great_0_10 integer     9 input.patterns
#> 10:     BP_Good_To_Share_0_10 integer    10 input.patterns
#> 11:         BP_Like_Logo_0_10 integer    11 input.patterns
#> 12: BP_Special_Occasions_0_10 integer    12 input.patterns
#> 13:    BP_Everyday_Snack_0_10 integer    13 input.patterns
#> 14:           BP_Healthy_0_10 integer    14 input.patterns
#> 15:         BP_Delicious_0_10 integer    15 input.patterns
#> 16:      BP_Right_Amount_0_10 integer    16 input.patterns
#> 17:          BP_Relaxing_0_10 integer    17 input.patterns
#>     exclude.user.specified exclude.not.in.names.dat
#>  1:                  FALSE                    FALSE
#>  2:                  FALSE                    FALSE
#>  3:                  FALSE                    FALSE
#>  4:                  FALSE                    FALSE
#>  5:                  FALSE                    FALSE
#>  6:                  FALSE                     TRUE
#>  7:                  FALSE                    FALSE
#>  8:                  FALSE                    FALSE
#>  9:                  FALSE                    FALSE
#> 10:                  FALSE                    FALSE
#> 11:                  FALSE                    FALSE
#> 12:                  FALSE                    FALSE
#> 13:                  FALSE                    FALSE
#> 14:                  FALSE                    FALSE
#> 15:                  FALSE                    FALSE
#> 16:                  FALSE                    FALSE
#> 17:                  FALSE                    FALSE
#>     exclude.matches.outcome.name include.variable
#>  1:                        FALSE             TRUE
#>  2:                        FALSE             TRUE
#>  3:                        FALSE             TRUE
#>  4:                        FALSE             TRUE
#>  5:                        FALSE             TRUE
#>  6:                        FALSE            FALSE
#>  7:                        FALSE             TRUE
#>  8:                        FALSE             TRUE
#>  9:                        FALSE             TRUE
#> 10:                        FALSE             TRUE
#> 11:                        FALSE             TRUE
#> 12:                        FALSE             TRUE
#> 13:                        FALSE             TRUE
#> 14:                        FALSE             TRUE
#> 15:                        FALSE             TRUE
#> 16:                        FALSE             TRUE
#> 17:                        FALSE             TRUE
#> 
#> $interactions.table
#> Empty data.table (0 rows and 2 cols): interactions,include.interaction

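In this example, Age, Gender, Income, Region, and Persona were directly specified by the user through input.names. The first pattern, “Gend”, matches Gender, which was already included and is therefore not repeated. All of the brand perceptions were then selected based on the second pattern, “BP_”. The misspelled name Typo does not appear in names(snack.dat), so it was excluded from the formula.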
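The pattern search itself can be approximated with grep, whose pattern rules the input.patterns parameter is described as following. The sketch below is an assumption about the general approach, not the package's code, and it uses a short hypothetical vector of column names in place of names(snack.dat).

```r
# Hypothetical stand-in for names(snack.dat).
column.names <- c("Age", "Gender", "BP_For_Me_0_10", "BP_Healthy_0_10", "Awareness")
input.patterns <- c("Gend", "BP_")

# Collect every column name that matches at least one pattern, dropping
# any name that is matched more than once.
matched <- unique(unlist(lapply(
  input.patterns,
  function(p) grep(p, column.names, value = TRUE)
)))

print(matched)
```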

Selecting All of the Variables

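In R formulas, a period on the right-hand side (e.g. y ~ .) stands for all of the other variables in the data. The formulaic::create.formula function maintains this capability when “.” is included in the input.names and a data set is provided: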

dot.form.1 <-
  formulaic::create.formula(outcome.name = awareness.name,
                 input.names = ".",
                 dat = snack.dat)

print(dot.form.1)
#> $formula
#> Awareness ~ `User ID` + Age + Gender + Income + Region + Persona + 
#>     Product + BP_For_Me_0_10 + BP_Fits_Budget_0_10 + BP_Tastes_Great_0_10 + 
#>     BP_Good_To_Share_0_10 + BP_Like_Logo_0_10 + BP_Special_Occasions_0_10 + 
#>     BP_Everyday_Snack_0_10 + BP_Healthy_0_10 + BP_Delicious_0_10 + 
#>     BP_Right_Amount_0_10 + BP_Relaxing_0_10 + Consideration + 
#>     Consumption + Satisfaction + Advocacy + `Age Group` + `Income Group`
#> <environment: 0x000000001f3fd0c0>
#> 
#> $inclusion.table
#>                      variable   class order specified.from
#>  1:                   User ID  factor     1    input.names
#>  2:                       Age integer     2    input.names
#>  3:                    Gender  factor     3    input.names
#>  4:                    Income numeric     4    input.names
#>  5:                    Region  factor     5    input.names
#>  6:                   Persona  factor     6    input.names
#>  7:                   Product  factor     7    input.names
#>  8:                 Awareness integer     8    input.names
#>  9:            BP_For_Me_0_10 integer     9    input.names
#> 10:       BP_Fits_Budget_0_10 integer    10    input.names
#> 11:      BP_Tastes_Great_0_10 integer    11    input.names
#> 12:     BP_Good_To_Share_0_10 integer    12    input.names
#> 13:         BP_Like_Logo_0_10 integer    13    input.names
#> 14: BP_Special_Occasions_0_10 integer    14    input.names
#> 15:    BP_Everyday_Snack_0_10 integer    15    input.names
#> 16:           BP_Healthy_0_10 integer    16    input.names
#> 17:         BP_Delicious_0_10 integer    17    input.names
#> 18:      BP_Right_Amount_0_10 integer    18    input.names
#> 19:          BP_Relaxing_0_10 integer    19    input.names
#> 20:             Consideration integer    20    input.names
#> 21:               Consumption integer    21    input.names
#> 22:              Satisfaction integer    22    input.names
#> 23:                  Advocacy integer    23    input.names
#> 24:                 Age Group  factor    24    input.names
#> 25:              Income Group  factor    25    input.names
#>                      variable   class order specified.from
#>     exclude.user.specified exclude.not.in.names.dat
#>  1:                  FALSE                    FALSE
#>  2:                  FALSE                    FALSE
#>  3:                  FALSE                    FALSE
#>  4:                  FALSE                    FALSE
#>  5:                  FALSE                    FALSE
#>  6:                  FALSE                    FALSE
#>  7:                  FALSE                    FALSE
#>  8:                  FALSE                    FALSE
#>  9:                  FALSE                    FALSE
#> 10:                  FALSE                    FALSE
#> 11:                  FALSE                    FALSE
#> 12:                  FALSE                    FALSE
#> 13:                  FALSE                    FALSE
#> 14:                  FALSE                    FALSE
#> 15:                  FALSE                    FALSE
#> 16:                  FALSE                    FALSE
#> 17:                  FALSE                    FALSE
#> 18:                  FALSE                    FALSE
#> 19:                  FALSE                    FALSE
#> 20:                  FALSE                    FALSE
#> 21:                  FALSE                    FALSE
#> 22:                  FALSE                    FALSE
#> 23:                  FALSE                    FALSE
#> 24:                  FALSE                    FALSE
#> 25:                  FALSE                    FALSE
#>     exclude.user.specified exclude.not.in.names.dat
#>     exclude.matches.outcome.name include.variable
#>  1:                        FALSE             TRUE
#>  2:                        FALSE             TRUE
#>  3:                        FALSE             TRUE
#>  4:                        FALSE             TRUE
#>  5:                        FALSE             TRUE
#>  6:                        FALSE             TRUE
#>  7:                        FALSE             TRUE
#>  8:                         TRUE            FALSE
#>  9:                        FALSE             TRUE
#> 10:                        FALSE             TRUE
#> 11:                        FALSE             TRUE
#> 12:                        FALSE             TRUE
#> 13:                        FALSE             TRUE
#> 14:                        FALSE             TRUE
#> 15:                        FALSE             TRUE
#> 16:                        FALSE             TRUE
#> 17:                        FALSE             TRUE
#> 18:                        FALSE             TRUE
#> 19:                        FALSE             TRUE
#> 20:                        FALSE             TRUE
#> 21:                        FALSE             TRUE
#> 22:                        FALSE             TRUE
#> 23:                        FALSE             TRUE
#> 24:                        FALSE             TRUE
#> 25:                        FALSE             TRUE
#>     exclude.matches.outcome.name include.variable
#> 
#> $interactions.table
#> Empty data.table (0 rows and 2 cols): interactions,include.interaction

Although it is unnecessary, the user may add another variable alongside “.”, as the following example demonstrates. formulaic::create.formula will handle the duplicated variable, here “Gender”, and incorporate the variables that pass the quality checks:


input.names = c("Gender", ".")

dot.form.2 <- formulaic::create.formula(outcome.name = awareness.name, input.names = input.names, dat = snack.dat)

print(dot.form.2)
#> $formula
#> Awareness ~ Gender + `User ID` + Age + Income + Region + Persona + 
#>     Product + BP_For_Me_0_10 + BP_Fits_Budget_0_10 + BP_Tastes_Great_0_10 + 
#>     BP_Good_To_Share_0_10 + BP_Like_Logo_0_10 + BP_Special_Occasions_0_10 + 
#>     BP_Everyday_Snack_0_10 + BP_Healthy_0_10 + BP_Delicious_0_10 + 
#>     BP_Right_Amount_0_10 + BP_Relaxing_0_10 + Consideration + 
#>     Consumption + Satisfaction + Advocacy + `Age Group` + `Income Group`
#> <environment: 0x000000001d395a20>
#> 
#> $inclusion.table
#>                      variable   class order specified.from
#>  1:                    Gender  factor     1    input.names
#>  2:                   User ID  factor     2    input.names
#>  3:                       Age integer     3    input.names
#>  4:                    Income numeric     4    input.names
#>  5:                    Region  factor     5    input.names
#>  6:                   Persona  factor     6    input.names
#>  7:                   Product  factor     7    input.names
#>  8:                 Awareness integer     8    input.names
#>  9:            BP_For_Me_0_10 integer     9    input.names
#> 10:       BP_Fits_Budget_0_10 integer    10    input.names
#> 11:      BP_Tastes_Great_0_10 integer    11    input.names
#> 12:     BP_Good_To_Share_0_10 integer    12    input.names
#> 13:         BP_Like_Logo_0_10 integer    13    input.names
#> 14: BP_Special_Occasions_0_10 integer    14    input.names
#> 15:    BP_Everyday_Snack_0_10 integer    15    input.names
#> 16:           BP_Healthy_0_10 integer    16    input.names
#> 17:         BP_Delicious_0_10 integer    17    input.names
#> 18:      BP_Right_Amount_0_10 integer    18    input.names
#> 19:          BP_Relaxing_0_10 integer    19    input.names
#> 20:             Consideration integer    20    input.names
#> 21:               Consumption integer    21    input.names
#> 22:              Satisfaction integer    22    input.names
#> 23:                  Advocacy integer    23    input.names
#> 24:                 Age Group  factor    24    input.names
#> 25:              Income Group  factor    25    input.names
#>                      variable   class order specified.from
#>     exclude.user.specified exclude.not.in.names.dat
#>  1:                  FALSE                    FALSE
#>  2:                  FALSE                    FALSE
#>  3:                  FALSE                    FALSE
#>  4:                  FALSE                    FALSE
#>  5:                  FALSE                    FALSE
#>  6:                  FALSE                    FALSE
#>  7:                  FALSE                    FALSE
#>  8:                  FALSE                    FALSE
#>  9:                  FALSE                    FALSE
#> 10:                  FALSE                    FALSE
#> 11:                  FALSE                    FALSE
#> 12:                  FALSE                    FALSE
#> 13:                  FALSE                    FALSE
#> 14:                  FALSE                    FALSE
#> 15:                  FALSE                    FALSE
#> 16:                  FALSE                    FALSE
#> 17:                  FALSE                    FALSE
#> 18:                  FALSE                    FALSE
#> 19:                  FALSE                    FALSE
#> 20:                  FALSE                    FALSE
#> 21:                  FALSE                    FALSE
#> 22:                  FALSE                    FALSE
#> 23:                  FALSE                    FALSE
#> 24:                  FALSE                    FALSE
#> 25:                  FALSE                    FALSE
#>     exclude.user.specified exclude.not.in.names.dat
#>     exclude.matches.outcome.name include.variable
#>  1:                        FALSE             TRUE
#>  2:                        FALSE             TRUE
#>  3:                        FALSE             TRUE
#>  4:                        FALSE             TRUE
#>  5:                        FALSE             TRUE
#>  6:                        FALSE             TRUE
#>  7:                        FALSE             TRUE
#>  8:                         TRUE            FALSE
#>  9:                        FALSE             TRUE
#> 10:                        FALSE             TRUE
#> 11:                        FALSE             TRUE
#> 12:                        FALSE             TRUE
#> 13:                        FALSE             TRUE
#> 14:                        FALSE             TRUE
#> 15:                        FALSE             TRUE
#> 16:                        FALSE             TRUE
#> 17:                        FALSE             TRUE
#> 18:                        FALSE             TRUE
#> 19:                        FALSE             TRUE
#> 20:                        FALSE             TRUE
#> 21:                        FALSE             TRUE
#> 22:                        FALSE             TRUE
#> 23:                        FALSE             TRUE
#> 24:                        FALSE             TRUE
#> 25:                        FALSE             TRUE
#>     exclude.matches.outcome.name include.variable
#> 
#> $interactions.table
#> Empty data.table (0 rows and 2 cols): interactions,include.interaction
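The handling of “.” together with duplicated, misspelled, and outcome-matching names can be imitated in a few lines of base R. This is a simplified sketch under assumed rules (expand “.” in place, keep the first occurrence of each name, drop names not found in the data, drop the outcome). It is not taken from the package source, and the column names are a hypothetical subset.

```r
# Hypothetical stand-in for names(snack.dat).
column.names <- c("User ID", "Age", "Gender", "Awareness")
outcome.name <- "Awareness"
input.names <- c("Gender", ".", "Typo")

# Expand "." into every column name, keeping the user's ordering.
expanded <- unlist(lapply(
  input.names,
  function(nm) if (nm == ".") column.names else nm
))

# Keep the first occurrence of each name, then drop names that are not
# columns of the data and the outcome itself.
expanded <- unique(expanded)
kept <- expanded[expanded %in% column.names & expanded != outcome.name]

print(kept)
```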

Also, if the user adds another variable but misspells it, so that it is not a column name of the dataset, formulaic::create.formula will drop the misspelled variable, here “Typo”, and incorporate the variables that pass the quality checks, as the following example shows:


input.names = c("Typo", ".")

dot.form.2 <- formulaic::create.formula(outcome.name = awareness.name, input.names = input.names, dat = snack.dat)

print(dot.form.2)
#> $formula
#> Awareness ~ `User ID` + Age + Gender + Income + Region + Persona + 
#>     Product + BP_For_Me_0_10 + BP_Fits_Budget_0_10 + BP_Tastes_Great_0_10 + 
#>     BP_Good_To_Share_0_10 + BP_Like_Logo_0_10 + BP_Special_Occasions_0_10 + 
#>     BP_Everyday_Snack_0_10 + BP_Healthy_0_10 + BP_Delicious_0_10 + 
#>     BP_Right_Amount_0_10 + BP_Relaxing_0_10 + Consideration + 
#>     Consumption + Satisfaction + Advocacy + `Age Group` + `Income Group`
#> <environment: 0x000000001dd7f218>
#> 
#> $inclusion.table
#>                      variable   class order specified.from
#>  1:                      Typo    <NA>     1    input.names
#>  2:                   User ID  factor     2    input.names
#>  3:                       Age integer     3    input.names
#>  4:                    Gender  factor     4    input.names
#>  5:                    Income numeric     5    input.names
#>  6:                    Region  factor     6    input.names
#>  7:                   Persona  factor     7    input.names
#>  8:                   Product  factor     8    input.names
#>  9:                 Awareness integer     9    input.names
#> 10:            BP_For_Me_0_10 integer    10    input.names
#> 11:       BP_Fits_Budget_0_10 integer    11    input.names
#> 12:      BP_Tastes_Great_0_10 integer    12    input.names
#> 13:     BP_Good_To_Share_0_10 integer    13    input.names
#> 14:         BP_Like_Logo_0_10 integer    14    input.names
#> 15: BP_Special_Occasions_0_10 integer    15    input.names
#> 16:    BP_Everyday_Snack_0_10 integer    16    input.names
#> 17:           BP_Healthy_0_10 integer    17    input.names
#> 18:         BP_Delicious_0_10 integer    18    input.names
#> 19:      BP_Right_Amount_0_10 integer    19    input.names
#> 20:          BP_Relaxing_0_10 integer    20    input.names
#> 21:             Consideration integer    21    input.names
#> 22:               Consumption integer    22    input.names
#> 23:              Satisfaction integer    23    input.names
#> 24:                  Advocacy integer    24    input.names
#> 25:                 Age Group  factor    25    input.names
#> 26:              Income Group  factor    26    input.names
#>                      variable   class order specified.from
#>     exclude.user.specified exclude.not.in.names.dat
#>  1:                  FALSE                     TRUE
#>  2:                  FALSE                    FALSE
#>  3:                  FALSE                    FALSE
#>  4:                  FALSE                    FALSE
#>  5:                  FALSE                    FALSE
#>  6:                  FALSE                    FALSE
#>  7:                  FALSE                    FALSE
#>  8:                  FALSE                    FALSE
#>  9:                  FALSE                    FALSE
#> 10:                  FALSE                    FALSE
#> 11:                  FALSE                    FALSE
#> 12:                  FALSE                    FALSE
#> 13:                  FALSE                    FALSE
#> 14:                  FALSE                    FALSE
#> 15:                  FALSE                    FALSE
#> 16:                  FALSE                    FALSE
#> 17:                  FALSE                    FALSE
#> 18:                  FALSE                    FALSE
#> 19:                  FALSE                    FALSE
#> 20:                  FALSE                    FALSE
#> 21:                  FALSE                    FALSE
#> 22:                  FALSE                    FALSE
#> 23:                  FALSE                    FALSE
#> 24:                  FALSE                    FALSE
#> 25:                  FALSE                    FALSE
#> 26:                  FALSE                    FALSE
#>     exclude.user.specified exclude.not.in.names.dat
#>     exclude.matches.outcome.name include.variable
#>  1:                        FALSE            FALSE
#>  2:                        FALSE             TRUE
#>  3:                        FALSE             TRUE
#>  4:                        FALSE             TRUE
#>  5:                        FALSE             TRUE
#>  6:                        FALSE             TRUE
#>  7:                        FALSE             TRUE
#>  8:                        FALSE             TRUE
#>  9:                         TRUE            FALSE
#> 10:                        FALSE             TRUE
#> 11:                        FALSE             TRUE
#> 12:                        FALSE             TRUE
#> 13:                        FALSE             TRUE
#> 14:                        FALSE             TRUE
#> 15:                        FALSE             TRUE
#> 16:                        FALSE             TRUE
#> 17:                        FALSE             TRUE
#> 18:                        FALSE             TRUE
#> 19:                        FALSE             TRUE
#> 20:                        FALSE             TRUE
#> 21:                        FALSE             TRUE
#> 22:                        FALSE             TRUE
#> 23:                        FALSE             TRUE
#> 24:                        FALSE             TRUE
#> 25:                        FALSE             TRUE
#> 26:                        FALSE             TRUE
#>     exclude.matches.outcome.name include.variable
#> 
#> $interactions.table
#> Empty data.table (0 rows and 2 cols): interactions,include.interaction

Removing Specific Variables

With multiple ways to specify the variables to include in a formula, it can also be helpful to ensure that a specific variable is not included. As an example, when utilizing input.patterns to include all of the brand perception variables, we can specifically remove BP_Delicious_0_10 and Gender by specifying the variables.to.exclude parameter. This parameter supersedes any variables mentioned in input.names or matched by input.patterns, and it also removes any interactions that involve an excluded variable:

input.names <-
  c("Age",
    "Gender",
    "Income",
    "Region",
    "Persona",
    "Typo",
    "Age Group")
interactions <-
  list(
    c("Age", "Gender"),
    c("Age", "Income"),
    c("Age", "Gender", "Income"),
    c("Gender", "Inco"),
    c("Age", "Reg ion")
  )
bp.pattern <- "BP_"
variables.to.exclude <- c("BP_Delicious_0_10", "Gender")

variables.to.exclude.form <-
  formulaic::create.formula(
    outcome.name = awareness.name,
    input.names = input.names,
    interactions = interactions,
    input.patterns = bp.pattern,
    variables.to.exclude = variables.to.exclude,
    dat = snack.dat
  )


print(variables.to.exclude.form)
#> $formula
#> Awareness ~ Age + Income + Region + Persona + `Age Group` + BP_For_Me_0_10 + 
#>     BP_Fits_Budget_0_10 + BP_Tastes_Great_0_10 + BP_Good_To_Share_0_10 + 
#>     BP_Like_Logo_0_10 + BP_Special_Occasions_0_10 + BP_Everyday_Snack_0_10 + 
#>     BP_Healthy_0_10 + BP_Right_Amount_0_10 + BP_Relaxing_0_10 + 
#>     Age * Income
#> <environment: 0x000000001e6286e8>
#> 
#> $inclusion.table
#>                      variable   class order specified.from
#>  1:                       Age integer     1    input.names
#>  2:                    Gender  factor     2    input.names
#>  3:                    Income numeric     3    input.names
#>  4:                    Region  factor     4    input.names
#>  5:                   Persona  factor     5    input.names
#>  6:                      Typo    <NA>     6    input.names
#>  7:                 Age Group  factor     7    input.names
#>  8:            BP_For_Me_0_10 integer     8 input.patterns
#>  9:       BP_Fits_Budget_0_10 integer     9 input.patterns
#> 10:      BP_Tastes_Great_0_10 integer    10 input.patterns
#> 11:     BP_Good_To_Share_0_10 integer    11 input.patterns
#> 12:         BP_Like_Logo_0_10 integer    12 input.patterns
#> 13: BP_Special_Occasions_0_10 integer    13 input.patterns
#> 14:    BP_Everyday_Snack_0_10 integer    14 input.patterns
#> 15:           BP_Healthy_0_10 integer    15 input.patterns
#> 16:         BP_Delicious_0_10 integer    16 input.patterns
#> 17:      BP_Right_Amount_0_10 integer    17 input.patterns
#> 18:          BP_Relaxing_0_10 integer    18 input.patterns
#> 19:                      Inco    <NA>    19   interactions
#> 20:                   Reg ion    <NA>    20   interactions
#>     exclude.user.specified exclude.not.in.names.dat
#>  1:                  FALSE                    FALSE
#>  2:                   TRUE                    FALSE
#>  3:                  FALSE                    FALSE
#>  4:                  FALSE                    FALSE
#>  5:                  FALSE                    FALSE
#>  6:                  FALSE                     TRUE
#>  7:                  FALSE                    FALSE
#>  8:                  FALSE                    FALSE
#>  9:                  FALSE                    FALSE
#> 10:                  FALSE                    FALSE
#> 11:                  FALSE                    FALSE
#> 12:                  FALSE                    FALSE
#> 13:                  FALSE                    FALSE
#> 14:                  FALSE                    FALSE
#> 15:                  FALSE                    FALSE
#> 16:                   TRUE                    FALSE
#> 17:                  FALSE                    FALSE
#> 18:                  FALSE                    FALSE
#> 19:                  FALSE                     TRUE
#> 20:                  FALSE                     TRUE
#>     exclude.matches.outcome.name include.variable
#>  1:                        FALSE             TRUE
#>  2:                        FALSE            FALSE
#>  3:                        FALSE             TRUE
#>  4:                        FALSE             TRUE
#>  5:                        FALSE             TRUE
#>  6:                        FALSE            FALSE
#>  7:                        FALSE             TRUE
#>  8:                        FALSE             TRUE
#>  9:                        FALSE             TRUE
#> 10:                        FALSE             TRUE
#> 11:                        FALSE             TRUE
#> 12:                        FALSE             TRUE
#> 13:                        FALSE             TRUE
#> 14:                        FALSE             TRUE
#> 15:                        FALSE             TRUE
#> 16:                        FALSE            FALSE
#> 17:                        FALSE             TRUE
#> 18:                        FALSE             TRUE
#> 19:                        FALSE            FALSE
#> 20:                        FALSE            FALSE
#> 
#> $interactions.table
#>             interactions include.interaction
#> 1:          Age * Gender               FALSE
#> 2:          Age * Income                TRUE
#> 3: Age * Gender * Income               FALSE
#> 4:         Gender * Inco               FALSE
#> 5:       Age * `Reg ion`               FALSE

Quality Checks

With the formulaic::create.formula function, the formulaic package devises a range of quality checks that investigate the design of a formula. The degree of quality checks can be controlled by the user at several levels. When the user specifies that quality checks should be performed, the formulaic::create.formula method builds objects called inclusion.table and interactions.table, which form a portion of the method’s output. The inclusion.table object is a data.frame that reports on each variable that was considered for inclusion in the final list of inputs. Ultimately, the inclusion.table object will include a variety of columns, one for each quality check, that each indicates whether a variable should be excluded. Once all of the specified quality checks have been performed, the include.variable column is computed as an overall indicator of whether the specified variable should be included as an input in the formula object.

The interactions.table follows a similar logic. An interaction will be excluded if any of the variables in its components was excluded based on the quality checks in the inclusion.table.
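
The rule for interactions can be sketched in a few lines of base R. This is only an illustration of the logic described above, not the package's implementation; the `inclusion` and `interactions` objects are small hypothetical stand-ins modeled on the earlier output.

```r
# Hypothetical inclusion results for four candidate variables.
inclusion <- data.frame(
  variable = c("Age", "Gender", "Income", "Inco"),
  include.variable = c(TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Each interaction is a character vector of its component variables.
interactions <- list(
  c("Age", "Gender"),
  c("Age", "Income"),
  c("Gender", "Inco")
)

included <- inclusion$variable[inclusion$include.variable]

# An interaction is kept only when every component passed the checks.
include.interaction <- vapply(
  interactions,
  function(components) all(components %in% included),
  logical(1)
)

interactions.table <- data.frame(
  interactions = vapply(interactions, paste, character(1), collapse = " * "),
  include.interaction = include.interaction,
  stringsAsFactors = FALSE
)
print(interactions.table)
```

Here only Age * Income survives, because Gender and Inco were each excluded at the variable level.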

Outcomes as Inputs

Most formula objects would not include the outcome variable as an input. However, when such a formula is constructed, whether by mistake or with intention, there is a lack of consistency in the way many common models handle the situation. Consider, for instance, the formula Income ~ Age + Income. The formulaic::create.formula function automatically drops the outcome variable from the inputs and returns the formula Income ~ Age. The removal is recorded in the exclude.matches.outcome.name column of the inclusion.table.
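
As a minimal base R sketch of this rule (an illustration only, not how the package implements it), the outcome can be removed from the candidate inputs before the formula is assembled. The object names `kept.inputs` and `the.formula` are chosen here for the example.

```r
outcome.name <- "Income"
input.names <- c("Age", "Income")

# Remove the outcome from the candidate inputs before building the formula.
kept.inputs <- setdiff(input.names, outcome.name)

# reformulate() builds a formula from character vectors of term labels.
the.formula <- reformulate(termlabels = kept.inputs, response = outcome.name)
print(the.formula)
```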

Misspecified Variables

A formula object in R can only be supplied to a model when all of its terms directly match the names of the data.frame object on which the model will be fit. Misspecified variables within a formula, such as those arising from typographical errors, will typically lead to error messages in R’s implementation of a model. The formulaic package provides the option to either a) maintain this effect or b) automatically remove any misspecified variables. When a user supplies a dataset to the formulaic::create.formula function, the variables intended for the formula receive a quality check to ensure that they match a corresponding name within the associated data.frame. Misspecified variables will be marked in the inclusion.table portion of the output of the formulaic::create.formula function. Any misspecified variables or associated interactions will be removed from the formula in this setting.
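
The name check itself is simple to sketch in base R. The example below assumes a tiny stand-in data set rather than snack.dat, and the objects `requested`, `check`, and `usable` are hypothetical names used only for illustration.

```r
# A small stand-in data set; the vignette itself uses snack.dat.
dat <- data.frame(Age = c(25, 40, 31), Income = c(50, 72, 64))

requested <- c("Age", "Income", "Typo")

# Flag every requested name that has no matching column in the data.
not.in.names.dat <- !(requested %in% names(dat))
check <- data.frame(
  variable = requested,
  exclude.not.in.names.dat = not.in.names.dat,
  stringsAsFactors = FALSE
)
print(check)

# Only the names that match a column are carried into the formula.
usable <- requested[!not.in.names.dat]
```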

Considerations for Feature Engineering

Selecting appropriate variables for a statistical model can include challenges associated with the domain, methodology, computational considerations, and practical limitations of the data. Some variables may not be suitable for inclusion based on either a lack of contrast or a large number of categories. This section will explore these problems in greater detail. In doing so, we will demonstrate how the formulaic package can automatically identify and handle these issues.

A Lack of Contrast

Statistical models typically estimate the relationship between the outcome and the inputs based upon the impact of changes in the inputs. When a variable is constant across all of its measured values, its variance is zero, and therefore the variable’s correlation with another variable is undefined. A constant input variable therefore exhibits a lack of contrast with regard to estimating its impact on an outcome. Many statistical models in R will return error messages when an input is a constant variable or consists only of missing data. If a large number of such variables are included, then each error message will only identify the first of them, so an iterative process may be required to remove all of the variables with a lack of contrast. Furthermore, even in variables that exhibit variation across the full range of the data, a lack of contrast may yet arise when a model is fit on a subset of these data.

Numeric Variables With No Variation

In the snack.dat survey, consideration is only measured for the respondents who are aware of the product. A model of consideration, estimated on the rows for which this outcome is measured, would therefore only include values of 1 for the respondents’ awareness. A logistic regression that includes awareness as an input would generate a missing value (NA) for the coefficient of awareness.
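
The effect can be reproduced on synthetic data. The sketch below does not use snack.dat; `aware.dat` is a made-up data set in which awareness is constant, standing in for the rows where consideration is measured.

```r
# Synthetic stand-in for the aware respondents: awareness is always 1.
set.seed(1)
n <- 200
aware.dat <- data.frame(
  age = sample(18:80, size = n, replace = TRUE),
  awareness = rep(1L, n)
)
aware.dat$consideration <- rbinom(n, size = 1, prob = 0.4)

fit <- glm(consideration ~ age + awareness, data = aware.dat, family = "binomial")

# The constant column is collinear with the intercept, so its
# coefficient cannot be estimated and is reported as NA.
print(coef(fit))
```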

Because the awareness variable lacks variation in this subset, it is not suitable for use as a predictor of consideration. (It should instead be viewed as a prerequisite.) This matter can be resolved through the use of the reduce parameter in the formulaic::create.formula function. When reduce = TRUE and a dataset is provided for inspection, formulaic::create.formula automatically performs quality checks on all of the potential input variables. Any variables with a lack of variation will be identified and proactively excluded from the formula. Meanwhile, a record of this inspection is provided in the inclusion.table’s output:

Categorical Variables With No Variation

To incorporate categorical variables with k > 1 different measured values, statistical models typically code separate columns of indicator variables across k-1 categories, while the kth category serves as a reference. Without variation (k <= 1), this procedure cannot code any indicator variables. Without a meaningful way to include such a variable as an input, the model will instead generate an error message.

As an example, consider a model generated on the subset of respondents between the ages of 18 and 35 years old. This represents one category of the possible age groups. If a logistic regression model nonetheless attempted to include the age group as an input, this would lead to the following result:

# Not run here: this call stops with an error about a lack of contrast.
glm(
  formula = formula.awareness$formula,
  data = snack.dat[get(age.group.name) == "[ 18, 35)", ],
  family = "binomial"
)

This particular example is designed to demonstrate the issue with a simple contradiction, and its root cause is easy to identify. In real applications, significant investigation may be required to determine which variables may be causing such an effect. The error message provided informs the user of a lack of contrast, but it does not identify which variable is causing the issue. In a formula that incorporates many inputs, there may be a number of different variables that each contribute to the issue.
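
A self-contained version of the same failure can be sketched with synthetic data. The data set `sub.dat` below is hypothetical; it only mimics a subset in which every row falls into a single age group.

```r
set.seed(2)
n <- 100
sub.dat <- data.frame(
  awareness = rbinom(n, size = 1, prob = 0.5),
  age = sample(18:34, size = n, replace = TRUE),
  # Every row falls in the same age group, as in the 18 to 35 subset.
  age.group = factor(rep("[ 18, 35)", n))
)

# Capture the error message instead of stopping the script.
result <- tryCatch(
  glm(awareness ~ age + age.group, data = sub.dat, family = "binomial"),
  error = function(e) conditionMessage(e)
)

# The message reports the contrast problem but names no variable.
print(result)
```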

Within the formulaic package, the formulaic::create.formula’s reduce parameter can be used to automatically identify categorical variables with a lack of contrast. When reduce = TRUE and a data set is provided, inputs with no variation are excluded from the resulting formula. The exclude.lack.contrast column of the output’s inclusion.table identifies which variables include a lack of contrast, and the min.categories column identifies the number of unique values for each variable. This is demonstrated with the call to formulaic::create.formula below:
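
Independently of the package's output, the counting idea behind these columns can be sketched in base R. The data set and the column names `categories` and `lack.contrast` below are illustrative assumptions, not the package's own objects.

```r
dat <- data.frame(
  age = c(22, 30, 27, 34),
  age.group = factor(rep("[ 18, 35)", 4)),
  awareness = c(1L, 1L, 1L, 1L),
  region = factor(c("South", "West", "South", "Midwest"))
)

# Count the distinct non-missing values of each candidate input.
categories <- vapply(dat, function(x) length(unique(x[!is.na(x)])), integer(1))

# Fewer than two distinct values means the input has no contrast.
lack.contrast <- categories < 2

print(data.frame(
  variable = names(dat),
  categories = categories,
  lack.contrast = lack.contrast,
  row.names = NULL
))
```

In this toy data set, age.group and awareness each have a single distinct value and would be flagged, while age and region would be retained.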

A Lack of Contrast within Subsets of the Data

Due to the sequence of survey questions in snack.dat, many of the measured variables for a brand are recorded downstream from the initial question about the respondent’s awareness. These questions are only asked of the respondents who indicate awareness. As shown previously, the values of consideration (1 or 0) only occur when awareness is equal to 1. Across the full range of the data, the consideration variable includes multiple values and exhibits variation. However, within the subgroup of respondents who are not aware of the specific product, all of the values are missing. Due to this structurally missing design, it can be necessary to search for a lack of contrast within subsets defined by the outcome variable. The formulaic::create.formula function allows the user to specify max.outcome.categories.to.search. When the number of unique values of the outcome is less than or equal to the value of max.outcome.categories.to.search, a data set is provided, and reduce = TRUE, then the search for a lack of contrast is extended into the subsets based on the outcome variable.

As an example, consider a model of consideration that attempts to utilize awareness as an input. The consideration outcome has two unique measured values (1 and 0). If max.outcome.categories.to.search = 1, then the subgroups of consideration will not be searched for a lack of contrast. Instead, the only quality check related to variation will examine each variable for a global lack of contrast. In the case of awareness, it exhibits variation at the global level with binary outcomes. This selection is depicted below:

However, if max.outcome.categories.to.search >= 2, then the consideration variable would qualify as having sufficiently few unique values. Then each subset would subsequently be searched for a lack of contrast in each of the possible inputs. When consideration is 1 or 0, the awareness variable is always 1. Therefore, the inclusion.table’s calculation of the min.categories will be reduced from 2 (in the prior example) to 1 (below). As a result, the exclude.lack.contrast entry for the awareness variable will be flipped from FALSE to TRUE, and awareness will be removed from the formula.
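
The subset search can be sketched in base R under the assumptions stated in the text. The helper `count.min.categories` is a hypothetical function written for this illustration, and the data set is synthetic; the package's own calculation may differ in its details.

```r
# Synthetic survey rows: consideration is only measured when awareness is 1.
dat <- data.frame(
  awareness     = c(1, 1, 1, 1, 0, 0),
  consideration = c(1, 0, 1, 0, NA, NA),
  age           = c(25, 41, 33, 58, 47, 29)
)

# Number of distinct non-missing values in a vector.
count.values <- function(x) length(unique(x[!is.na(x)]))

# Smallest number of distinct input values, globally and (when the outcome
# has few enough unique values) within each subset defined by the outcome.
count.min.categories <- function(dat, input.name, outcome.name,
                                 max.outcome.categories.to.search) {
  global <- count.values(dat[[input.name]])
  outcome <- dat[[outcome.name]]
  outcome.values <- unique(outcome[!is.na(outcome)])
  if (length(outcome.values) > max.outcome.categories.to.search) {
    return(global)
  }
  by.subset <- vapply(
    outcome.values,
    function(v) count.values(dat[[input.name]][which(outcome == v)]),
    integer(1)
  )
  min(global, by.subset)
}

# Global check only: awareness shows two values.
print(count.min.categories(dat, "awareness", "consideration", 1))
# Subset search: awareness is constant wherever consideration is measured.
print(count.min.categories(dat, "awareness", "consideration", 2))
```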

A Large Volume of Levels in a Categorical Variable

As previously discussed, a statistical model that incorporates categorical variables with k > 1 unique values will code k-1 separate columns of indicator variables. Variables displaying user-generated text or unique identifiers may have unique values in all or nearly all of the rows of the data set. Large values of k in a single variable can create computational burdens or lead to intractable structures. Models with such a large number of additional columns may run nearly interminably without any indication of the underlying issue or an estimate of the time to completion.

To avoid this issue, the formulaic::create.formula function allows the user to specify max.input.categories. Each categorical variable’s number of levels k is computed at a global level. Any such variable with a value of k greater than max.input.categories is automatically excluded from consideration. This is reflected in the min.categories value and subsequently in the exclude.numerous.categories column of the inclusion.table.

As an example, the snack.dat’s User ID variable is an identifier that indicates which of the 1000 respondents supplied the answers for the given row. Including the User ID in a model would therefore generate 999 columns of indicator variables. When reduce = TRUE, a data set is supplied, and max.input.categories is set at a value below 1000, then the User ID would be automatically excluded from the formula.
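
The threshold rule can be sketched in base R on synthetic data. The data set below is a made-up stand-in with 1000 unique identifiers, and the objects `is.categorical` and `n.levels` are names chosen for this illustration rather than parts of the package.

```r
n <- 1000
dat <- data.frame(
  user.id = factor(paste0("User ", seq_len(n))),
  region = factor(rep(c("Northeast", "South", "Midwest", "West"), length.out = n)),
  age = rep(18:67, length.out = n)
)
max.input.categories <- 20

# Only categorical inputs are compared against the threshold.
is.categorical <- vapply(dat, function(x) is.factor(x) || is.character(x), logical(1))
n.levels <- vapply(dat, function(x) length(unique(x[!is.na(x)])), integer(1))

# Flag categorical inputs whose number of levels exceeds the limit.
exclude.numerous.categories <- is.categorical & n.levels > max.input.categories
print(exclude.numerous.categories)
```

Only user.id is flagged: region has four levels, and age is numeric, so it would enter a model as a single column regardless of how many distinct values it takes.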

Inspection of All Variables

When reduce = TRUE and a data set is supplied, the formulaic::create.formula function provides a range of quality checks and information about the merits of including specific variables as possible inputs in a model. From the list of all of the variables, a user can quickly identify a reduced list for potential inclusion. As an example, we use the snack.dat to show that a model of awareness would need to be limited to a subset of the overall variables:

All of the brand perceptions and other states of engagement were removed from the formula. This was due to a lack of contrast arising from the structurally missing values when the respondents were not aware of the product. Meanwhile, the User ID was removed due to its large number of categories. Only the names of the products and the respondent-specific variables remain. From this list, an investigator could then make selections of which variables to include (e.g. Age or Age Group). However, much of the preliminary investigation would be handled automatically. This is especially helpful in settings in which the full relationship of the variables – such as the sequence and dependencies of the marketing survey’s questions – is not yet fully understood.

Reducing an Existing Formula (formulaic::reduce.existing.formula):

The quality checks and automatic removal of impractical variables can also be accessed when an existing formula has been previously constructed. The formulaic::reduce.existing.formula function was designed for this purpose. This method uses natural language processing techniques to deconstruct the components of a formula. Each variable and interaction is separately identified and aggregated. These variables are then supplied to formulaic::create.formula as the input.names and interactions parameters. Otherwise, the parameters of formulaic::reduce.existing.formula are designed to match those of formulaic::create.formula. As a result, an initial formula can be evaluated in terms of the same set of quality checks, and the formula can be reduced based on the same set of exclusions.
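
To make the deconstruction step concrete, the sketch below splits a formula into an outcome, main effects, and interactions using base R's terms() function. This is a swapped-in technique for illustration; the package's own parsing is not shown here and may work differently.

```r
the.initial.formula <- Awareness ~ Age + Income + Region + Age * Income

# terms() expands the formula; "term.labels" lists each term once.
parsed <- terms(the.initial.formula)
term.labels <- attr(parsed, "term.labels")

# Labels containing ":" are interactions; the rest are main effects.
is.interaction <- grepl(":", term.labels, fixed = TRUE)

outcome.name <- all.vars(the.initial.formula)[1]
input.names <- term.labels[!is.interaction]
interactions <- strsplit(term.labels[is.interaction], split = ":", fixed = TRUE)

print(outcome.name)
print(input.names)
print(interactions)
```

The resulting pieces have the same shape as the outcome.name, input.names, and interactions parameters of formulaic::create.formula. Note that names requiring backticks keep their backticks in the term labels.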

Parameter description:

  • the.initial.formula The existing formula to be evaluated and reduced.
  • dat Data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model.
  • max.input.categories The maximum number of unique categories allowed in a categorical input. Categorical variables with more categories than this limit are excluded from the formula. The default is 20, which the user can change as needed.
  • max.outcome.categories.to.search The maximum number of unique outcome values for which the subsets of the outcome will be searched for a lack of contrast. The default is 4, which the user can change as needed.
  • order.as Controls the order in which the variables appear in the formula (e.g. ascending or descending).
  • include.backtick Specifies whether backticks are added around variable names so that names such as `Age Group` can be used in a formula.
  • format.as The data type of the output. If not set as “formula”, then a character vector will be returned.

As an example, we will demonstrate that a user-supplied formula will produce the same results as that created in the previous section: