Customization within bestNormalize

Ryan A Peterson

2020-06-08

Custom functions with bestNormalize

This vignette will go over the steps required to implement a custom user-defined function within the bestNormalize framework.

There are 3 steps.

  1. Create transformation function

  2. Create predict method for transformation function (that can be applied to new data)

  3. Pass through new function and predict method to bestNormalize

Example: cube-root

S3 methods

Here, we start by defining a new function that we’ll call cuberoot_x, which will take an argument a (as does the sqrt_x function) which will try to add a constant if it sees any negative numbers in x. It will also take the argument standardize which will center and scale the transformed data so that it’s centered at 0 with SD = 1.

Note that we assigned a class to the object this returns of the same name; this is necessary for successful implementation within bestNormalize. We’ll also need an associated predict method that is used to apply the transformation to newly observed data. `

Optional: print method

This will be printed when bestNormalize selects your custom method or when you print an object returned by your new custom function.

Note: if you can find a similar transformation in the source code, it’s easy to model your code after it. For instance, for cuberoot_x and predict.cuberoot_x, I used sqrt_x.R as a template file.

Implementing with bestNormalize

## Best Normalizing transformation with 100 Observations
##  Estimated Normality Statistics (Pearson P / df, lower => more normal):
##  - arcsinh(x): 1.2347
##  - Box-Cox: 1.0267
##  - cuberoot_x: 0.9787
##  - Exp(x): 4.7947
##  - Log_b(x+a): 1.3547
##  - No transform: 2.0027
##  - orderNorm (ORQ): 1.1627
##  - sqrt(x + a): 1.0907
##  - Yeo-Johnson: 1.0987
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## Standardized cuberoot(x + a) Transformation with 100 nonmissing obs.:
##  Relevant statistics:
##  - a = 0 
##  - mean (before standardization) = 0.9588261 
##  - sd (before standardization) = 0.3298665

Evidently, the cube-rooting was the best normalizing transformation!

Sanity check

Is this code actually performing the cube-rooting?

## [1] "Mean relative difference: 1.031018"
## [1] "Mean relative difference: 1.031018"

It does indeed.

Using custom normalization statistics

The bestNormalize package can estimate any univariate statistic using its CV framework. A user-defined function can be passed in through the norm_stat_fn argument, and this function will then be applied in lieu of the Pearson test statistic divided by its degree of freedom.

The user-defined function must take an argument x, which indicates the data on which a user wants to evaluate the statistic.

Here is an example using Lilliefors (Kolmogorov-Smirnov) normality test statistic:

bestNormalize(x, norm_stat_fn = function(x) nortest::lillie.test(x)$stat)
## Best Normalizing transformation with 100 Observations
##  Estimated Normality Statistics (using custom normalization statistic)
##  - arcsinh(x): 0.1958
##  - Box-Cox: 0.1785
##  - Exp(x): 0.3299
##  - Log_b(x+a): 0.1959
##  - No transform: 0.2219
##  - orderNorm (ORQ): 0.186
##  - sqrt(x + a): 0.1829
##  - Yeo-Johnson: 0.1872
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## Standardized Box Cox Transformation with 100 nonmissing obs.:
##  Estimated statistics:
##  - lambda = 0.3281193 
##  - mean (before standardization) = -0.1263882 
##  - sd (before standardization) = 0.9913552

Here is an example using Lillifors (Kolmogorov-Smirnov) normality test’s p-value:

(dont_do_this <- bestNormalize(x, norm_stat_fn = function(x) nortest::lillie.test(x)$p))
## Best Normalizing transformation with 100 Observations
##  Estimated Normality Statistics (using custom normalization statistic)
##  - arcsinh(x): 0.4327
##  - Box-Cox: 0.4831
##  - Exp(x): 0.0675
##  - Log_b(x+a): 0.3589
##  - No transform: 0.2958
##  - orderNorm (ORQ): 0.4492
##  - sqrt(x + a): 0.4899
##  - Yeo-Johnson: 0.4531
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## Standardized exp(x) Transformation with 100 nonmissing obs.:
##  Relevant statistics:
##  - mean (before standardization) = 6.885396 
##  - sd (before standardization) = 13.66084

Note: bestNormalize will attempt to minimize this statistic by default, which is definitely not what you want to do when calculating the p-value. This is seen in the example above, as the WORST normalization transformation is chosen.

In this case, a user is advised to either manually select the best one:

best_transform <- names(which.max(dont_do_this$norm_stats))
(do_this <- dont_do_this$other_transforms[[best_transform]])
## Standardized sqrt(x + a) Transformation with 100 nonmissing obs.:
##  Relevant statistics:
##  - a = 0 
##  - mean (before standardization) = 0.9811849 
##  - sd (before standardization) = 0.4779252

Or, the user can reverse their defined statistic (in this case by subtracting it from 1):

(do_this <- bestNormalize(x, norm_stat_fn = function(x) 1-nortest::lillie.test(x)$p))
## Best Normalizing transformation with 100 Observations
##  Estimated Normality Statistics (using custom normalization statistic)
##  - arcsinh(x): 0.5166
##  - Box-Cox: 0.4191
##  - Exp(x): 0.9601
##  - Log_b(x+a): 0.5338
##  - No transform: 0.6521
##  - orderNorm (ORQ): 0.4646
##  - sqrt(x + a): 0.4475
##  - Yeo-Johnson: 0.4773
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## Standardized Box Cox Transformation with 100 nonmissing obs.:
##  Estimated statistics:
##  - lambda = 0.3281193 
##  - mean (before standardization) = -0.1263882 
##  - sd (before standardization) = 0.9913552