OptimClassifier provides a set of tools for creating models, selecting the best parameter combination for a model, and selecting the best threshold for your binary classification. The package contains tools for the following methods:
| Method | Optim.LM | Optim.GLM | Optim.LMM | Optim.DA | Optim.CART | Optim.NN | Optim.SVM |
|---|---|---|---|---|---|---|---|
| Threshold optimization | ✔ | ✔ | ✔ | ✔* | ✔* | ✔ | ✔ |
| Parameter optimization | ✔ | ✔ | ✔ | ✔* | ✔* | ✔ | ✔ |
| What parameter or option? | Transformations | Family & Links | Random variable | Linear or Quadratic | CP | Hidden layers | Kernels |
*These models are natively classifiers.
The following example shows how to solve a common credit scoring problem with this package and the GLM methodology.
First, we load the dataset. In this example, we use the Australian Credit dataset.
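A minimal setup sketch, assuming the AustralianCredit dataset ships with the package (run `data(package = "OptimClassifier")` to check):

``` r
# Load the package and the bundled example dataset
library(OptimClassifier)
data("AustralianCredit")  # assumed to be shipped with the package

# Y is the binary response we want to classify
str(AustralianCredit)
```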
Then we create a model with the Optim.GLM function (or whichever one you prefer).
``` r
## Create the model
creditscoring <- Optim.GLM(Y~., AustralianCredit, p = 0.7, seed=2018)
#> Warning: Thresholds' criteria not selected. The success rate is defined as the default.
#>
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
```
Now you can print the results of the models:
``` r
### See a ranking of the models tested
print(creditscoring)
#> 7 successful models have been tested and 21 thresholds evaluated
#>
#> Model rmse Threshold success_rate ti_error tii_error
#> 1 poisson(log) 0.3389622 0.50 0.8365385 0.01442308 0.1490385
#> 2 poisson(sqrt) 0.3409605 0.55 0.8365385 0.01442308 0.1490385
#> 3 gaussian 0.3425880 0.60 0.8413462 0.01923077 0.1394231
#> 4 poisson 0.3430554 0.55 0.8365385 0.01442308 0.1490385
#> 5 binomial(cloglog) 0.3519640 0.35 0.8269231 0.01923077 0.1538462
#> 6 binomial(probit) 0.3587217 0.45 0.8221154 0.02403846 0.1538462
#> 7 binomial(logit) 0.3595036 0.35 0.8173077 0.01923077 0.1634615
```
Would you like to see this as a graphic? Try typing `plot(creditscoring)`.
But what about the details (coefficients and other information) of the best model? And the second one in the ranking? They are easy to inspect:
``` r
### Access the summary of the best model
summary(creditscoring)
#>
#> Call:
#> stats::glm(formula = formula, family = rowfamily, data = training,
#> model = FALSE, x = FALSE, y = FALSE)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.73227 -0.08182 -0.00412 0.10131 0.91599
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -8.844e-02 2.761e-01 -0.320 0.749
#> X11 5.292e-05 9.104e-02 0.001 1.000
#> X2 -1.919e-04 4.000e-03 -0.048 0.962
#> X3 -2.322e-03 8.483e-03 -0.274 0.784
#> X42 3.292e-02 9.671e-02 0.340 0.734
#> X43 3.393e-01 7.245e-01 0.468 0.640
#> X52 2.484e-01 5.225e-01 0.475 0.634
#> X53 2.495e-01 5.126e-01 0.487 0.626
#> X54 2.198e-01 4.875e-01 0.451 0.652
#> X55 2.165e-01 6.225e-01 0.348 0.728
#> X56 2.181e-01 5.141e-01 0.424 0.671
#> X57 3.087e-01 5.140e-01 0.601 0.548
#> X58 2.840e-01 4.915e-01 0.578 0.563
#> X59 2.887e-01 5.053e-01 0.571 0.568
#> X510 3.009e-01 5.611e-01 0.536 0.592
#> X511 2.911e-01 4.994e-01 0.583 0.560
#> X512 2.102e-01 9.427e-01 0.223 0.824
#> X513 3.624e-01 4.964e-01 0.730 0.465
#> X514 3.864e-01 5.059e-01 0.764 0.445
#> X62 -9.621e-02 6.912e-01 -0.139 0.889
#> X63 -1.236e-01 7.016e-01 -0.176 0.860
#> X64 -1.320e-01 4.755e-01 -0.278 0.781
#> X65 -1.132e-01 4.977e-01 -0.228 0.820
#> X67 -6.857e-04 6.103e-01 -0.001 0.999
#> X68 -1.445e-01 4.777e-01 -0.302 0.762
#> X69 -2.363e-01 6.580e-01 -0.359 0.720
#> X7 6.848e-03 1.244e-02 0.551 0.582
#> X81 4.490e-01 9.878e-02 4.545 5.48e-06 ***
#> X91 6.807e-02 1.029e-01 0.661 0.508
#> X10 2.183e-03 8.015e-03 0.272 0.785
#> X111 -2.626e-02 7.958e-02 -0.330 0.741
#> X122 3.305e-02 1.626e-01 0.203 0.839
#> X123 4.749e-01 4.169e-01 1.139 0.255
#> X13 -1.577e-04 2.677e-04 -0.589 0.556
#> X14 2.016e-06 7.898e-06 0.255 0.799
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 82.130 on 481 degrees of freedom
#> Residual deviance: 29.559 on 447 degrees of freedom
#> AIC: 1200.4
#>
#> Number of Fisher Scoring iterations: 4
```
``` r
### Access the summary of the second model
summary(creditscoring, 2)
#>
#> Call:
#> stats::glm(formula = formula, family = rowfamily, data = training,
#> model = FALSE, x = FALSE, y = FALSE)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.73617 -0.08578 0.00034 0.11181 0.92131
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 9.519e-01 1.563e-01 6.092 1.12e-09 ***
#> X11 8.678e-04 5.424e-02 0.016 0.987
#> X2 -9.261e-05 2.425e-03 -0.038 0.970
#> X3 -1.450e-03 5.257e-03 -0.276 0.783
#> X42 1.715e-02 5.650e-02 0.304 0.761
#> X43 1.897e-01 4.816e-01 0.394 0.694
#> X52 1.284e-01 3.044e-01 0.422 0.673
#> X53 1.248e-01 2.995e-01 0.417 0.677
#> X54 1.136e-01 2.834e-01 0.401 0.689
#> X55 1.232e-01 3.904e-01 0.315 0.752
#> X56 1.097e-01 3.005e-01 0.365 0.715
#> X57 1.610e-01 3.018e-01 0.533 0.594
#> X58 1.488e-01 2.890e-01 0.515 0.607
#> X59 1.482e-01 2.968e-01 0.500 0.617
#> X510 1.609e-01 3.415e-01 0.471 0.638
#> X511 1.557e-01 2.942e-01 0.529 0.597
#> X512 1.221e-01 6.258e-01 0.195 0.845
#> X513 1.991e-01 2.952e-01 0.674 0.500
#> X514 2.211e-01 3.008e-01 0.735 0.462
#> X62 -5.000e-02 4.211e-01 -0.119 0.905
#> X63 -7.690e-02 4.346e-01 -0.177 0.860
#> X64 -6.665e-02 2.789e-01 -0.239 0.811
#> X65 -5.029e-02 2.933e-01 -0.171 0.864
#> X67 5.441e-03 3.656e-01 0.015 0.988
#> X68 -7.305e-02 2.814e-01 -0.260 0.795
#> X69 -1.321e-01 4.059e-01 -0.326 0.745
#> X7 4.670e-03 8.174e-03 0.571 0.568
#> X81 2.609e-01 5.799e-02 4.499 6.81e-06 ***
#> X91 3.587e-02 6.368e-02 0.563 0.573
#> X10 1.979e-03 5.620e-03 0.352 0.725
#> X111 -1.520e-02 4.805e-02 -0.316 0.752
#> X122 1.698e-02 9.119e-02 0.186 0.852
#> X123 2.728e-01 2.656e-01 1.027 0.304
#> X13 -8.530e-05 1.584e-04 -0.539 0.590
#> X14 1.540e-06 5.360e-06 0.287 0.774
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 82.13 on 481 degrees of freedom
#> Residual deviance: 29.86 on 447 degrees of freedom
#> AIC: 1200.7
#>
#> Number of Fisher Scoring iterations: 4
```
Optimization is the process of adjusting the parameters of your trained model to improve the quality of your classification. Depending on your goals, optimization can involve implementation improvements or changes to your classification model. This package focuses on two questions: the threshold and several model-specific options.
Optimizing your classification model is important when you want to fully achieve its potential. Through optimization, you can improve the root mean square error (RMSE), increase the success rate, or accomplish other goals such as minimizing the type I or type II error.
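For instance, the warning in the example above notes that the success rate is the default threshold criterion. A sketch of targeting a different goal, assuming the criterion is exposed through an argument named `criteria` (the argument name and accepted values are assumptions here; check `?Optim.GLM` for the real interface):

``` r
# Hypothetical: optimize the threshold for type I error rather than
# the default success rate (argument name and value are assumed,
# see ?Optim.GLM)
creditscoring_ti <- Optim.GLM(Y ~ ., AustralianCredit,
                              criteria = "ti.error",
                              p = 0.7, seed = 2018)
```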
Optim.LM applies transformations to the response variable to improve the precision of the linear model. Then the function searches for the best threshold to obtain the best possible result for your goal.
Several transformations of the response variable are included.
Optim.GLM tries different types of error distributions (called a family in R) and several transformations of the data (called a link in R). Then the function searches for the best threshold to obtain the best possible result for your goal.
Models trained with this function, as shown in the ranking above: gaussian, poisson (with identity, log, and sqrt links), and binomial (with logit, probit, and cloglog links).
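These are the same family and link objects used by base R's `glm`; independently of this package, one of the combinations from the ranking above looks like this:

``` r
# Base-R equivalent of one tested combination: binomial errors with a
# complementary log-log link (Y assumed to be a binary 0/1 or factor)
m <- stats::glm(Y ~ ., family = binomial(link = "cloglog"),
                data = AustralianCredit)
summary(m)
```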
Optim.LMM searches for the variable that can be used as a random effect to improve the model precision. Then the function searches for the best threshold to obtain the best possible result for your goal.
Optim.DA tries to train both a Linear and a Quadratic Discriminant Analysis, because sometimes the characteristics of the data make it impossible to train a QDA.
Optim.CART focuses on the pruning process and compares several levels of pruning. For this it uses the complexity parameter (CP), which is the amount by which splitting a node must improve the relative error.
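The CP here is the standard CART complexity parameter as implemented in the rpart package (presumably the machinery behind Optim.CART, though that is an assumption). Independently of this package, you can inspect the pruning levels of a tree like this:

``` r
library(rpart)

# Grow a classification tree and print its CP table: each row is a
# pruning level with the relative error improvement that split buys
tree <- rpart(Y ~ ., data = AustralianCredit, method = "class")
printcp(tree)
```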
Optim.NN searches for the number of hidden layers that improves the model precision. Then the function searches for the best threshold to obtain the best possible result for your goal.
Optim.SVM tries different types of kernels to improve the precision. Then the function searches for the best threshold to obtain the best possible result for your goal.
Several kernel types are tried by this function.
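All of the functions above appear to follow the same calling pattern as the `Optim.GLM` example; a sketch, with the exact signatures assumed to match (see each function's help page):

``` r
# Same formula / data / train-split / seed pattern as Optim.GLM above;
# each call ranks the tested variants and optimizes the threshold
cs_lm   <- Optim.LM(Y ~ ., AustralianCredit, p = 0.7, seed = 2018)
cs_lmm  <- Optim.LMM(Y ~ ., AustralianCredit, p = 0.7, seed = 2018)
cs_da   <- Optim.DA(Y ~ ., AustralianCredit, p = 0.7, seed = 2018)
cs_cart <- Optim.CART(Y ~ ., AustralianCredit, p = 0.7, seed = 2018)
cs_nn   <- Optim.NN(Y ~ ., AustralianCredit, p = 0.7, seed = 2018)
cs_svm  <- Optim.SVM(Y ~ ., AustralianCredit, p = 0.7, seed = 2018)
```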
If you find problems with the package, or there's anything that it doesn't do which you think it should, please submit them to https://github.com/economistgame/OptimClassifier/issues. In particular, let me know about optimizers and formats which you'd like supported, or if you have a workflow which might make sense for inclusion as a default convenience function.