OptimClassifier provides a set of tools for creating models, selecting the best parameter combination for a model, and selecting the best threshold for your binary classification. The package contains tools for the following methods:
| Method | Optim.LM | Optim.GLM | Optim.LMM | Optim.DA | Optim.CART | Optim.NN | Optim.SVM |
|---|---|---|---|---|---|---|---|
| Threshold optimization | ✔ | ✔ | ✔ | ✔* | ✔* | ✔ | ✔ |
| Parameter optimization | ✔ | ✔ | ✔ | ✔* | ✔* | ✔ | ✔ |
| What parameter or option? | Transformations | Family & Links | Random variable | Linear or Quadratic | CP | Hidden layers | Kernels |
*These models are natively classifiers.
The following example shows how to solve a common credit scoring problem with this package and the GLM methodology.
First, we load the dataset. In this example, we use the Australian Credit dataset.
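A minimal setup sketch, assuming the AustralianCredit dataset ships with the package (run `data(package = "OptimClassifier")` to check):

``` r
# Load the package and the bundled example dataset
library(OptimClassifier)
data("AustralianCredit")  # assumed to be shipped with the package

# Y is the binary response we want to classify
str(AustralianCredit)
```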
Then we create a model with the Optim.GLM function (or whichever one you prefer).
``` r
## Create the model
creditscoring <- Optim.GLM(Y~., AustralianCredit, p = 0.7, seed=2018)
#> Warning: Thresholds' criteria not selected. The success rate is defined as the default.
#>
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
```
Now you can print the results of the models:
``` r
### See a ranking of the models tested
print(creditscoring)
#> 7 successful models have been tested and 21 thresholds evaluated
#>
#> Model rmse Threshold success_rate ti_error tii_error
#> 1 poisson(log) 0.3389622 0.50 0.8365385 0.01442308 0.1490385
#> 2 poisson(sqrt) 0.3409605 0.55 0.8365385 0.01442308 0.1490385
#> 3 gaussian 0.3425880 0.60 0.8413462 0.01923077 0.1394231
#> 4 poisson 0.3430554 0.55 0.8365385 0.01442308 0.1490385
#> 5 binomial(cloglog) 0.3519640 0.35 0.8269231 0.01923077 0.1538462
#> 6 binomial(probit) 0.3587217 0.45 0.8221154 0.02403846 0.1538462
#> 7 binomial(logit) 0.3595036 0.35 0.8173077 0.01923077 0.1634615
```
Would you like to see this as a graphic? Try typing `plot(creditscoring)`.
But what about the details (coefficients and other information) of the best model? And the second one in the ranking? They are easy to inspect:
``` r
### Access the summary of the best model
summary(creditscoring)
#>
#> Call:
#> stats::glm(formula = formula, family = rowfamily, data = training,
#> model = FALSE, x = FALSE, y = FALSE)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.73227 -0.08182 -0.00412 0.10131 0.91599
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -8.844e-02 2.761e-01 -0.320 0.749
#> X11 5.292e-05 9.104e-02 0.001 1.000
#> X2 -1.919e-04 4.000e-03 -0.048 0.962
#> X3 -2.322e-03 8.483e-03 -0.274 0.784
#> X42 3.292e-02 9.671e-02 0.340 0.734
#> X43 3.393e-01 7.245e-01 0.468 0.640
#> X52 2.484e-01 5.225e-01 0.475 0.634
#> X53 2.495e-01 5.126e-01 0.487 0.626
#> X54 2.198e-01 4.875e-01 0.451 0.652
#> X55 2.165e-01 6.225e-01 0.348 0.728
#> X56 2.181e-01 5.141e-01 0.424 0.671
#> X57 3.087e-01 5.140e-01 0.601 0.548
#> X58 2.840e-01 4.915e-01 0.578 0.563
#> X59 2.887e-01 5.053e-01 0.571 0.568
#> X510 3.009e-01 5.611e-01 0.536 0.592
#> X511 2.911e-01 4.994e-01 0.583 0.560
#> X512 2.102e-01 9.427e-01 0.223 0.824
#> X513 3.624e-01 4.964e-01 0.730 0.465
#> X514 3.864e-01 5.059e-01 0.764 0.445
#> X62 -9.621e-02 6.912e-01 -0.139 0.889
#> X63 -1.236e-01 7.016e-01 -0.176 0.860
#> X64 -1.320e-01 4.755e-01 -0.278 0.781
#> X65 -1.132e-01 4.977e-01 -0.228 0.820
#> X67 -6.857e-04 6.103e-01 -0.001 0.999
#> X68 -1.445e-01 4.777e-01 -0.302 0.762
#> X69 -2.363e-01 6.580e-01 -0.359 0.720
#> X7 6.848e-03 1.244e-02 0.551 0.582
#> X81 4.490e-01 9.878e-02 4.545 5.48e-06 ***
#> X91 6.807e-02 1.029e-01 0.661 0.508
#> X10 2.183e-03 8.015e-03 0.272 0.785
#> X111 -2.626e-02 7.958e-02 -0.330 0.741
#> X122 3.305e-02 1.626e-01 0.203 0.839
#> X123 4.749e-01 4.169e-01 1.139 0.255
#> X13 -1.577e-04 2.677e-04 -0.589 0.556
#> X14 2.016e-06 7.898e-06 0.255 0.799
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 82.130 on 481 degrees of freedom
#> Residual deviance: 29.559 on 447 degrees of freedom
#> AIC: 1200.4
#>
#> Number of Fisher Scoring iterations: 4
```
``` r
### Access the summary of the second model
summary(creditscoring, 2)
#>
#> Call:
#> stats::glm(formula = formula, family = rowfamily, data = training,
#> model = FALSE, x = FALSE, y = FALSE)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.73617 -0.08578 0.00034 0.11181 0.92131
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 9.519e-01 1.563e-01 6.092 1.12e-09 ***
#> X11 8.678e-04 5.424e-02 0.016 0.987
#> X2 -9.261e-05 2.425e-03 -0.038 0.970
#> X3 -1.450e-03 5.257e-03 -0.276 0.783
#> X42 1.715e-02 5.650e-02 0.304 0.761
#> X43 1.897e-01 4.816e-01 0.394 0.694
#> X52 1.284e-01 3.044e-01 0.422 0.673
#> X53 1.248e-01 2.995e-01 0.417 0.677
#> X54 1.136e-01 2.834e-01 0.401 0.689
#> X55 1.232e-01 3.904e-01 0.315 0.752
#> X56 1.097e-01 3.005e-01 0.365 0.715
#> X57 1.610e-01 3.018e-01 0.533 0.594
#> X58 1.488e-01 2.890e-01 0.515 0.607
#> X59 1.482e-01 2.968e-01 0.500 0.617
#> X510 1.609e-01 3.415e-01 0.471 0.638
#> X511 1.557e-01 2.942e-01 0.529 0.597
#> X512 1.221e-01 6.258e-01 0.195 0.845
#> X513 1.991e-01 2.952e-01 0.674 0.500
#> X514 2.211e-01 3.008e-01 0.735 0.462
#> X62 -5.000e-02 4.211e-01 -0.119 0.905
#> X63 -7.690e-02 4.346e-01 -0.177 0.860
#> X64 -6.665e-02 2.789e-01 -0.239 0.811
#> X65 -5.029e-02 2.933e-01 -0.171 0.864
#> X67 5.441e-03 3.656e-01 0.015 0.988
#> X68 -7.305e-02 2.814e-01 -0.260 0.795
#> X69 -1.321e-01 4.059e-01 -0.326 0.745
#> X7 4.670e-03 8.174e-03 0.571 0.568
#> X81 2.609e-01 5.799e-02 4.499 6.81e-06 ***
#> X91 3.587e-02 6.368e-02 0.563 0.573
#> X10 1.979e-03 5.620e-03 0.352 0.725
#> X111 -1.520e-02 4.805e-02 -0.316 0.752
#> X122 1.698e-02 9.119e-02 0.186 0.852
#> X123 2.728e-01 2.656e-01 1.027 0.304
#> X13 -8.530e-05 1.584e-04 -0.539 0.590
#> X14 1.540e-06 5.360e-06 0.287 0.774
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 82.13 on 481 degrees of freedom
#> Residual deviance: 29.86 on 447 degrees of freedom
#> AIC: 1200.7
#>
#> Number of Fisher Scoring iterations: 4
```
Optimization is the process of adjusting the parameters of your trained model to improve the quality of your classification. Depending on your goals, optimization can involve implementation improvements or changes to your classification model. This package focuses on two questions: the threshold and several model-specific options.
Optimizing your classification model is important when you want to fully achieve its potential. Through optimization, you can improve the root mean square error (RMSE), increase the success rate, or accomplish other goals such as minimizing the type I or type II error.
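For instance, the warning in the example above notes that the success rate is the default threshold criterion. A sketch of targeting a different goal, assuming the criterion is exposed through an argument named `criteria` (the argument name and accepted values are assumptions here; check `?Optim.GLM` for the real interface):

``` r
# Hypothetical: optimize the threshold for type I error rather than
# the default success rate (argument name and value are assumed,
# see ?Optim.GLM)
creditscoring_ti <- Optim.GLM(Y ~ ., AustralianCredit,
                              criteria = "ti.error",
                              p = 0.7, seed = 2018)
```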
Optim.LM applies transformations to the response variable to improve the precision of the linear model. Then the function searches for the best threshold to obtain the best possible result for your goal.
Several transformations of the response variable are included.
Optim.GLM tries different types of error distributions (called a family in R) and several transformations of the data (called a link in R). Then the function searches for the best threshold to obtain the best possible result for your goal.
Models trained with this function, as shown in the ranking above: gaussian, poisson (with identity, log, and sqrt links), and binomial (with logit, probit, and cloglog links).
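These are the same family and link objects used by base R's `glm`; independently of this package, one of the combinations from the ranking above looks like this:

``` r
# Base-R equivalent of one tested combination: binomial errors with a
# complementary log-log link (Y assumed to be a binary 0/1 or factor)
m <- stats::glm(Y ~ ., family = binomial(link = "cloglog"),
                data = AustralianCredit)
summary(m)
```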
Optim.LMM searches for the variable that can be used as a random effect to improve the model precision. Then the function searches for the best threshold to obtain the best possible result for your goal.
Optim.DA tries to train both a Linear and a Quadratic Discriminant Analysis, because sometimes the characteristics of the data make it impossible to train a QDA.
Optim.CART focuses on the pruning process and compares several levels of pruning. For this it uses the complexity parameter (CP), which is the amount by which splitting a node must improve the relative error.
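The CP here is the standard CART complexity parameter as implemented in the rpart package (presumably the machinery behind Optim.CART, though that is an assumption). Independently of this package, you can inspect the pruning levels of a tree like this:

``` r
library(rpart)

# Grow a classification tree and print its CP table: each row is a
# pruning level with the relative error improvement that split buys
tree <- rpart(Y ~ ., data = AustralianCredit, method = "class")
printcp(tree)
```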
Optim.NN searches for the number of hidden layers that improves the model precision. Then the function searches for the best threshold to obtain the best possible result for your goal.
Optim.SVM tries different types of kernels to improve the precision. Then the function searches for the best threshold to obtain the best possible result for your goal.
Several kernel types are tried by this function.
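All of the functions above appear to follow the same calling pattern as the `Optim.GLM` example; a sketch, with the exact signatures assumed to match (see each function's help page):

``` r
# Same formula / data / train-split / seed pattern as Optim.GLM above;
# each call ranks the tested variants and optimizes the threshold
cs_lm   <- Optim.LM(Y ~ ., AustralianCredit, p = 0.7, seed = 2018)
cs_lmm  <- Optim.LMM(Y ~ ., AustralianCredit, p = 0.7, seed = 2018)
cs_da   <- Optim.DA(Y ~ ., AustralianCredit, p = 0.7, seed = 2018)
cs_cart <- Optim.CART(Y ~ ., AustralianCredit, p = 0.7, seed = 2018)
cs_nn   <- Optim.NN(Y ~ ., AustralianCredit, p = 0.7, seed = 2018)
cs_svm  <- Optim.SVM(Y ~ ., AustralianCredit, p = 0.7, seed = 2018)
```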
If you find problems with the package, or there's anything that it doesn't do which you think it should, please submit them to https://github.com/economistgame/OptimClassifier/issues. In particular, let me know about optimizers and formats which you'd like supported, or if you have a workflow which might make sense for inclusion as a default convenience function.