The Regression() function performs multiple facets of a complete regression analysis. Abbreviate with reg().
To illustrate, first read the Employee data included as part of lessR.
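The call behind the listing below is not preserved in this text. A minimal sketch, assuming lessR is installed and following the lessR convention of naming the default data frame d:

```r
library(lessR)

# Read the Employee data set bundled with lessR into d,
# the data frame that lessR analysis functions use by default
d <- Read("Employee")
```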
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Years integer 36 1 16 7 NA 15 ... 1 2 10
## 2 Gender character 37 0 2 M M M ... F F M
## 3 Dept character 36 1 5 ADMN SALE SALE ... MKTG SALE FINC
## 4 Salary double 37 0 37 53788.26 94494.58 ... 56508.32 57562.36
## 5 JobSat character 35 2 3 med low low ... high low high
## 6 Plan integer 37 0 3 1 1 3 ... 2 2 1
## 7 Pre integer 37 0 27 82 62 96 ... 83 59 80
## 8 Post integer 37 0 22 92 74 97 ... 90 71 87
## ------------------------------------------------------------------------------------------
The brief version provides just the basic analysis, comparable to what Excel provides, plus a scatterplot with the regression line.
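The output below names the model Salary ~ Years + Pre, but the call itself is not shown. Assuming the brief version was requested through the lessR function reg_brief(), the call would look like this sketch:

```r
library(lessR)
d <- Read("Employee")

# Brief regression of Salary on Years and Pre,
# using the default data frame d
reg_brief(Salary ~ Years + Pre)
```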
## >>> Suggestion
## # Create an R markdown file for interpretative output with Rmd = "file_name"
## reg(Salary ~ Years + Pre, Rmd="eg")
##
##
## BACKGROUND
##
## Data Frame: d
##
## Response Variable: Salary
## Predictor Variable 1: Years
## Predictor Variable 2: Pre
##
## Number of cases (rows) of data: 37
## Number of cases retained for analysis: 36
##
##
## BASIC ANALYSIS
##
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 44140.971 13666.115 3.230 0.003 16337.052 71944.891
## Years 3251.408 347.529 9.356 0.000 2544.355 3958.462
## Pre -18.265 167.652 -0.109 0.914 -359.355 322.825
##
##
## Standard deviation of residuals: 11753.478 for 33 degrees of freedom
##
## R-squared: 0.726 Adjusted R-squared: 0.710 PRESS R-squared: 0.659
##
## Null hypothesis that all population slope coefficients are 0:
## F-statistic: 43.827 df: 2 and 33 p-value: 0.000
##
##
## df Sum Sq Mean Sq F-value p-value
## Years 1 12107157290.292 12107157290.292 87.641 0.000
## Pre 1 1639658.444 1639658.444 0.012 0.914
##
## Model 2 12108796948.736 6054398474.368 43.827 0.000
## Residuals 33 4558759843.773 138144237.690
## Salary 35 16667556792.508 476215908.357
##
##
## K-FOLD CROSS-VALIDATION
##
## RELATIONS AMONG THE VARIABLES
##
## RESIDUALS AND INFLUENCE
##
## FORECASTING ERROR
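The fit indices in the basic analysis follow directly from the ANOVA table. As a check on the arithmetic, the following base R sketch recomputes them from the sums of squares printed above. The variable names are chosen here for illustration only.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_model <- 12108796948.736
ss_resid <- 4558759843.773
ss_total <- 16667556792.508
df_model <- 2
df_resid <- 33
df_total <- 35

ms_resid <- ss_resid / df_resid                     # residual mean square
se       <- sqrt(ms_resid)                          # standard deviation of residuals
rsq      <- ss_model / ss_total                     # R-squared
rsq_adj  <- 1 - ms_resid / (ss_total / df_total)    # adjusted R-squared
f_stat   <- (ss_model / df_model) / ms_resid        # overall F-statistic

round(c(se = se, Rsq = rsq, RsqAdj = rsq_adj, F = f_stat), 3)
```

The rounded values agree with the reported 11753.478, 0.726, 0.710 and 43.827.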
The full output is extensive: a summary of the analysis, the estimated model, fit indices, ANOVA, correlation matrix, collinearity analysis, best subset regression, residuals and influence statistics, and prediction intervals.
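The full output that follows corresponds to the same model. A sketch of the call, using the reg() abbreviation introduced at the start and assuming the Employee data are in d:

```r
library(lessR)
d <- Read("Employee")

# Full regression analysis: text output plus the three plots
# listed at the end of the output
reg(Salary ~ Years + Pre)
```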
## >>> Suggestion
## # Create an R markdown file for interpretative output with Rmd = "file_name"
## reg(Salary ~ Years + Pre, Rmd="eg")
##
##
## BACKGROUND
##
## Data Frame: d
##
## Response Variable: Salary
## Predictor Variable 1: Years
## Predictor Variable 2: Pre
##
## Number of cases (rows) of data: 37
## Number of cases retained for analysis: 36
##
##
## BASIC ANALYSIS
##
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 44140.971 13666.115 3.230 0.003 16337.052 71944.891
## Years 3251.408 347.529 9.356 0.000 2544.355 3958.462
## Pre -18.265 167.652 -0.109 0.914 -359.355 322.825
##
##
## Standard deviation of residuals: 11753.478 for 33 degrees of freedom
##
## R-squared: 0.726 Adjusted R-squared: 0.710 PRESS R-squared: 0.659
##
## Null hypothesis that all population slope coefficients are 0:
## F-statistic: 43.827 df: 2 and 33 p-value: 0.000
##
##
## df Sum Sq Mean Sq F-value p-value
## Years 1 12107157290.292 12107157290.292 87.641 0.000
## Pre 1 1639658.444 1639658.444 0.012 0.914
##
## Model 2 12108796948.736 6054398474.368 43.827 0.000
## Residuals 33 4558759843.773 138144237.690
## Salary 35 16667556792.508 476215908.357
##
##
## K-FOLD CROSS-VALIDATION
##
## RELATIONS AMONG THE VARIABLES
##
## Salary Years Pre
## Salary 1.00 0.85 0.03
## Years 0.85 1.00 0.05
## Pre 0.03 0.05 1.00
##
##
## Tolerance VIF
## Years 0.998 1.002
## Pre 0.998 1.002
##
##
## Years Pre R2adj X's
## 1 0 0.718 1
## 1 1 0.710 2
## 0 1 -0.028 1
##
## [based on Thomas Lumley's leaps function from the leaps package]
##
##
##
## RESIDUALS AND INFLUENCE
##
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
## [sorted by Cook's Distance]
## [res_rows = 20, out of 36 rows of data, or do res_rows="all"]
## -----------------------------------------------------------------------------------------
## Years Pre Salary fitted resid rstdnt dffits cooks
## Correll, Trevon 21 97 134419.230 110648.843 23770.387 2.424 1.217 0.430
## James, Leslie 18 70 122563.380 101387.773 21175.607 1.998 0.714 0.156
## Capelle, Adam 24 83 108138.430 120658.778 -12520.348 -1.211 -0.634 0.132
## Hoang, Binh 15 96 111074.860 91158.659 19916.201 1.860 0.649 0.131
## Korhalkar, Jessica 2 74 72502.500 49292.181 23210.319 2.171 0.638 0.122
## Billing, Susan 4 91 72675.260 55484.493 17190.767 1.561 0.472 0.071
## Singh, Niral 2 59 61055.440 49566.155 11489.285 1.064 0.452 0.068
## Skrotzki, Sara 18 63 91352.330 101515.627 -10163.297 -0.937 -0.397 0.053
## Saechao, Suzanne 8 98 55545.250 68362.271 -12817.021 -1.157 -0.390 0.050
## Kralik, Laura 10 74 92681.190 75303.447 17377.743 1.535 0.287 0.026
## Anastasiou, Crystal 2 59 56508.320 49566.155 6942.165 0.636 0.270 0.025
## Langston, Matthew 5 94 49188.960 58681.106 -9492.146 -0.844 -0.268 0.024
## Afshari, Anbar 6 100 69441.930 61822.925 7619.005 0.689 0.264 0.024
## Cassinelli, Anastis 10 80 57562.360 75193.857 -17631.497 -1.554 -0.265 0.022
## Osterman, Pascal 5 69 49704.790 59137.730 -9432.940 -0.826 -0.216 0.016
## Bellingar, Samantha 10 67 66337.830 75431.301 -9093.471 -0.793 -0.198 0.013
## LaRoe, Maria 10 80 61961.290 75193.857 -13232.567 -1.148 -0.195 0.013
## Ritchie, Darnell 7 82 53788.260 65403.102 -11614.842 -1.006 -0.190 0.012
## Sheppard, Cory 14 66 95027.550 88455.199 6572.351 0.579 0.176 0.011
## Downs, Deborah 7 90 57139.900 65256.982 -8117.082 -0.706 -0.174 0.010
##
##
## FORECASTING ERROR
##
## Data, Predicted, Standard Error of Forecast, 95% Prediction Intervals
## [sorted by lower bound of prediction interval]
## [to see all intervals do pred_rows="all"]
## --------------------------------------------------------------------------------------------------
## Years Pre Salary pred sf pi:lwr pi:upr width
## Hamide, Bita 1 83 51036.850 45876.388 12290.483 20871.211 70881.564 50010.352
## Singh, Niral 2 59 61055.440 49566.155 12619.291 23892.014 75240.296 51348.281
## Anastasiou, Crystal 2 59 56508.320 49566.155 12619.291 23892.014 75240.296 51348.281
## ...
## Link, Thomas 10 83 66312.890 75139.062 11933.518 50860.137 99417.987 48557.849
## LaRoe, Maria 10 80 61961.290 75193.857 11918.048 50946.405 99441.308 48494.903
## Cassinelli, Anastis 10 80 57562.360 75193.857 11918.048 50946.405 99441.308 48494.903
## ...
## Correll, Trevon 21 97 134419.230 110648.843 12881.876 84440.470 136857.217 52416.747
## Capelle, Adam 24 83 108138.430 120658.778 12955.608 94300.394 147017.161 52716.767
##
##
## ----------------------------------
## Plot 1: Distribution of Residuals
## Plot 2: Residuals vs Fitted Values
## Plot 3: ScatterPlot Matrix
## ----------------------------------
The standard output includes \(R^2_{press}\), an estimate of \(R^2\) when the model is applied to new, previously unseen data. A cross-validation option is also offered with the kfold parameter. Here specify three folds.
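A sketch of the corresponding call, assuming the Employee data are in d. Assignment of rows to folds is presumably random, so the fold-level numbers from a new run need not match the table below.

```r
library(lessR)
d <- Read("Employee")

# Same model, with 3-fold cross-validation added to the output
reg(Salary ~ Years + Pre, kfold=3)
```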
## K-FOLD CROSS-VALIDATION
##
## Model from Training Data Applied to Testing Data
## ---------------------------------- ----------------------------------
## fold n se MSE Rsq n se MSE Rsq
## 1 | 24 12273.934 150649453.294 0.731 | 12 11306.800 127843727.961 0.703
## 2 | 24 10936.028 119596701.753 0.777 | 12 14446.144 208691069.124 0.571
## 3 | 24 11646.282 135635890.275 0.676 | 12 12965.769 168111155.301 0.774
## ---------------------------------- ----------------------------------
## Mean 11618.748 135294015.107 0.728 12906.237 168215317.462 0.683
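For comparison with the k-fold results, the \(R^2_{press}\) statistic mentioned earlier can be viewed as leave-one-out cross-validation. The base R sketch below assumes the usual definition, one minus PRESS over the total sum of squares; whether lessR computes it in exactly this way is not stated in this document. It uses the built-in mtcars data as a stand-in, not the Employee data.

```r
# Stand-in example: mtcars ships with base R and is NOT the Employee data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Leave-one-out (PRESS) residuals without refitting the model:
# each ordinary residual is divided by 1 minus its leverage
press_resid <- residuals(fit) / (1 - hatvalues(fit))
press       <- sum(press_resid^2)

# Total sum of squares of the response about its mean
ss_total <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

rsq_press <- 1 - press / ss_total
rsq       <- summary(fit)$r.squared

# Each PRESS residual is at least as large in magnitude as the ordinary
# residual, so PRESS R-squared cannot exceed the ordinary R-squared
c(Rsq = rsq, RsqPRESS = rsq_press)
```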
The output of Regression() can be stored into an R object, here named r. The output object consists of various components.
Entering the name of the object displays the full output.
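A sketch of the two steps, assuming the Employee data are in d:

```r
library(lessR)
d <- Read("Employee")

# Store the analysis in r
r <- reg(Salary ~ Years + Pre)

# Entering the name of the object displays the full output
r
```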
## >>> Suggestion
## # Create an R markdown file for interpretative output with Rmd = "file_name"
## reg(Salary ~ Years + Pre, Rmd="eg")
##
##
## BACKGROUND
##
## Data Frame: d
##
## Response Variable: Salary
## Predictor Variable 1: Years
## Predictor Variable 2: Pre
##
## Number of cases (rows) of data: 37
## Number of cases retained for analysis: 36
##
##
## BASIC ANALYSIS
##
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 44140.971 13666.115 3.230 0.003 16337.052 71944.891
## Years 3251.408 347.529 9.356 0.000 2544.355 3958.462
## Pre -18.265 167.652 -0.109 0.914 -359.355 322.825
##
##
## Standard deviation of residuals: 11753.478 for 33 degrees of freedom
##
## R-squared: 0.726 Adjusted R-squared: 0.710 PRESS R-squared: 0.659
##
## Null hypothesis that all population slope coefficients are 0:
## F-statistic: 43.827 df: 2 and 33 p-value: 0.000
##
##
## df Sum Sq Mean Sq F-value p-value
## Years 1 12107157290.292 12107157290.292 87.641 0.000
## Pre 1 1639658.444 1639658.444 0.012 0.914
##
## Model 2 12108796948.736 6054398474.368 43.827 0.000
## Residuals 33 4558759843.773 138144237.690
## Salary 35 16667556792.508 476215908.357
##
##
## K-FOLD CROSS-VALIDATION
##
## RELATIONS AMONG THE VARIABLES
##
## Salary Years Pre
## Salary 1.00 0.85 0.03
## Years 0.85 1.00 0.05
## Pre 0.03 0.05 1.00
##
##
## Tolerance VIF
## Years 0.998 1.002
## Pre 0.998 1.002
##
##
## Years Pre R2adj X's
## 1 0 0.718 1
## 1 1 0.710 2
## 0 1 -0.028 1
##
## [based on Thomas Lumley's leaps function from the leaps package]
##
##
##
## RESIDUALS AND INFLUENCE
##
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
## [sorted by Cook's Distance]
## [res_rows = 20, out of 36 rows of data, or do res_rows="all"]
## -----------------------------------------------------------------------------------------
## Years Pre Salary fitted resid rstdnt dffits cooks
## Correll, Trevon 21 97 134419.230 110648.843 23770.387 2.424 1.217 0.430
## James, Leslie 18 70 122563.380 101387.773 21175.607 1.998 0.714 0.156
## Capelle, Adam 24 83 108138.430 120658.778 -12520.348 -1.211 -0.634 0.132
## Hoang, Binh 15 96 111074.860 91158.659 19916.201 1.860 0.649 0.131
## Korhalkar, Jessica 2 74 72502.500 49292.181 23210.319 2.171 0.638 0.122
## Billing, Susan 4 91 72675.260 55484.493 17190.767 1.561 0.472 0.071
## Singh, Niral 2 59 61055.440 49566.155 11489.285 1.064 0.452 0.068
## Skrotzki, Sara 18 63 91352.330 101515.627 -10163.297 -0.937 -0.397 0.053
## Saechao, Suzanne 8 98 55545.250 68362.271 -12817.021 -1.157 -0.390 0.050
## Kralik, Laura 10 74 92681.190 75303.447 17377.743 1.535 0.287 0.026
## Anastasiou, Crystal 2 59 56508.320 49566.155 6942.165 0.636 0.270 0.025
## Langston, Matthew 5 94 49188.960 58681.106 -9492.146 -0.844 -0.268 0.024
## Afshari, Anbar 6 100 69441.930 61822.925 7619.005 0.689 0.264 0.024
## Cassinelli, Anastis 10 80 57562.360 75193.857 -17631.497 -1.554 -0.265 0.022
## Osterman, Pascal 5 69 49704.790 59137.730 -9432.940 -0.826 -0.216 0.016
## Bellingar, Samantha 10 67 66337.830 75431.301 -9093.471 -0.793 -0.198 0.013
## LaRoe, Maria 10 80 61961.290 75193.857 -13232.567 -1.148 -0.195 0.013
## Ritchie, Darnell 7 82 53788.260 65403.102 -11614.842 -1.006 -0.190 0.012
## Sheppard, Cory 14 66 95027.550 88455.199 6572.351 0.579 0.176 0.011
## Downs, Deborah 7 90 57139.900 65256.982 -8117.082 -0.706 -0.174 0.010
##
##
## FORECASTING ERROR
##
## Data, Predicted, Standard Error of Forecast, 95% Prediction Intervals
## [sorted by lower bound of prediction interval]
## [to see all intervals do pred_rows="all"]
## --------------------------------------------------------------------------------------------------
## Years Pre Salary pred sf pi:lwr pi:upr width
## Hamide, Bita 1 83 51036.850 45876.388 12290.483 20871.211 70881.564 50010.352
## Singh, Niral 2 59 61055.440 49566.155 12619.291 23892.014 75240.296 51348.281
## Anastasiou, Crystal 2 59 56508.320 49566.155 12619.291 23892.014 75240.296 51348.281
## ...
## Link, Thomas 10 83 66312.890 75139.062 11933.518 50860.137 99417.987 48557.849
## LaRoe, Maria 10 80 61961.290 75193.857 11918.048 50946.405 99441.308 48494.903
## Cassinelli, Anastis 10 80 57562.360 75193.857 11918.048 50946.405 99441.308 48494.903
## ...
## Correll, Trevon 21 97 134419.230 110648.843 12881.876 84440.470 136857.217 52416.747
## Capelle, Adam 24 83 108138.430 120658.778 12955.608 94300.394 147017.161 52716.767
##
##
## ----------------------------------
## Plot 1: Distribution of Residuals
## Plot 2: Residuals vs Fitted Values
## Plot 3: ScatterPlot Matrix
## ----------------------------------
Or, work with the components individually. Use the base R names() function to identify all of the components. Component names that begin with out_ are part of the standard output. Other components include just data and statistics designed to be input in additional procedures.
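For example, with the analysis stored in r (the setup is repeated so that the sketch runs on its own):

```r
library(lessR)
d <- Read("Employee")
r <- reg(Salary ~ Years + Pre)

# List the names of all components of the stored output
names(r)
```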
## [1] "out_suggest" "call" "formula" "out_title_bck" "out_background" "out_title_basic"
## [7] "out_estimates" "out_fit" "out_anova" "out_title_kfold" "out_kfold" "out_title_rel"
## [13] "out_cor" "out_collinear" "out_subsets" "out_title_res" "out_residuals" "out_title_pred"
## [19] "out_predict" "out_ref" "out_Rmd" "out_Word" "out_pdf" "out_odt"
## [25] "out_rtf" "out_plots" "n.vars" "n.obs" "n.keep" "coefficients"
## [31] "sterrs" "tvalues" "pvalues" "cilb" "ciub" "anova_model"
## [37] "anova_residual" "anova_total" "se" "resid_range" "Rsq" "Rsqadj"
## [43] "PRESS" "RsqPRESS" "m_se" "m_MSE" "m_Rsq" "cor"
## [49] "tolerances" "vif" "resid.max" "pred_min_max" "residuals" "fitted"
## [55] "cooks.distance" "model" "terms"
Here just display the estimates as part of the standard output.
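Components are extracted with the usual `$` operator. A sketch for the out_estimates component, which appears in the list of names above:

```r
library(lessR)
d <- Read("Employee")
r <- reg(Salary ~ Years + Pre)

# The formatted table of estimates, as it appears in the standard output
r$out_estimates
```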
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 44140.971 13666.115 3.230 0.003 16337.052 71944.891
## Years 3251.408 347.529 9.356 0.000 2544.355 3958.462
## Pre -18.265 167.652 -0.109 0.914 -359.355 322.825
Here display the coefficients.
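The coefficients component holds the estimates as plain numbers rather than formatted text, which makes it suitable as input to further computation. A sketch:

```r
library(lessR)
d <- Read("Employee")
r <- reg(Salary ~ Years + Pre)

# Numeric vector of estimated coefficients
r$coefficients
```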
## (Intercept) Years Pre
## 44140.97140 3251.40825 -18.26496
The Rmd parameter creates an automatically generated R markdown file, along with the html document that results from knitting the various output components together with full interpretation. This is a new, much more complete form of computer output.
reg(Salary ~ Years + Pre, Rmd="eg")
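The output that follows switches to logistic regression, with the binary variable Gender as the response and Salary as the single predictor. The generating call is not preserved in this text. Judging from the Logit() example later in this document, it was presumably of the following form:

```r
library(lessR)
d <- Read("Employee")

# Logistic regression of Gender on Salary
# (presumed call, reconstructed from the output that follows)
Logit(Gender ~ Salary)
```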
##
## Response Variable: Gender
## Predictor Variable 1: Salary
##
## Number of cases (rows) of data: 37
## Number of cases retained for analysis: 37
##
##
##
## BASIC ANALYSIS
##
## Model Coefficients
##
## Estimate Std Err z-value p-value Lower 95% Upper 95%
## (Intercept) -2.6191 1.3715 -1.910 0.056 -5.3073 0.0691
## Salary 0.0000 0.0000 1.904 0.057 -0.0000 0.0001
##
##
## Odds ratios and confidence intervals
##
## Odds Ratio Lower 95% Upper 95%
## (Intercept) 0.0729 0.0050 1.0715
## Salary 1.0000 1.0000 1.0001
##
##
## Model Fit
##
## Null deviance: 51.266 on 36 degrees of freedom
## Residual deviance: 46.918 on 35 degrees of freedom
##
## AIC: 50.91807
##
## Number of iterations to convergence: 4
##
##
##
##
## ANALYSIS OF RESIDUALS AND INFLUENCE
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
## [sorted by Cook's Distance]
## [res_rows = 20 out of 37 cases (rows) of data]
## --------------------------------------------------------------------
## Salary Gender fitted residual rstudent dffits cooks
## James, Leslie 122563 F 0.8424 -0.8424 -2.1213 -0.7143 0.46299
## Langston, Matthew 49189 M 0.2900 0.7100 1.6237 0.3646 0.08559
## Osterman, Pascal 49705 M 0.2938 0.7062 1.6139 0.3586 0.08225
## Kralik, Laura 92681 F 0.6522 -0.6522 -1.4942 -0.3313 0.06402
## Ritchie, Darnell 53788 M 0.3243 0.6757 1.5380 0.3136 0.05962
## Skrotzki, Sara 91352 F 0.6416 -0.6416 -1.4698 -0.3161 0.05736
## Cassinelli, Anastis 57562 M 0.3539 0.6461 1.4703 0.2761 0.04409
## Link, Thomas 66313 M 0.4267 0.5733 1.3223 0.2111 0.02335
## Anderson, David 69548 M 0.4547 0.5453 1.2706 0.1967 0.01962
## Stanley, Grayson 69625 M 0.4553 0.5447 1.2694 0.1965 0.01955
## Capelle, Adam 108138 M 0.7632 0.2368 0.7586 0.2236 0.01954
## Knox, Michael 99063 M 0.7011 0.2989 0.8637 0.2179 0.01935
## Hoang, Binh 111075 M 0.7813 0.2187 0.7265 0.2228 0.01919
## Sheppard, Cory 95028 M 0.6706 0.3294 0.9132 0.2119 0.01869
## Wu, James 94495 M 0.6665 0.3335 0.9199 0.2110 0.01859
## Campagna, Justin 72321 M 0.4788 0.5212 1.2275 0.1888 0.01759
## Fulton, Scott 87786 M 0.6124 0.3876 1.0066 0.1980 0.01706
## Adib, Hassan 83014 M 0.5720 0.4280 1.0715 0.1892 0.01613
## Pham, Scott 81871 M 0.5622 0.4378 1.0875 0.1875 0.01599
## Portlock, Ryan 77715 M 0.5261 0.4739 1.1469 0.1841 0.01593
##
##
## FORECASTS
##
## Probability threshold for predicting M: 0.5
##
## 0: F
## 1: M
##
## Data, Fitted Values, Standard Errors
## [sorted by fitted value]
## --------------------------------------------------------------------
## Salary Gender predict fitted std.err
## Stanley, Emma 46125 F 0 0.2684 0.1161
## Langston, Matthew 49189 M 0 0.2900 0.1126
## Osterman, Pascal 49705 M 0 0.2938 0.1119
## Gvakharia, Kimberly 49869 F 0 0.2949 0.1117
##
## ... for the rows of data where fitted is close to 0.5 ...
##
## Salary Gender predict fitted std.err
## Campagna, Justin 72321 M 0 0.4788 0.08710
## Korhalkar, Jessica 72502 F 0 0.4804 0.08713
## Billing, Susan 72675 F 0 0.4819 0.08718
## Portlock, Ryan 77715 M 1 0.5261 0.09079
## Pham, Scott 81871 M 1 0.5622 0.09670
##
## ... for the last 4 rows of sorted data ...
##
## Salary Gender predict fitted std.err
## Capelle, Adam 108138 M 1 0.7632 0.1355
## Hoang, Binh 111075 M 1 0.7813 0.1364
## James, Leslie 122563 F 1 0.8424 0.1318
## Correll, Trevon 134419 M 1 0.8901 0.1174
## --------------------------------------------------------------------
##
##
## Confusion Matrix for Gender
##
## Probability threshold for predicting M: 0.5
##
## Baseline Predicted
## ---------------------------------------------------
## Total %Tot 0 1 %Correct
## ---------------------------------------------------
## 0 19 51.4 16 3 84.2
## Gender 1 18 48.6 8 10 55.6
## ---------------------------------------------------
## Total 37 70.3
##
## Accuracy: 70.27
## Recall: 55.56
## Precision: 76.92
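The three summary percentages follow from the four cell counts of the confusion matrix. The base R sketch below recomputes them from the counts printed above. The object names are chosen here for illustration only.

```r
# Counts copied from the confusion matrix above:
# rows = actual Gender (0 = F, 1 = M), columns = predicted
cm <- matrix(c(16,  3,
                8, 10),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy  <- sum(diag(cm)) / sum(cm)         # correct predictions / all cases
recall    <- cm["1", "1"] / sum(cm["1", ])   # predicted 1 among actual 1
precision <- cm["1", "1"] / sum(cm[, "1"])   # actual 1 among predicted 1

round(100 * c(Accuracy = accuracy, Recall = recall, Precision = precision), 2)
```

The rounded values agree with the reported 70.27, 55.56 and 76.92.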
Specify multiple logistic regression with the usual R formula syntax. Specify additional probability thresholds beyond just the default 0.5 with the prob_cut parameter.
Logit(Gender ~ Years + Salary, prob_cut=c(.3, .5, .7))
Use the base R help() function to view the full manual for Regression(). Simply enter a question mark followed by the name of the function, or its abbreviation.
?reg