title: "Using Outlier Detection Algorithms to Analyze NBA Players"
author: "Cheng Fan"
date: "April 1, 2015"
output: html_document

#Section 1: Overview

This is a short tutorial on using outlier detection algorithms to analyze NBA players. The data set, GoldenStatesWarriors, is included in the package. It contains the statistics of all 18 players on the Golden State Warriors roster during the 2013-2014 season. The data has 27 variables: player name (Name), age (Age), games played (G), games started (GS), minutes played (MP), field goals made (FG), field goal attempts (FGA), field goal percentage (FGP), 3-pointers made (3P), 3-pointer attempts (3PA), 3-pointer percentage (3PP), 2-pointers made (2P), 2-pointer attempts (2PA), 2-pointer percentage (2PP), effective field goal percentage (eFGP), free throws made (FT), free throw attempts (FTA), free throw percentage (FTP), offensive rebounds (ORB), defensive rebounds (DRB), total rebounds (TRB), assists (AST), steals (STL), blocks (BLK), turnovers (TOV), personal fouls (PF), and total points (PTS).
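Several of these variables are derived from the others. As a quick sanity check, the sketch below recomputes a few of them for one player. The figures are Stephen Curry's line from this data set, typed in by hand. The formulas are the standard box-score definitions, which are assumed (not confirmed) to be the ones used to build the data.

```r
# Stephen Curry's 2013-2014 line, copied by hand from GoldenStatesWarriors
curry <- c(FG = 652, FGA = 1383, "3P" = 261, "2P" = 391, FT = 308,
           ORB = 46, DRB = 288)

# Field goal percentage: makes divided by attempts
fgp <- curry[["FG"]] / curry[["FGA"]]

# Effective field goal percentage: a made 3-pointer counts as 1.5 field goals
efgp <- (curry[["FG"]] + 0.5 * curry[["3P"]]) / curry[["FGA"]]

# Total points and total rebounds are sums of their components
pts <- 2 * curry[["2P"]] + 3 * curry[["3P"]] + curry[["FT"]]
trb <- curry[["ORB"]] + curry[["DRB"]]

round(c(FGP = fgp, eFGP = efgp), 3)
c(PTS = pts, TRB = trb)
```

The recomputed values (0.471, 0.566, 1873, and 334) agree with Curry's FGP, eFGP, PTS, and TRB entries in the data.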

A summary of the data is shown below:

##      Name                Age              G               GS       
##  Length:18          Min.   :21.00   Min.   : 4.00   Min.   : 0.00  
##  Class :character   1st Qu.:23.25   1st Qu.:24.00   1st Qu.: 0.00  
##  Mode  :character   Median :25.00   Median :44.00   Median : 2.00  
##                     Mean   :26.33   Mean   :47.22   Mean   :22.78  
##                     3rd Qu.:29.00   3rd Qu.:75.75   3rd Qu.:53.25  
##                     Max.   :35.00   Max.   :82.00   Max.   :81.00  
##                                                                    
##        MP               FG              FGA               FGP        
##  Min.   :   6.0   Min.   :  0.00   Min.   :   1.00   Min.   :0.0000  
##  1st Qu.: 172.8   1st Qu.: 14.75   1st Qu.:  50.75   1st Qu.:0.3775  
##  Median : 772.0   Median :132.00   Median : 290.50   Median :0.4120  
##  Mean   :1101.7   Mean   :179.78   Mean   : 389.17   Mean   :0.4044  
##  3rd Qu.:1979.2   3rd Qu.:231.25   3rd Qu.: 458.75   3rd Qu.:0.4733  
##  Max.   :2868.0   Max.   :652.00   Max.   :1383.00   Max.   :0.6270  
##                                                                      
##        3P              3PA              3PP               2P       
##  Min.   :  0.00   Min.   :  0.00   Min.   :0.0000   Min.   :  0.0  
##  1st Qu.:  0.00   1st Qu.:  0.25   1st Qu.:0.2560   1st Qu.: 10.0  
##  Median :  9.00   Median : 35.00   Median :0.3220   Median :112.0  
##  Mean   : 43.00   Mean   :113.17   Mean   :0.2718   Mean   :136.8  
##  3rd Qu.: 51.25   3rd Qu.:155.75   3rd Qu.:0.3470   3rd Qu.:200.5  
##  Max.   :261.00   Max.   :615.00   Max.   :0.4240   Max.   :513.0  
##                                    NA's   :5                       
##       2PA             2PP              eFGP              FT        
##  Min.   :  1.0   Min.   :0.0000   Min.   :0.0000   Min.   :  1.00  
##  1st Qu.: 22.5   1st Qu.:0.4412   1st Qu.:0.4293   1st Qu.:  5.50  
##  Median :226.5   Median :0.4595   Median :0.4755   Median : 29.00  
##  Mean   :276.0   Mean   :0.4395   Mean   :0.4458   Mean   : 72.39  
##  3rd Qu.:402.0   3rd Qu.:0.4988   3rd Qu.:0.5182   3rd Qu.:107.25  
##  Max.   :980.0   Max.   :0.6270   Max.   :0.6270   Max.   :308.00  
##                                                                    
##       FTA              FTP              ORB              DRB       
##  Min.   :  2.00   Min.   :0.3440   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:  8.75   1st Qu.:0.5523   1st Qu.:  5.00   1st Qu.: 23.5  
##  Median : 53.50   Median :0.6925   Median : 28.50   Median :105.0  
##  Mean   : 96.17   Mean   :0.6692   Mean   : 49.78   Mean   :156.6  
##  3rd Qu.:133.50   3rd Qu.:0.7913   3rd Qu.: 80.25   3rd Qu.:243.8  
##  Max.   :348.00   Max.   :0.8850   Max.   :182.00   Max.   :489.0  
##                                                                    
##       TRB             AST             STL              BLK        
##  Min.   :  0.0   Min.   :  0.0   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 29.5   1st Qu.: 14.5   1st Qu.:  4.50   1st Qu.:  2.50  
##  Median :152.5   Median : 44.5   Median : 13.50   Median : 10.00  
##  Mean   :206.4   Mean   :106.2   Mean   : 35.67   Mean   : 22.61  
##  3rd Qu.:306.5   3rd Qu.:139.2   3rd Qu.: 60.75   3rd Qu.: 32.00  
##  Max.   :671.0   Max.   :666.0   Max.   :128.00   Max.   :121.00  
##                                                                   
##       TOV               PF              PTS        
##  Min.   :  0.00   Min.   :  0.00   Min.   :   1.0  
##  1st Qu.: 13.75   1st Qu.: 21.25   1st Qu.:  41.5  
##  Median : 51.00   Median : 73.00   Median : 351.5  
##  Mean   : 68.06   Mean   : 99.11   Mean   : 474.9  
##  3rd Qu.: 95.50   3rd Qu.:185.00   3rd Qu.: 568.8  
##  Max.   :294.00   Max.   :234.00   Max.   :1873.0  
## 

#Section 2: Data preprocessing

The variables in the data are clearly on different scales, so the raw data should be normalized before any outlier detection algorithm is applied. Z-normalization is used here: each column except the first (the player name) is centered on its mean and divided by its standard deviation. Note that the raw data contains NAs: some players did not attempt any 3-pointers, so their 3-pointer percentage is undefined. After scaling, these missing values are replaced with 0, which is the mean of every scaled column.

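library(plyr)
# Z-normalize every column except Name; aaply() returns one row per variable, so t() restores players as rows
data.scale <- t(aaply(.data = as.matrix(GoldenStatesWarriors[,-1]), .margins = 2, .fun = function(x) (x-mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE)))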
summary(data.scale)
##       Age                G                 GS                MP         
##  Min.   :-1.3690   Min.   :-1.5584   Min.   :-0.7171   Min.   :-1.0701  
##  1st Qu.:-0.7915   1st Qu.:-0.8373   1st Qu.:-0.7171   1st Qu.:-0.9073  
##  Median :-0.3423   Median :-0.1162   Median :-0.6541   Median :-0.3220  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6845   3rd Qu.: 1.0286   3rd Qu.: 0.9593   3rd Qu.: 0.8571  
##  Max.   : 2.2247   Max.   : 1.2539   Max.   : 1.8329   Max.   : 1.7251  
##                                                                         
##        FG               FGA               FGP                3P         
##  Min.   :-0.8806   Min.   :-0.9007   Min.   :-3.0511   Min.   :-0.5645  
##  1st Qu.:-0.8084   1st Qu.:-0.7853   1st Qu.:-0.2033   1st Qu.:-0.5645  
##  Median :-0.2340   Median :-0.2289   Median : 0.0570   Median :-0.4463  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2521   3rd Qu.: 0.1615   3rd Qu.: 0.5191   3rd Qu.: 0.1083  
##  Max.   : 2.3131   Max.   : 2.3061   Max.   : 1.6789   Max.   : 2.8617  
##                                                                         
##       3PA               3PP                2P               2PA         
##  Min.   :-0.6253   Min.   :-1.9685   Min.   :-0.8990   Min.   :-0.9170  
##  1st Qu.:-0.6239   1st Qu.:-0.1142   1st Qu.:-0.8332   1st Qu.:-0.8453  
##  Median :-0.4319   Median : 0.3638   Median :-0.1628   Median :-0.1651  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2353   3rd Qu.: 0.5449   3rd Qu.: 0.4188   3rd Qu.: 0.4202  
##  Max.   : 2.7729   Max.   : 1.1026   Max.   : 2.4727   Max.   : 2.3476  
##                    NA's   :5                                            
##       2PP                eFGP               FT               FTA         
##  Min.   :-3.26327   Min.   :-3.2123   Min.   :-0.8140   Min.   :-0.9027  
##  1st Qu.: 0.01299   1st Qu.:-0.1195   1st Qu.:-0.7627   1st Qu.:-0.8380  
##  Median : 0.14850   Median : 0.2138   Median :-0.4947   Median :-0.4090  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.43993   3rd Qu.: 0.5218   3rd Qu.: 0.3975   3rd Qu.: 0.3579  
##  Max.   : 1.39218   Max.   : 1.3053   Max.   : 2.6865   Max.   : 2.4141  
##                                                                          
##       FTP               ORB               DRB               TRB         
##  Min.   :-2.0177   Min.   :-0.8535   Min.   :-0.9869   Min.   :-0.9664  
##  1st Qu.:-0.7255   1st Qu.:-0.7677   1st Qu.:-0.8388   1st Qu.:-0.8283  
##  Median : 0.1448   Median :-0.3648   Median :-0.3252   Median :-0.2523  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.7575   3rd Qu.: 0.5225   3rd Qu.: 0.5491   3rd Qu.: 0.4688  
##  Max.   : 1.3393   Max.   : 2.2670   Max.   : 2.0945   Max.   : 2.1756  
##                                                                         
##       AST               STL               BLK               TOV         
##  Min.   :-0.6684   Min.   :-0.8757   Min.   :-0.7267   Min.   :-0.9301  
##  1st Qu.:-0.5771   1st Qu.:-0.7652   1st Qu.:-0.6464   1st Qu.:-0.7422  
##  Median :-0.3884   Median :-0.5443   Median :-0.4053   Median :-0.2331  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2078   3rd Qu.: 0.6159   3rd Qu.: 0.3018   3rd Qu.: 0.3751  
##  Max.   : 3.5223   Max.   : 2.2670   Max.   : 3.1623   Max.   : 3.0880  
##                                                                         
##        PF               PTS         
##  Min.   :-1.1330   Min.   :-0.8590  
##  1st Qu.:-0.8901   1st Qu.:-0.7856  
##  Median :-0.2985   Median :-0.2237  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.9818   3rd Qu.: 0.1700  
##  Max.   : 1.5420   Max.   : 2.5339  
## 
data.scale[is.na(data.scale)] <- 0
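
The same preprocessing can be cross-checked with base R. The sketch below is only an illustration on a small hand-typed excerpt (four players, three columns), not a replacement for the code above. It compares z-scores written out by hand with the result of `scale()`, then fills the NAs with 0.

```r
# A few rows of three numeric columns, copied by hand from GoldenStatesWarriors
toy <- cbind(
  MP    = c(2846, 6, 1769, 2288),
  PTS   = c(1873, 1, 492, 1257),
  "3PP" = c(0.424, NA, NA, 0.000)
)

# Column-wise z-scores written out by hand, skipping NAs
z.manual <- apply(toy, 2, function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))

# scale() centers each column on its mean and divides by its standard
# deviation, ignoring NAs in both steps
z.scale <- scale(toy)

# The two agree apart from the attributes that scale() attaches
all.equal(z.manual, z.scale, check.attributes = FALSE)

# Fill the remaining NAs with 0, the mean of each scaled column
z.filled <- z.scale
z.filled[is.na(z.filled)] <- 0
round(colMeans(z.filled), 10)
```

Because the non-missing z-scores in a column sum to zero, filling the NAs with 0 leaves every column mean at zero.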

#Section 3: Use of outlier detection algorithms

The basic version of ABOD (angle-based outlier detection) uses all the other observations in the data to evaluate the outlierness of each observation, so the computation can be very time-consuming. The approximate version of ABOD evaluates outlierness on a subset of the observations only. This improves speed, but the results can be less reliable. Because this data set is small, the basic version of ABOD is used here. The raw ABOD scores are transformed into probabilities using the outlier score unification scheme proposed by Kriegel, Kröger, Schubert, and Zimek in 2011.
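
To see why the basic version is expensive, here is a minimal base-R sketch of the angle-based outlier factor as described in the ABOD paper, to the best of my reading: for each observation, the distance-weighted variance of the angles it forms with every pair of other observations. The function `abof_sketch` and the toy data are hypothetical and are not the package's `Func.ABOD`. They only illustrate that every observation is scored against all pairs of the remaining ones, which makes the cost cubic in the number of observations. In this formulation a small value indicates an outlier.

```r
# Hypothetical helper: angle-based outlier factor for each row of X.
# Assumes at least 3 rows and no duplicated rows (zero distances would divide by 0).
abof_sketch <- function(X) {
  n <- nrow(X)
  sapply(seq_len(n), function(a) {
    others <- setdiff(seq_len(n), a)
    pairs <- combn(others, 2)            # all pairs (B, C) of other observations
    vals <- apply(pairs, 2, function(p) {
      ab <- X[p[1], ] - X[a, ]
      ac <- X[p[2], ] - X[a, ]
      nb <- sqrt(sum(ab^2))
      nc <- sqrt(sum(ac^2))
      c(w = 1 / (nb * nc),               # weight: closer pairs count more
        f = sum(ab * ac) / (nb^2 * nc^2)) # distance-scaled angle term
    })
    w <- vals["w", ]
    f <- vals["f", ]
    # Weighted variance of f over all pairs
    sum(w * f^2) / sum(w) - (sum(w * f) / sum(w))^2
  })
}

# Toy data: five points in a unit square and one point far away (row 6)
X <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1), c(0.5, 0.5), c(10, 10))
abof.toy <- abof_sketch(X)
which.min(abof.toy)   # row with the smallest factor, i.e. the most outlying
```

The far-away point sees all other points under nearly the same angle, so its variance is the smallest of the six.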

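# Basic ABOD: every other observation is used when scoring each player
score.ABOD <- Func.ABOD(data = data.scale, basic = TRUE)
# Map the raw scores to outlier probabilities with the unification scheme
score.trans.ABOD <- Func.trans(raw.score = score.ABOD, method = "ABOD")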
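
The internals of `Func.trans` are not shown in this tutorial. As a rough illustration of the idea behind the unification scheme, the sketch below follows my reading of the 2011 paper: scores are first regularized so that larger means more outlying (for ABOD-style scores, where small raw values indicate outliers, through a logarithmic inversion) and then mapped to [0, 1] by Gaussian scaling. The function name `unify_sketch` and the raw scores are made up for illustration, and the package may differ in its details.

```r
# Hypothetical helper, not the package's Func.trans
unify_sketch <- function(raw) {
  # Regularization: logarithmic inversion, so small raw scores become large
  reg <- -log(raw / max(raw))
  # Gaussian scaling: erf((reg - mean) / (sd * sqrt(2))), truncated at 0.
  # Uses the identity erf(x / sqrt(2)) = 2 * pnorm(x) - 1.
  # sd() is the sample standard deviation; this choice is an assumption.
  z <- (reg - mean(reg)) / sd(reg)
  pmax(0, 2 * pnorm(z) - 1)
}

raw <- c(0.5, 0.4, 0.6, 0.001)   # made-up raw scores; the last one is tiny
prob <- unify_sketch(raw)
round(prob, 3)
which.max(prob)                  # index of the most outlying observation
```

Observations whose regularized score is at or below the average get probability 0, so only clearly unusual observations receive a positive value.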

The top 5 identified outliers are Stephen Curry, Dewayne Dedmon, Andrew Bogut, David Lee, and Klay Thompson. Apart from Dewayne Dedmon, all of them were starters for the Golden State Warriors that season. Stephen Curry and Klay Thompson are selected as outliers because of their excellent scoring and shooting. In addition, Curry led the team in assists. Andrew Bogut is a defensive anchor who contributes heavily in rebounds and blocks. David Lee is important at both ends of the floor: he can score as well as rebound. The remaining outlier is Dewayne Dedmon, which is reasonable because he played only 4 games for the Warriors and his performance was not very impressive.

GoldenStatesWarriors$Name[order(score.trans.ABOD, decreasing = TRUE)[1:5]]
## [1] "Stephen Curry"  "Dewayne Dedmon" "Andrew Bogut"   "David Lee"     
## [5] "Klay Thompson"
GoldenStatesWarriors[order(score.trans.ABOD, decreasing = TRUE)[1:5],]
##              Name Age  G GS   MP  FG  FGA   FGP  3P 3PA   3PP  2P 2PA
## 2   Stephen Curry  25 78 78 2846 652 1383 0.471 261 615 0.424 391 768
## 18 Dewayne Dedmon  24  4  0    6   0    1 0.000   0   0    NA   0   1
## 7    Andrew Bogut  29 67 67 1769 235  375 0.627   0   0    NA 235 375
## 3       David Lee  30 69 67 2288 513  981 0.523   0   1 0.000 513 980
## 1   Klay Thompson  23 81 81 2868 559 1259 0.444 223 535 0.417 336 724
##      2PP  eFGP  FT FTA   FTP ORB DRB TRB AST STL BLK TOV  PF  PTS
## 2  0.509 0.566 308 348 0.885  46 288 334 666 128  14 294 194 1873
## 18 0.000 0.000   1   2 0.500   0   0   0   0   0   0   0   1    1
## 7  0.627 0.627  22  64 0.344 182 489 671 112  47 121  97 210  492
## 3  0.523 0.523 231 296 0.780 182 461 643 147  48  26 152 206 1257
## 1  0.464 0.533 147 185 0.795  38 211 249 181  74  37 135 234 1488

Next, the SOD (subspace outlier detection) algorithm is used. Given the small data size, a reference set of 5 observations (k.sel = 5) is reasonable. k.nn is set to 10; note that k.nn should be larger than k.sel. alpha is left at its default of 0.8, the value recommended by the authors of the algorithm.
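
To make the roles of the three parameters concrete, here is a simplified base-R sketch of the SOD idea as I understand it from the original paper. It is not the package's `Func.SOD`, and the function `sod_sketch` and the toy data are hypothetical. For each observation, the k.nn nearest neighbours are used to measure shared-neighbour similarity, the k.sel most similar observations form the reference set, the variables in which the reference set varies less than alpha times its average variance define the subspace, and the score is the distance to the reference mean in that subspace divided by the number of selected variables.

```r
# Hypothetical helper, not the package's Func.SOD
sod_sketch <- function(X, k.nn, k.sel, alpha = 0.8) {
  n <- nrow(X)
  D <- as.matrix(dist(X))
  # k.nn nearest neighbours of every observation, excluding itself
  knn <- lapply(seq_len(n), function(i) {
    ord <- order(D[i, ])
    ord[ord != i][seq_len(k.nn)]
  })
  sapply(seq_len(n), function(p) {
    # Shared-nearest-neighbour similarity: overlap of the two neighbour lists
    snn <- sapply(seq_len(n), function(q) length(intersect(knn[[p]], knn[[q]])))
    snn[p] <- -1   # sketch choice: an observation is not its own reference
    ref <- order(snn, decreasing = TRUE)[seq_len(k.sel)]
    R <- X[ref, , drop = FALSE]
    mu <- colMeans(R)
    v <- colMeans(sweep(R, 2, mu)^2)   # per-variable variance in the reference set
    keep <- v < alpha * mean(v)        # variables where the reference set is tight
    if (!any(keep)) return(0)
    sqrt(sum((X[p, keep] - mu[keep])^2)) / sum(keep)
  })
}

# Toy data: eight points along the line y = 0 and one point far off it (row 9)
X <- rbind(cbind(x = 1:8, y = 0), c(4.5, 20))
sod.toy <- sod_sketch(X, k.nn = 4, k.sel = 3)
round(sod.toy, 2)
which.max(sod.toy)
```

In this toy example the reference set of the last point lies entirely on the line, so only y is selected, and the point is scored by its distance from the line in that one variable.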

The top 5 outliers are quite similar to those identified by ABOD. The only difference is that, instead of the bench player Dewayne Dedmon, the SOD algorithm picks another important Warriors player, Draymond Green. Even though Green seldom started (12 of his 82 games), he played more minutes than the starter Bogut. His role was the team's sixth man. During the 2013-2014 season, Green really came out strong. As a small forward, he contributed a lot on the defensive end, such as rebounding and blocking, and he could also contribute on offense. So to me, it is very reasonable to select Green as one of the outliers on the Warriors.

# k.sel: size of the reference set; k.nn: number of nearest neighbours (k.nn > k.sel)
score.SOD <- Func.SOD(data = data.scale, k.nn = 10, k.sel = 5, alpha = .8)
score.trans.SOD <- Func.trans(raw.score = score.SOD, method = "SOD")

GoldenStatesWarriors$Name[order(score.trans.SOD, decreasing = TRUE)[1:5]]
## [1] "Stephen Curry"  "Andrew Bogut"   "David Lee"      "Draymond Green"
## [5] "Klay Thompson"
GoldenStatesWarriors[order(score.trans.SOD, decreasing = TRUE)[1:5],]
##             Name Age  G GS   MP  FG  FGA   FGP  3P 3PA   3PP  2P 2PA   2PP
## 2  Stephen Curry  25 78 78 2846 652 1383 0.471 261 615 0.424 391 768 0.509
## 7   Andrew Bogut  29 67 67 1769 235  375 0.627   0   0    NA 235 375 0.627
## 3      David Lee  30 69 67 2288 513  981 0.523   0   1 0.000 513 980 0.523
## 6 Draymond Green  23 82 12 1797 187  459 0.407  55 165 0.333 132 294 0.449
## 1  Klay Thompson  23 81 81 2868 559 1259 0.444 223 535 0.417 336 724 0.464
##    eFGP  FT FTA   FTP ORB DRB TRB AST STL BLK TOV  PF  PTS
## 2 0.566 308 348 0.885  46 288 334 666 128  14 294 194 1873
## 7 0.627  22  64 0.344 182 489 671 112  47 121  97 210  492
## 3 0.523 231 296 0.780 182 461 643 147  48  26 152 206 1257
## 6 0.467  82 123 0.667  86 323 409 152 102  72  91 231  511
## 1 0.533 147 185 0.795  38 211 249 181  74  37 135 234 1488
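
The overlap between the two rankings can be made explicit with base R set operations. The sketch below simply retypes the two top-5 name vectors printed above. With the package loaded, the same vectors would come from the `order()` calls.

```r
# Top-5 names as printed in the ABOD and SOD outputs
top.ABOD <- c("Stephen Curry", "Dewayne Dedmon", "Andrew Bogut",
              "David Lee", "Klay Thompson")
top.SOD  <- c("Stephen Curry", "Andrew Bogut", "David Lee",
              "Draymond Green", "Klay Thompson")

intersect(top.ABOD, top.SOD)   # flagged by both algorithms
setdiff(top.ABOD, top.SOD)     # flagged by ABOD only
setdiff(top.SOD, top.ABOD)     # flagged by SOD only
```

Four of the five players are shared; ABOD alone flags Dewayne Dedmon and SOD alone flags Draymond Green.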