---
title: "Using Outlier Detection Algorithms to Analyze NBA Players"
author: "Cheng Fan"
date: "April 1, 2015"
---
#Section 1: Overview
This is a short tutorial on using outlier detection algorithms to analyze NBA players. The data set, GoldenStatesWarriors, is included in the package. It contains the statistics of all 18 players on the Golden State Warriors roster during the 2013-2014 season. The data has 27 variables, which specify the following attributes: player name (Name), age (Age), games played (G), games started (GS), minutes played (MP), field goals made (FG), field goal attempts (FGA), field goal percentage (FGP), 3-pointers made (3P), 3-pointer attempts (3PA), 3-pointer percentage (3PP), 2-pointers made (2P), 2-pointer attempts (2PA), 2-pointer percentage (2PP), effective field goal percentage (eFGP), free throws made (FT), free throw attempts (FTA), free throw percentage (FTP), offensive rebounds (ORB), defensive rebounds (DRB), total rebounds (TRB), assists (AST), steals (STL), blocks (BLK), turnovers (TOV), personal fouls (PF), and total points (PTS).
A summary of the data, as produced by `summary(GoldenStatesWarriors)`, is shown below:
## Name Age G GS
## Length:18 Min. :21.00 Min. : 4.00 Min. : 0.00
## Class :character 1st Qu.:23.25 1st Qu.:24.00 1st Qu.: 0.00
## Mode :character Median :25.00 Median :44.00 Median : 2.00
## Mean :26.33 Mean :47.22 Mean :22.78
## 3rd Qu.:29.00 3rd Qu.:75.75 3rd Qu.:53.25
## Max. :35.00 Max. :82.00 Max. :81.00
##
## MP FG FGA FGP
## Min. : 6.0 Min. : 0.00 Min. : 1.00 Min. :0.0000
## 1st Qu.: 172.8 1st Qu.: 14.75 1st Qu.: 50.75 1st Qu.:0.3775
## Median : 772.0 Median :132.00 Median : 290.50 Median :0.4120
## Mean :1101.7 Mean :179.78 Mean : 389.17 Mean :0.4044
## 3rd Qu.:1979.2 3rd Qu.:231.25 3rd Qu.: 458.75 3rd Qu.:0.4733
## Max. :2868.0 Max. :652.00 Max. :1383.00 Max. :0.6270
##
## 3P 3PA 3PP 2P
## Min. : 0.00 Min. : 0.00 Min. :0.0000 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.25 1st Qu.:0.2560 1st Qu.: 10.0
## Median : 9.00 Median : 35.00 Median :0.3220 Median :112.0
## Mean : 43.00 Mean :113.17 Mean :0.2718 Mean :136.8
## 3rd Qu.: 51.25 3rd Qu.:155.75 3rd Qu.:0.3470 3rd Qu.:200.5
## Max. :261.00 Max. :615.00 Max. :0.4240 Max. :513.0
## NA's :5
## 2PA 2PP eFGP FT
## Min. : 1.0 Min. :0.0000 Min. :0.0000 Min. : 1.00
## 1st Qu.: 22.5 1st Qu.:0.4412 1st Qu.:0.4293 1st Qu.: 5.50
## Median :226.5 Median :0.4595 Median :0.4755 Median : 29.00
## Mean :276.0 Mean :0.4395 Mean :0.4458 Mean : 72.39
## 3rd Qu.:402.0 3rd Qu.:0.4988 3rd Qu.:0.5182 3rd Qu.:107.25
## Max. :980.0 Max. :0.6270 Max. :0.6270 Max. :308.00
##
## FTA FTP ORB DRB
## Min. : 2.00 Min. :0.3440 Min. : 0.00 Min. : 0.0
## 1st Qu.: 8.75 1st Qu.:0.5523 1st Qu.: 5.00 1st Qu.: 23.5
## Median : 53.50 Median :0.6925 Median : 28.50 Median :105.0
## Mean : 96.17 Mean :0.6692 Mean : 49.78 Mean :156.6
## 3rd Qu.:133.50 3rd Qu.:0.7913 3rd Qu.: 80.25 3rd Qu.:243.8
## Max. :348.00 Max. :0.8850 Max. :182.00 Max. :489.0
##
## TRB AST STL BLK
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 29.5 1st Qu.: 14.5 1st Qu.: 4.50 1st Qu.: 2.50
## Median :152.5 Median : 44.5 Median : 13.50 Median : 10.00
## Mean :206.4 Mean :106.2 Mean : 35.67 Mean : 22.61
## 3rd Qu.:306.5 3rd Qu.:139.2 3rd Qu.: 60.75 3rd Qu.: 32.00
## Max. :671.0 Max. :666.0 Max. :128.00 Max. :121.00
##
## TOV PF PTS
## Min. : 0.00 Min. : 0.00 Min. : 1.0
## 1st Qu.: 13.75 1st Qu.: 21.25 1st Qu.: 41.5
## Median : 51.00 Median : 73.00 Median : 351.5
## Mean : 68.06 Mean : 99.11 Mean : 474.9
## 3rd Qu.: 95.50 3rd Qu.:185.00 3rd Qu.: 568.8
## Max. :294.00 Max. :234.00 Max. :1873.0
##
#Section 2: Data preprocessing
The variables in the data clearly have different scales: PTS, for example, ranges from 1 to 1873, while FGP lies between 0 and 0.627. Before applying outlier detection algorithms, it is therefore important to normalize the raw data. Z-normalization is applied to every column except the first, which holds the player name. Note that the raw data contains NAs: five players did not attempt any 3-pointers, so their 3-pointer percentage is undefined. The column means and standard deviations are computed with these NAs removed, and once the data is scaled, the missing values are filled with 0. Because every scaled column has mean 0, this is equivalent to imputing the column mean.
library(plyr)  # provides aaply()
data.scale <- t(aaply(.data = as.matrix(GoldenStatesWarriors[, -1]), .margins = 2, .fun = function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)))  # z-score each column, ignoring NAs
summary(data.scale)
## Age G GS MP
## Min. :-1.3690 Min. :-1.5584 Min. :-0.7171 Min. :-1.0701
## 1st Qu.:-0.7915 1st Qu.:-0.8373 1st Qu.:-0.7171 1st Qu.:-0.9073
## Median :-0.3423 Median :-0.1162 Median :-0.6541 Median :-0.3220
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6845 3rd Qu.: 1.0286 3rd Qu.: 0.9593 3rd Qu.: 0.8571
## Max. : 2.2247 Max. : 1.2539 Max. : 1.8329 Max. : 1.7251
##
## FG FGA FGP 3P
## Min. :-0.8806 Min. :-0.9007 Min. :-3.0511 Min. :-0.5645
## 1st Qu.:-0.8084 1st Qu.:-0.7853 1st Qu.:-0.2033 1st Qu.:-0.5645
## Median :-0.2340 Median :-0.2289 Median : 0.0570 Median :-0.4463
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2521 3rd Qu.: 0.1615 3rd Qu.: 0.5191 3rd Qu.: 0.1083
## Max. : 2.3131 Max. : 2.3061 Max. : 1.6789 Max. : 2.8617
##
## 3PA 3PP 2P 2PA
## Min. :-0.6253 Min. :-1.9685 Min. :-0.8990 Min. :-0.9170
## 1st Qu.:-0.6239 1st Qu.:-0.1142 1st Qu.:-0.8332 1st Qu.:-0.8453
## Median :-0.4319 Median : 0.3638 Median :-0.1628 Median :-0.1651
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2353 3rd Qu.: 0.5449 3rd Qu.: 0.4188 3rd Qu.: 0.4202
## Max. : 2.7729 Max. : 1.1026 Max. : 2.4727 Max. : 2.3476
## NA's :5
## 2PP eFGP FT FTA
## Min. :-3.26327 Min. :-3.2123 Min. :-0.8140 Min. :-0.9027
## 1st Qu.: 0.01299 1st Qu.:-0.1195 1st Qu.:-0.7627 1st Qu.:-0.8380
## Median : 0.14850 Median : 0.2138 Median :-0.4947 Median :-0.4090
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.43993 3rd Qu.: 0.5218 3rd Qu.: 0.3975 3rd Qu.: 0.3579
## Max. : 1.39218 Max. : 1.3053 Max. : 2.6865 Max. : 2.4141
##
## FTP ORB DRB TRB
## Min. :-2.0177 Min. :-0.8535 Min. :-0.9869 Min. :-0.9664
## 1st Qu.:-0.7255 1st Qu.:-0.7677 1st Qu.:-0.8388 1st Qu.:-0.8283
## Median : 0.1448 Median :-0.3648 Median :-0.3252 Median :-0.2523
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7575 3rd Qu.: 0.5225 3rd Qu.: 0.5491 3rd Qu.: 0.4688
## Max. : 1.3393 Max. : 2.2670 Max. : 2.0945 Max. : 2.1756
##
## AST STL BLK TOV
## Min. :-0.6684 Min. :-0.8757 Min. :-0.7267 Min. :-0.9301
## 1st Qu.:-0.5771 1st Qu.:-0.7652 1st Qu.:-0.6464 1st Qu.:-0.7422
## Median :-0.3884 Median :-0.5443 Median :-0.4053 Median :-0.2331
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2078 3rd Qu.: 0.6159 3rd Qu.: 0.3018 3rd Qu.: 0.3751
## Max. : 3.5223 Max. : 2.2670 Max. : 3.1623 Max. : 3.0880
##
## PF PTS
## Min. :-1.1330 Min. :-0.8590
## 1st Qu.:-0.8901 1st Qu.:-0.7856
## Median :-0.2985 Median :-0.2237
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.9818 3rd Qu.: 0.1700
## Max. : 1.5420 Max. : 2.5339
##
data.scale[is.na(data.scale)] <- 0
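As a cross-check on the preprocessing above, the same z-normalization can be reproduced with base R's `scale()`, which also skips NAs when computing column means and standard deviations. The sketch below is only illustrative: `toy`, `z.manual`, `z.scale`, and `z.filled` are hypothetical names, and the matrix holds just the PTS and 3PP values of four players from the data.

```r
# Toy matrix: PTS and 3PP for four players (two of them never attempted a 3-pointer)
toy <- cbind(PTS = c(1873, 1, 492, 1257),
             "3PP" = c(0.424, NA, NA, 0.000))

# Manual z-score per column, skipping NAs (same formula as the aaply() call)
z.manual <- apply(toy, 2, function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))

# scale() centers and scales each column, also skipping NAs
z.scale <- scale(toy)
same <- isTRUE(all.equal(z.manual, z.scale, check.attributes = FALSE))

# Filling the remaining NAs with 0 places those players at the column mean
z.filled <- z.scale
z.filled[is.na(z.filled)] <- 0
colMeans(z.filled)  # both columns still average to (numerically) zero
```

If `same` is `TRUE`, the two approaches agree, so `scale(as.matrix(GoldenStatesWarriors[, -1]))` would be a possible plyr-free alternative.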
#Section 3: Use of outlier detection algorithms
The basic version of ABOD (angle-based outlier detection) uses all the other observations in the data to evaluate the outlierness of a given observation, so the computation can be very time-consuming. The approximate version of ABOD uses only a subset of the observations, which is faster but sometimes less reliable. Because this data set is small (18 players), the basic version of ABOD is used here. The raw ABOD scores are then transformed into probabilities using the outlier score unification scheme proposed by Kriegel, Kröger, Schubert, and Zimek in 2011.
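To make the angle-based idea concrete, here is a minimal sketch of a simplified, unweighted angle-based outlier factor. It is not the code behind `Func.ABOD`, whose internals are not shown in this tutorial; `abof.sketch` and `toy` are hypothetical names. For each point A, the sketch looks at every pair of other points (B, C), computes the scalar product of AB and AC scaled by the squared lengths of both vectors, and takes the variance of these values. A point far from the rest sees all other points in roughly the same direction, so its variance is small.

```r
# Simplified angle-based outlier factor (assumption: unweighted variance,
# every other point is used, no duplicate rows in x)
abof.sketch <- function(x) {
  n <- nrow(x)
  sapply(seq_len(n), function(a) {
    diffs <- sweep(x[-a, , drop = FALSE], 2, x[a, ])  # vectors from A to every other point
    sq.len <- rowSums(diffs^2)                        # squared lengths |AB|^2
    gram <- diffs %*% t(diffs)                        # all scalar products <AB, AC>
    vals <- gram / outer(sq.len, sq.len)              # divide by |AB|^2 * |AC|^2
    var(vals[upper.tri(vals)])                        # variance over all pairs B != C
  })
}

# Toy data: a 4 x 4 grid of points plus one point far away (row 17)
toy <- rbind(as.matrix(expand.grid(x = 0:3, y = 0:3)), c(20, 20))
scores <- abof.sketch(toy)
most.outlying <- which.min(scores)  # smallest variance = strongest outlier candidate
```

Note the direction of the score: unlike the transformed probabilities used below, a lower raw value here means a more outlying point.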
score.ABOD <- Func.ABOD(data = data.scale, basic = TRUE)
score.trans.ABOD <- Func.trans(raw.score = score.ABOD, method = "ABOD")
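The internals of `Func.trans` are not shown in this tutorial. For intuition only, one recipe in the spirit of the unification scheme is to invert the raw scores logarithmically (for ABOD, a low raw score means a more outlying point) and then apply Gaussian scaling so that the result lies in [0, 1]. Whether `Func.trans` does exactly this is an assumption; `unify.sketch`, `raw`, and `prob` are hypothetical, and the raw scores are made up.

```r
# Sketch of a score-to-probability transformation (assumed recipe, not the package code)
unify.sketch <- function(raw) {
  reg <- -log(raw / max(raw))       # regularize: small raw score -> large value, minimum is 0
  z <- (reg - mean(reg)) / sd(reg)  # standardize the regularized scores
  pmax(0, 2 * pnorm(z) - 1)         # Gaussian scaling; 2 * pnorm(z) - 1 equals erf(z / sqrt(2))
}

raw <- c(0.50, 0.40, 0.45, 0.001)   # made-up raw scores; the last one is by far the smallest
prob <- unify.sketch(raw)
```

After the transformation, larger values indicate stronger outliers, which is why the results below are ranked in decreasing order.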
The top 5 identified outliers are Stephen Curry, Dewayne Dedmon, Andrew Bogut, David Lee, and Klay Thompson. Except for Dewayne Dedmon, all of them were starters for the Golden State Warriors during that season. Stephen Curry and Klay Thompson are selected as outliers because of their excellent scoring and shooting. In addition, Curry led the team in assists (666). Andrew Bogut is a defensive anchor who contributes a lot of rebounds and blocks. David Lee shows his importance at both ends of the floor: he can score as well as rebound. The remaining outlier is Dewayne Dedmon. This is reasonable because he played only 4 games (6 minutes in total) for the Warriors, so his statistics sit at the low extreme of almost every variable.
GoldenStatesWarriors$Name[order(score.trans.ABOD, decreasing = TRUE)[1:5]]
## [1] "Stephen Curry" "Dewayne Dedmon" "Andrew Bogut" "David Lee"
## [5] "Klay Thompson"
GoldenStatesWarriors[order(score.trans.ABOD, decreasing = TRUE)[1:5],]
## Name Age G GS MP FG FGA FGP 3P 3PA 3PP 2P 2PA
## 2 Stephen Curry 25 78 78 2846 652 1383 0.471 261 615 0.424 391 768
## 18 Dewayne Dedmon 24 4 0 6 0 1 0.000 0 0 NA 0 1
## 7 Andrew Bogut 29 67 67 1769 235 375 0.627 0 0 NA 235 375
## 3 David Lee 30 69 67 2288 513 981 0.523 0 1 0.000 513 980
## 1 Klay Thompson 23 81 81 2868 559 1259 0.444 223 535 0.417 336 724
## 2PP eFGP FT FTA FTP ORB DRB TRB AST STL BLK TOV PF PTS
## 2 0.509 0.566 308 348 0.885 46 288 334 666 128 14 294 194 1873
## 18 0.000 0.000 1 2 0.500 0 0 0 0 0 0 0 1 1
## 7 0.627 0.627 22 64 0.344 182 489 671 112 47 121 97 210 492
## 3 0.523 0.523 231 296 0.780 182 461 643 147 48 26 152 206 1257
## 1 0.464 0.533 147 185 0.795 38 211 249 181 74 37 135 234 1488
Next, the SOD algorithm is used. Given the small size of the data, a reference set of 5 observations (k.sel = 5) is reasonable. k.nn is set to 10; note that k.nn should be larger than k.sel. alpha is left at its default of 0.8, the value recommended by the authors of the algorithm.
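The roles of the three parameters can be illustrated with a minimal sketch of the SOD idea. This is a simplified reading of the algorithm, not the code behind `Func.SOD`; `sod.sketch` and `toy` are hypothetical names. For each point, the k.sel points that share the most of their k.nn nearest neighbours with it form the reference set. Attributes whose variance in the reference set is below alpha times the average variance define the relevant subspace, and the score is the distance to the reference mean in that subspace, divided by the number of relevant attributes.

```r
sod.sketch <- function(x, k.nn, k.sel, alpha = 0.8) {
  n <- nrow(x)
  d <- as.matrix(dist(x))
  # k.nn nearest neighbours of every point (the point itself is excluded)
  knn <- lapply(seq_len(n), function(i) setdiff(order(d[i, ]), i)[seq_len(k.nn)])
  sapply(seq_len(n), function(i) {
    # shared-nearest-neighbour similarity between point i and every point
    snn <- sapply(seq_len(n), function(j) length(intersect(knn[[i]], knn[[j]])))
    snn[i] <- -1                                          # never pick the point itself
    ref <- order(snn, decreasing = TRUE)[seq_len(k.sel)]  # reference set
    ref.pts <- x[ref, , drop = FALSE]
    mu <- colMeans(ref.pts)
    v <- colMeans(sweep(ref.pts, 2, mu)^2)                # per-attribute variance in the reference set
    keep <- v < alpha * mean(v)                           # attributes with unusually low variance
    if (!any(keep)) return(0)
    sqrt(sum((x[i, keep] - mu[keep])^2)) / sum(keep)      # distance in the subspace, per kept attribute
  })
}

# Toy data: 12 points on the line y = 0, except point 6, which is lifted to y = 3
toy <- cbind(x = 1:12, y = replace(rep(0, 12), 6, 3))
scores <- sod.sketch(toy, k.nn = 4, k.sel = 3)
top <- which.max(scores)
```

In this toy example the reference sets vary widely in x but hardly in y, so a point that deviates in y stands out even though it is unremarkable in x. The requirement that k.nn exceed k.sel is kept here as well (4 versus 3).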
The top 5 outliers are very similar to those identified by ABOD. The only difference is that, instead of the seldom-used Dewayne Dedmon, the SOD algorithm picks another important Warriors player, Draymond Green. Although Green started only 12 of his 82 games, he played more minutes than the starter Bogut (1797 versus 1769) and served as the team's sixth man. During the 2013-2014 season, Green really came out strong. As a small forward, he contributed a lot on the defensive end, for example in rebounding and blocking, and he also contributed on offense. So to me, it is very reasonable to select Green as one of the outliers on the Warriors.
score.SOD <- Func.SOD(data = data.scale, k.nn = 10, k.sel = 5, alpha = .8)
score.trans.SOD <- Func.trans(raw.score = score.SOD, method = "SOD")
GoldenStatesWarriors$Name[order(score.trans.SOD, decreasing = TRUE)[1:5]]
## [1] "Stephen Curry" "Andrew Bogut" "David Lee" "Draymond Green"
## [5] "Klay Thompson"
GoldenStatesWarriors[order(score.trans.SOD, decreasing = TRUE)[1:5],]
## Name Age G GS MP FG FGA FGP 3P 3PA 3PP 2P 2PA 2PP
## 2 Stephen Curry 25 78 78 2846 652 1383 0.471 261 615 0.424 391 768 0.509
## 7 Andrew Bogut 29 67 67 1769 235 375 0.627 0 0 NA 235 375 0.627
## 3 David Lee 30 69 67 2288 513 981 0.523 0 1 0.000 513 980 0.523
## 6 Draymond Green 23 82 12 1797 187 459 0.407 55 165 0.333 132 294 0.449
## 1 Klay Thompson 23 81 81 2868 559 1259 0.444 223 535 0.417 336 724 0.464
## eFGP FT FTA FTP ORB DRB TRB AST STL BLK TOV PF PTS
## 2 0.566 308 348 0.885 46 288 334 666 128 14 294 194 1873
## 7 0.627 22 64 0.344 182 489 671 112 47 121 97 210 492
## 3 0.523 231 296 0.780 182 461 643 147 48 26 152 206 1257
## 6 0.467 82 123 0.667 86 323 409 152 102 72 91 231 511
## 1 0.533 147 185 0.795 38 211 249 181 74 37 135 234 1488