This vignette will explain the most common ways to use the .SD variable in your data.table analyses. It is an adaptation of this answer given on StackOverflow.

1 What is `.SD`?

In the broadest sense, .SD is just shorthand for capturing a variable that comes up frequently in the context of data analysis. It can be understood to stand for Subset, Selfsame, or Self-reference of the Data. That is, .SD is in its most basic guise a reflexive reference to the data.table itself – as we’ll see in examples below, this is particularly helpful for chaining together “queries” (extractions/subsets/etc using [). In particular, this also means that .SD is itself a data.table (with the caveat that it does not allow assignment with :=).

The simpler usage of .SD is for column subsetting (i.e., when .SDcols is specified); as this version is much more straightforward to understand, we’ll cover that first below. The interpretation of .SD in its second usage, grouping scenarios (i.e., when by = or keyby = is specified), is slightly different, conceptually (though at core it’s the same, since, after all, a non-grouped operation is an edge case of grouping with just one group).
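As a quick illustration of the two usages just described, here is a minimal sketch on a made-up table. The table `DT` and its columns `grp`, `x`, and `y` are hypothetical and are not part of the Lahman data used in the rest of this vignette.

```r
library(data.table)

# hypothetical toy table, for illustration only
DT = data.table(grp = c('a', 'a', 'b'), x = 1:3, y = c(2.5, 3.5, 4.5))

# .SD is itself a data.table
sd_is_dt = DT[ , is.data.table(.SD)]

# usage 1: with .SDcols, .SD holds only the named columns
sd_cols = DT[ , names(.SD), .SDcols = c('x', 'y')]

# usage 2: with by =, .SD is the subset of rows for each group in turn
#   (the grouping column itself is left out of .SD)
per_group = DT[ , .(n_rows = nrow(.SD), n_cols = ncol(.SD)), by = grp]
```

In the grouped call, `.SD` has two rows for group `a` and one row for group `b`, and only the columns `x` and `y` in both cases.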

1.1 Loading and Previewing Lahman Data

load('Teams.RData')
setDT(Teams)
Teams
#       yearID lgID teamID franchID divID Rank   G Ghome  W  L DivWin WCWin LgWin WSWin   R   AB    H
#    1:   1871   NA    BS1      BNA  <NA>    3  31    NA 20 10   <NA>  <NA>     N  <NA> 401 1372  426
#    2:   1871   NA    CH1      CNA  <NA>    2  28    NA 19  9   <NA>  <NA>     N  <NA> 302 1196  323
#    3:   1871   NA    CL1      CFC  <NA>    8  29    NA 10 19   <NA>  <NA>     N  <NA> 249 1186  328
#    4:   1871   NA    FW1      KEK  <NA>    7  19    NA  7 12   <NA>  <NA>     N  <NA> 137  746  178
#    5:   1871   NA    NY2      NNA  <NA>    5  33    NA 16 17   <NA>  <NA>     N  <NA> 302 1404  403
#   ---
# 2891:   2018   NL    SLN      STL     C    3 162    81 88 74      N     N     N     N 759 5498 1369
# 2892:   2018   AL    TBA      TBD     E    3 162    81 90 72      N     N     N     N 716 5475 1415
# 2893:   2018   AL    TEX      TEX     W    5 162    81 67 95      N     N     N     N 737 5453 1308
# 2894:   2018   AL    TOR      TOR     E    4 162    81 73 89      N     N     N     N 709 5477 1336
# 2895:   2018   NL    WAS      WSN     E    2 162    81 82 80      N     N     N     N 771 5517 1402
#       X2B X3B  HR  BB   SO  SB CS HBP SF  RA  ER  ERA CG SHO SV IPouts   HA HRA BBA  SOA   E  DP
#    1:  70  37   3  60   19  73 16  NA NA 303 109 3.55 22   1  3    828  367   2  42   23 243  24
#    2:  52  21  10  60   22  69 21  NA NA 241  77 2.76 25   0  1    753  308   6  28   22 229  16
#    3:  35  40   7  26   25  18  8  NA NA 341 116 4.11 23   0  0    762  346  13  53   34 234  15
#    4:  19   8   2  33    9  16  4  NA NA 243  97 5.17 19   1  0    507  261   5  21   17 163   8
#    5:  43  21   1  33   15  46 15  NA NA 313 121 3.72 32   1  0    879  373   7  42   22 235  14
#   ---
# 2891: 248   9 205 525 1380  63 32  80 48 691 622 3.85  1   8 43   4366 1354 144 593 1337 133 151
# 2892: 274  43 150 540 1388 128 51 101 50 646 602 3.74  0  14 52   4345 1236 164 501 1421  85 136
# 2893: 266  24 194 555 1484  74 35  88 34 848 783 4.92  1   5 42   4293 1516 222 491 1121 120 168
# 2894: 320  16 217 499 1387  47 30  58 37 832 772 4.85  0   3 39   4301 1476 208 551 1298 101 138
# 2895: 284  25 191 631 1289 119 33  59 40 682 649 4.04  2   7 40   4338 1320 198 487 1417  64 115
#          FP                    name                          park attendance BPF PPF teamIDBR
#    1: 0.834    Boston Red Stockings           South End Grounds I         NA 103  98      BOS
#    2: 0.829 Chicago White Stockings       Union Base-Ball Grounds         NA 104 102      CHI
#    3: 0.818  Cleveland Forest Citys  National Association Grounds         NA  96 100      CLE
#    4: 0.803    Fort Wayne Kekiongas                Hamilton Field         NA 101 107      KEK
#    5: 0.840        New York Mutuals      Union Grounds (Brooklyn)         NA  90  88      NYU
#   ---
# 2891: 0.978     St. Louis Cardinals             Busch Stadium III    3403587  97  96      STL
# 2892: 0.986          Tampa Bay Rays               Tropicana Field    1154973  97  97      TBR
# 2893: 0.980           Texas Rangers Rangers Ballpark in Arlington    2107107 112 113      TEX
# 2894: 0.983       Toronto Blue Jays                 Rogers Centre    2325281  97  98      TOR
# 2895: 0.989    Washington Nationals                Nationals Park    2529604 106 105      WSN
#       teamIDlahman45 teamIDretro
#    1:            BS1         BS1
#    2:            CH1         CH1
#    3:            CL1         CL1
#    4:            FW1         FW1
#    5:            NY2         NY2
#   ---
# 2891:            SLN         SLN
# 2892:            TBA         TBA
# 2893:            TEX         TEX
# 2894:            TOR         TOR
# 2895:            MON         WAS

load('Pitching.RData')
setDT(Pitching)
Pitching
#         playerID yearID stint teamID lgID  W  L  G GS CG SHO SV IPouts   H  ER HR BB  SO BAOpp
#     1: bechtge01   1871     1    PH1   NA  1  2  3  3  2   0  0     78  43  23  0 11   1    NA
#     2: brainas01   1871     1    WS3   NA 12 15 30 30 30   0  0    792 361 132  4 37  13    NA
#     3: fergubo01   1871     1    NY2   NA  0  0  1  0  0   0  0      3   8   3  0  0   0    NA
#     4: fishech01   1871     1    RC1   NA  4 16 24 24 22   1  0    639 295 103  3 31  15    NA
#     5: fleetfr01   1871     1    NY2   NA  0  1  1  1  1   0  0     27  20  10  0  3   0    NA
#    ---
# 46695: zamorda01   2018     1    NYN   NL  1  0 16  0  0   0  0     27   6   3  1  3  16 0.194
# 46696: zastrro01   2018     1    CHN   NL  1  0  6  0  0   0  0     17   6   3  0  4   3 0.286
# 46697: zieglbr01   2018     1    MIA   NL  1  5 53  0  0   0 10    156  49  23  7 17  37 0.254
# 46698: zieglbr01   2018     2    ARI   NL  1  1 29  0  0   0  0     65  22   9  1  8  13 0.265
# 46699: zimmejo02   2018     1    DET   AL  7  8 25 25  0   0  0    394 140  66 28 26 111 0.269
#          ERA IBB WP HBP BK  BFP GF   R SH SF GIDP
#     1:  7.96  NA  7  NA  0  146  0  42 NA NA   NA
#     2:  4.50  NA  7  NA  0 1291  0 292 NA NA   NA
#     3: 27.00  NA  2  NA  0   14  0   9 NA NA   NA
#     4:  4.35  NA 20  NA  0 1080  1 257 NA NA   NA
#     5: 10.00  NA  0  NA  0   57  0  21 NA NA   NA
#    ---
# 46695:  3.00   1  0   1  0   36  4   3  1  0    1
# 46696:  4.76   0  0   1  0   26  2   3  0  0    0
# 46697:  3.98   4  1   2  0  213 23  25  0  1   11
# 46698:  3.74   2  0   0  0   92  1   9  0  1    3
# 46699:  4.52   0  1   2  0  556  0  76  2  5    4

2 `.SD` on Ungrouped Data

To illustrate what I mean about the reflexive nature of .SD, consider its most banal usage:

Pitching[ , .SD]
#         playerID yearID stint teamID lgID  W  L  G GS CG SHO SV IPouts   H  ER HR BB  SO BAOpp
#     1: bechtge01   1871     1    PH1   NA  1  2  3  3  2   0  0     78  43  23  0 11   1    NA
#     2: brainas01   1871     1    WS3   NA 12 15 30 30 30   0  0    792 361 132  4 37  13    NA
#     3: fergubo01   1871     1    NY2   NA  0  0  1  0  0   0  0      3   8   3  0  0   0    NA
#     4: fishech01   1871     1    RC1   NA  4 16 24 24 22   1  0    639 295 103  3 31  15    NA
#     5: fleetfr01   1871     1    NY2   NA  0  1  1  1  1   0  0     27  20  10  0  3   0    NA
#    ---                                                                                        
# 46695: zamorda01   2018     1    NYN   NL  1  0 16  0  0   0  0     27   6   3  1  3  16 0.194
# 46696: zastrro01   2018     1    CHN   NL  1  0  6  0  0   0  0     17   6   3  0  4   3 0.286
# 46697: zieglbr01   2018     1    MIA   NL  1  5 53  0  0   0 10    156  49  23  7 17  37 0.254
# 46698: zieglbr01   2018     2    ARI   NL  1  1 29  0  0   0  0     65  22   9  1  8  13 0.265
# 46699: zimmejo02   2018     1    DET   AL  7  8 25 25  0   0  0    394 140  66 28 26 111 0.269
#          ERA IBB WP HBP BK  BFP GF   R SH SF GIDP
#     1:  7.96  NA  7  NA  0  146  0  42 NA NA   NA
#     2:  4.50  NA  7  NA  0 1291  0 292 NA NA   NA
#     3: 27.00  NA  2  NA  0   14  0   9 NA NA   NA
#     4:  4.35  NA 20  NA  0 1080  1 257 NA NA   NA
#     5: 10.00  NA  0  NA  0   57  0  21 NA NA   NA
#    ---                                           
# 46695:  3.00   1  0   1  0   36  4   3  1  0    1
# 46696:  4.76   0  0   1  0   26  2   3  0  0    0
# 46697:  3.98   4  1   2  0  213 23  25  0  1   11
# 46698:  3.74   2  0   0  0   92  1   9  0  1    3
# 46699:  4.52   0  1   2  0  556  0  76  2  5    4

That is, Pitching[ , .SD] has simply returned the whole table, i.e., this was an overly verbose way of writing Pitching or Pitching[]:

identical(Pitching, Pitching[ , .SD])
# [1] TRUE

In terms of subsetting, .SD is still a subset of the data, it’s just a trivial one (the set itself).

2.1 Column Subsetting: `.SDcols`

The first way to change what .SD contains is to limit its columns using the .SDcols argument to [:

# W: Wins; L: Losses; G: Games
Pitching[ , .SD, .SDcols = c('W', 'L', 'G')]
#         W  L  G
#     1:  1  2  3
#     2: 12 15 30
#     3:  0  0  1
#     4:  4 16 24
#     5:  0  1  1
#    ---         
# 46695:  1  0 16
# 46696:  1  0  6
# 46697:  1  5 53
# 46698:  1  1 29
# 46699:  7  8 25

This is just for illustration and was pretty boring. But even this simple usage lends itself to a wide variety of highly beneficial / ubiquitous data manipulation operations:

2.2 Column Type Conversion

Column type conversion is a fact of life for data munging. Though fread recently gained the ability to declare the class of each column up front, not all data sets come from fread (e.g. in this vignette) and conversions back and forth among character/factor/numeric types are common. We can use .SD and .SDcols to batch-convert groups of columns to a common type.
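For data that does come in through fread, declaring classes up front might look like the sketch below. The CSV text and the column names `id` and `score` are made up for illustration; `colClasses` and `text` are arguments of fread.

```r
library(data.table)

# made-up CSV text; 'id' has leading zeros we want to preserve
csv_text = "id,score\n007,1.5\n042,2.5\n"

# declare the class of 'id' up front instead of converting afterwards
DT = fread(text = csv_text, colClasses = c(id = 'character'))
sapply(DT, class)
```

When the data arrives some other way, as with the .RData files loaded in this vignette, the conversion has to happen after the fact, which is where .SD comes in.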

We notice that the following columns are stored as character in the Teams data set, but might more logically be stored as factors:

# teamIDBR: Team ID used by Baseball Reference website
# teamIDlahman45: Team ID used in Lahman database version 4.5
# teamIDretro: Team ID used by Retrosheet
fkt = c('teamIDBR', 'teamIDlahman45', 'teamIDretro')
# confirm that they're stored as `character`
Teams[ , sapply(.SD, is.character), .SDcols = fkt]
#       teamIDBR teamIDlahman45    teamIDretro 
#           TRUE           TRUE           TRUE

If you’re confused by the use of sapply here, note that it’s quite similar for base R data.frames:

setDF(Teams) # convert to data.frame for illustration
sapply(Teams[ , fkt], is.character)
#       teamIDBR teamIDlahman45    teamIDretro 
#           TRUE           TRUE           TRUE
setDT(Teams) # convert back to data.table

The key to understanding this syntax is to recall that a data.table (as well as a data.frame) can be considered as a list where each element is a column – thus, sapply/lapply applies the FUN argument (in this case, is.character) to each column and returns the result as sapply/lapply usually would.
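The list-of-columns view can be checked directly. The following is a small sketch on a hypothetical table (`DT` and its columns are invented for this example):

```r
library(data.table)

# hypothetical toy table
DT = data.table(id = c('a', 'b'), n = 1:2, x = c(0.5, 1.5))

dt_is_list = is.list(DT)   # a data.table is a list of columns
n_elements = length(DT)    # so its length is the number of columns

# sapply simplifies the per-column results to a named vector...
col_classes = sapply(DT, class)

# ...while lapply returns a list with one element per column,
#   which j turns into a one-row data.table here
col_is_num = DT[ , lapply(.SD, is.numeric)]
```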

The syntax to now convert these columns to factor is very similar – simply add the := assignment operator:

Teams[ , (fkt) := lapply(.SD, factor), .SDcols = fkt]
# print out the first column to demonstrate success
head(unique(Teams[[fkt[1L]]]))
# [1] BOS CHI CLE KEK NYU ATH
# 101 Levels: ALT ANA ARI ATH ATL BAL BLA BLN BLU BOS BRA BRG BRO BSN BTT BUF BWW CAL CEN CHC ... WSN

Note that we must wrap fkt in parentheses () to force data.table to interpret this as column names, instead of trying to assign a column named 'fkt'.
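To see the difference the parentheses make, consider this minimal sketch on invented tables (`DT`, `DT2`, and the variable `cols` are hypothetical names used only here):

```r
library(data.table)

DT = data.table(a = 1:3, b = 4:6)   # hypothetical toy table
cols = c('a', 'b')

# with parentheses: 'cols' is evaluated, so columns a and b are overwritten
DT[ , (cols) := lapply(.SD, as.numeric), .SDcols = cols]

# without parentheses: a new column literally named 'cols' is created
DT2 = data.table(a = 1:3, b = 4:6)
DT2[ , cols := 'oops']
names(DT2)
```

After these calls, `DT` still has only the columns `a` and `b` (now stored as double), while `DT2` has gained a third column called `cols`.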

Actually, the .SDcols argument is quite flexible; above, we supplied a character vector of column names. In other situations, it is more convenient to supply an integer vector of column positions or a logical vector dictating include/exclude for each column. .SDcols even accepts regular expression-based pattern matching.
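As a compact side-by-side of these forms, here is a sketch on a hypothetical three-column table (the table and its column names are made up); all four calls select the same two columns:

```r
library(data.table)

# hypothetical toy table
DT = data.table(team_id = c('A', 'B'), team_name = c('Ants', 'Bees'),
                wins = c(10L, 7L))

by_name    = DT[ , names(.SD), .SDcols = c('team_id', 'team_name')]
by_index   = DT[ , names(.SD), .SDcols = 1:2]
by_logical = DT[ , names(.SD), .SDcols = c(TRUE, TRUE, FALSE)]
by_pattern = DT[ , names(.SD), .SDcols = patterns('^team')]
```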

For example, we could do the following to convert all factor columns to character:

# while .SDcols accepts a logical vector,
#   := does not, so we need to convert to column
#   positions with which()
fkt_idx = which(sapply(Teams, is.factor))
Teams[ , (fkt_idx) := lapply(.SD, as.character), .SDcols = fkt_idx]
head(unique(Teams[[fkt_idx[1L]]]))
# [1] "NA" "NL" "AA" "UA" "PL" "AL"

Lastly, we can use pattern-based matching in .SDcols to select all columns whose names contain team, and then convert them back to factor:

Teams[ , .SD, .SDcols = patterns('team')]
#       teamID teamIDBR teamIDlahman45 teamIDretro
#    1:    BS1      BOS            BS1         BS1
#    2:    CH1      CHI            CH1         CH1
#    3:    CL1      CLE            CL1         CL1
#    4:    FW1      KEK            FW1         FW1
#    5:    NY2      NYU            NY2         NY2
#   ---                                           
# 2891:    SLN      STL            SLN         SLN
# 2892:    TBA      TBR            TBA         TBA
# 2893:    TEX      TEX            TEX         TEX
# 2894:    TOR      TOR            TOR         TOR
# 2895:    WAS      WSN            MON         WAS

# now convert these columns to factor;
#   value = TRUE in grep() is for the LHS of := to
#   get column names instead of positions
team_idx = grep('team', names(Teams), value = TRUE)
Teams[ , (team_idx) := lapply(.SD, factor), .SDcols = team_idx]

A proviso to the above: explicitly using column numbers (like DT[ , (1) := rnorm(.N)]) is bad practice and can lead to silently corrupted code over time if column positions change. Even implicitly using numbers can be dangerous if we don’t keep smart/strict control over the ordering of when we create the numbered index and when we use it.
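The hazard with implicitly computed positions can be reproduced on a toy table. This is only a sketch: `DT`, `chr_idx`, and `chr_nm` are hypothetical names, and setcolorder stands in for whatever step might reorder columns between creating the index and using it.

```r
library(data.table)

DT = data.table(a = 1:3, b = c('x', 'y', 'z'))   # hypothetical toy table

# index computed now: 'b' is the character column, at position 2
chr_idx = which(sapply(DT, is.character))
chr_nm  = names(chr_idx)

# ...later, some other step reorders the columns by reference
setcolorder(DT, c('b', 'a'))

stale_target = names(DT)[chr_idx]   # position 2 now points at 'a', not 'b'
chr_nm                              # the stored name still identifies 'b'
```

Keeping the column names rather than the positions sidesteps the problem, which is one reason to prefer the names returned by grep(..., value = TRUE) on the left-hand side of :=.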

2.3 Controlling a Model’s Right-Hand Side

Varying model specification is a core feature of robust statistical analysis. Let’s try and predict a pitcher’s ERA (Earned Run Average, a measure of performance) using the small set of covariates available in the Pitching table. How does the (linear) relationship between W (wins) and ERA vary depending on which other covariates are included in the specification?

Here’s a short script leveraging the power of .SD which explores this question:

# this generates a list of the 2^k possible subsets of extra variables
#   for models of the form ERA ~ W + (...)
extra_var = c('yearID', 'teamID', 'G', 'L')
models = unlist(
  lapply(0L:length(extra_var), combn, x = extra_var, simplify = FALSE),
  recursive = FALSE
)

# here are 16 visually distinct colors, taken from the list of 20 here:
#   https://sashat.me/2017/01/11/list-of-20-simple-distinct-colors/
col16 = c('#e6194b', '#3cb44b', '#ffe119', '#0082c8',
          '#f58231', '#911eb4', '#46f0f0', '#f032e6',
          '#d2f53c', '#fabebe', '#008080', '#e6beff',
          '#aa6e28', '#fffac8', '#800000', '#aaffc3')

par(oma = c(2, 0, 0, 0))
lm_coef = sapply(models, function(rhs) {
  # using ERA ~ . and data = .SD, then varying which
  #   columns are included in .SD allows us to perform this
  #   iteration over 16 models succinctly.
  #   coef(.)['W'] extracts the W coefficient from each model fit
  Pitching[ , coef(lm(ERA ~ ., data = .SD))['W'], .SDcols = c('W', rhs)]
})
barplot(lm_coef, names.arg = sapply(models, paste, collapse = '/'),
        main = 'Wins Coefficient\nWith Various Covariates',
        col = col16, las = 2L, cex.names = .8)

The coefficient always has the expected sign (better pitchers tend to have more wins and fewer runs allowed), but the magnitude can vary substantially depending on what else we control for.
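The same pattern can be tried without the Lahman files. The sketch below uses R's built-in mtcars data as a stand-in; the choice of mpg as the response, wt as the coefficient of interest, and hp and qsec as optional covariates is purely illustrative. Unlike the script above, the response is included in .SDcols here, so that `mpg ~ .` resolves entirely within .SD.

```r
library(data.table)

# built-in mtcars stands in for Pitching; mpg plays the role of ERA
#   and wt the role of W (illustrative choices only)
cars = as.data.table(mtcars)
extra_var = c('hp', 'qsec')

# all 2^2 = 4 subsets of extra_var, including the empty one
models = unlist(
  lapply(0L:length(extra_var), combn, x = extra_var, simplify = FALSE),
  recursive = FALSE
)

# coefficient on wt under each specification mpg ~ wt + (subset)
wt_coef = sapply(models, function(rhs) {
  cars[ , coef(lm(mpg ~ ., data = .SD))['wt'], .SDcols = c('mpg', 'wt', rhs)]
})
names(wt_coef) = sapply(models, paste, collapse = '/')
wt_coef
```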

2.4 Conditional Joins

data.table syntax is beautiful for its simplicity and robustness. The syntax x[i] flexibly handles three common approaches to subsetting – when i is a logical vector, x[i] will return those rows of x corresponding to where i is TRUE; when i is another data.table (or a list), a (right) join is performed (in the plain form, using the keys of x and i, otherwise, when on = is specified, using matches of those columns); and when i is a character, it is interpreted as shorthand for x[list(i)], i.e., as a join.
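The three forms of i can be compared on a small keyed table. This is a minimal sketch; `X`, `Y`, and their columns are hypothetical.

```r
library(data.table)

# hypothetical toy table, keyed on id so that the key-based form works
X = data.table(id = c('a', 'b', 'c'), v = 1:3, key = 'id')

# 1. logical i: keep the rows where the condition is TRUE
by_logical = X[v > 1L]

# 2. data.table i: join; the result has one row per row of i, in i's order
Y = data.table(id = c('c', 'a'))
by_join = X[Y, on = 'id']

# 3. character i: shorthand for X[list('b')], a join on X's key
by_char = X['b']
```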

This is great in general, but falls short when we wish to perform a conditional join, wherein the exact nature of the relationship among tables depends on some characteristics of the rows in one or more columns.

This example is admittedly a tad contrived, but illustrates the idea; see here (1, 2) for more.

The goal is to add a column team_performance to the Pitching table that records the team’s performance (rank) of the best pitcher on each team (as measured by the lowest ERA, among pitchers with at least 6 recorded games).

# to exclude pitchers with exceptional performance in a few games,
#   subset first; then define rank of pitchers within their team each year
#   (in general, we should put more care into the 'ties.method' of frank)
Pitching[G > 5, rank_in_team := frank(ERA), by = .(teamID, yearID)]
Pitching[rank_in_team == 1, team_performance :=
           Teams[.SD, Rank, on = c('teamID', 'yearID')]]

Note that the x[y] syntax returns nrow(y) values (i.e., it’s a right join), which is why .SD is on the right in Teams[.SD] (since the RHS of := in this case requires nrow(Pitching[rank_in_team == 1]) values).
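To make the row counts concrete, the same two steps can be run on a pair of tiny invented tables. The tables `teams` and `pitch` and all their values are hypothetical stand-ins for Teams and Pitching.

```r
library(data.table)

# hypothetical stand-ins for Teams and Pitching
teams = data.table(teamID = c('A', 'B'), yearID = 2000L, Rank = c(1L, 4L))
pitch = data.table(
  playerID = c('p1', 'p2', 'p3', 'p4', 'p5'),
  teamID   = c('A', 'A', 'B', 'B', 'B'),
  yearID   = 2000L,
  G        = c(30L, 10L, 3L, 20L, 25L),
  ERA      = c(3.5, 2.9, 0.0, 4.1, 4.8)
)

# rank pitchers within team-year, ignoring those with 5 or fewer games
pitch[G > 5, rank_in_team := frank(ERA), by = .(teamID, yearID)]

# only two rows satisfy rank_in_team == 1, so .SD has two rows and
#   teams[.SD, Rank, ...] returns exactly the two values := needs
pitch[rank_in_team == 1, team_performance :=
        teams[.SD, Rank, on = c('teamID', 'yearID')]]
pitch
```

Here p3 has the lowest ERA on team B but is excluded by the games filter, so p4 is treated as that team's best pitcher; rows that are not ranked first keep NA in team_performance.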

3 Grouped `.SD` operations

3.1 Group Subsetting

# the data is already sorted by year; if it weren't
#   we could do Teams[order(yearID), .SD[.N], by = teamID]
Teams[ , .SD[.N], by = teamID]
#      teamID yearID lgID franchID divID Rank   G Ghome  W  L DivWin WCWin LgWin WSWin   R   AB    H
#   1:    BS1   1875   NA      BNA  <NA>    1  82    NA 71  8   <NA>  <NA>     Y  <NA> 831 3515 1128
#   2:    CH1   1871   NA      CNA  <NA>    2  28    NA 19  9   <NA>  <NA>     N  <NA> 302 1196  323
#   3:    CL1   1872   NA      CFC  <NA>    7  22    NA  6 16   <NA>  <NA>     N  <NA> 174  943  272
#   4:    FW1   1871   NA      KEK  <NA>    7  19    NA  7 12   <NA>  <NA>     N  <NA> 137  746  178
#   5:    NY2   1875   NA      NNA  <NA>    6  71    NA 30 38   <NA>  <NA>     N  <NA> 328 2685  633
#  ---
# 145:    ANA   2004   AL      ANA     W    1 162    81 92 70      Y     N     N     N 836 5675 1603
# 146:    ARI   2018   NL      ARI     W    3 162    81 82 80      N     N     N     N 693 5460 1283
# 147:    MIL   2018   NL      MIL     C    1 163    81 96 67      Y     N     N     N 754 5542 1398
# 148:    TBA   2018   AL      TBD     E    3 162    81 90 72      N     N     N     N 716 5475 1415
# 149:    MIA   2018   NL      FLA     E    5 161    81 63 98      N     N     N     N 589 5488 1303
#      X2B X3B  HR  BB   SO  SB CS HBP SF  RA  ER  ERA CG SHO SV IPouts   HA HRA BBA  SOA   E  DP
#   1: 167  51  15  33   52  93 37  NA NA 343 152 1.87 60  10 17   2196  751   2  33  110 483  56
#   2:  52  21  10  60   22  69 21  NA NA 241  77 2.76 25   0  1    753  308   6  28   22 229  16
#   3:  28   5   0  17   13  12  3  NA NA 254 126 5.70 15   0  0    597  285   6  24   11 184  17
#   4:  19   8   2  33    9  16  4  NA NA 243  97 5.17 19   1  0    507  261   5  21   17 163   8
#   5:  82  21   7  19   47  20 24  NA NA 425 174 2.46 70   3  0   1910  718   4  21   77 526  30
#  ---
# 145: 272  37 162 450  942 143 46  73 41 734 692 4.28  2  11 50   4363 1476 170 502 1164  90 126
# 146: 259  50 176 560 1460  79 25  52 45 644 605 3.72  2   9 39   4389 1313 174 522 1448  75 152
# 147: 252  24 218 537 1458 124 32  58 41 659 606 3.73  0  14 49   4383 1259 173 553 1428 108 141
# 148: 274  43 150 540 1388 128 51 101 50 646 602 3.74  0  14 52   4345 1236 164 501 1421  85 136
# 149: 222  24 128 455 1384  45 31  73 31 809 762 4.76  1  12 30   4326 1388 192 605 1249  83 133
#         FP                    name                         park attendance BPF PPF teamIDBR
#   1: 0.870    Boston Red Stockings          South End Grounds I         NA 103  96      BOS
#   2: 0.829 Chicago White Stockings      Union Base-Ball Grounds         NA 104 102      CHI
#   3: 0.816  Cleveland Forest Citys National Association Grounds         NA  96 100      CLE
#   4: 0.803    Fort Wayne Kekiongas               Hamilton Field         NA 101 107      KEK
#   5: 0.838        New York Mutuals     Union Grounds (Brooklyn)         NA  99 100      NYU
#  ---
# 145: 0.985          Anaheim Angels    Angels Stadium of Anaheim    3375677  97  97      ANA
# 146: 0.988    Arizona Diamondbacks                  Chase Field    2242695 108 107      ARI
# 147: 0.982       Milwaukee Brewers                  Miller Park    2850875 102 101      MIL
# 148: 0.986          Tampa Bay Rays              Tropicana Field    1154973  97  97      TBR
# 149: 0.986           Miami Marlins                 Marlins Park     811104  89  90      MIA
#      teamIDlahman45 teamIDretro
#   1:            BS1         BS1
#   2:            CH1         CH1
#   3:            CL1         CL1
#   4:            FW1         FW1
#   5:            NY2         NY2
#  ---
# 145:            ANA         ANA
# 146:            ARI         ARI
# 147:            ML4         MIL
# 148:            TBA         TBA
# 149:            FLO         MIA

3.2 Group Optima

Teams[ , .SD[which.max(R)], by = teamID]
#      teamID yearID lgID franchID divID Rank   G Ghome   W  L DivWin WCWin LgWin WSWin   R   AB    H
#   1:    BS1   1875   NA      BNA  <NA>    1  82    NA  71  8   <NA>  <NA>     Y  <NA> 831 3515 1128
#   2:    CH1   1871   NA      CNA  <NA>    2  28    NA  19  9   <NA>  <NA>     N  <NA> 302 1196  323
#   3:    CL1   1871   NA      CFC  <NA>    8  29    NA  10 19   <NA>  <NA>     N  <NA> 249 1186  328
#   4:    FW1   1871   NA      KEK  <NA>    7  19    NA   7 12   <NA>  <NA>     N  <NA> 137  746  178
#   5:    NY2   1872   NA      NNA  <NA>    3  56    NA  34 20   <NA>  <NA>     N  <NA> 523 2426  670
#  ---
# 145:    ANA   2000   AL      ANA     W    3 162    81  82 80      N     N     N     N 864 5628 1574
# 146:    ARI   1999   NL      ARI     W    1 162    81 100 62      Y     N     N     N 908 5658 1566
# 147:    MIL   1999   NL      MIL     C    5 161    80  74 87      N     N     N     N 815 5582 1524
# 148:    TBA   2009   AL      TBD     E    3 162    81  84 78      N     N     N     N 803 5462 1434
# 149:    MIA   2017   NL      FLA     E    2 162    78  77 85      N     N     N     N 778 5602 1497
#      X2B X3B  HR  BB   SO  SB CS HBP SF  RA  ER  ERA CG SHO SV IPouts   HA HRA BBA  SOA   E  DP
#   1: 167  51  15  33   52  93 37  NA NA 343 152 1.87 60  10 17   2196  751   2  33  110 483  56
#   2:  52  21  10  60   22  69 21  NA NA 241  77 2.76 25   0  1    753  308   6  28   22 229  16
#   3:  35  40   7  26   25  18  8  NA NA 341 116 4.11 23   0  0    762  346  13  53   34 234  15
#   4:  19   8   2  33    9  16  4  NA NA 243  97 5.17 19   1  0    507  261   5  21   17 163   8
#   5:  87  14   4  58   52  59 22  NA NA 362 172 3.02 54   3  1   1536  622   2  33   46 323  33
#  ---
# 145: 309  34 236 608 1024  93 52  47 43 869 805 5.00  5   3 46   4344 1534 228 662  846 134 182
# 146: 289  46 216 588 1045 137 39  48 60 676 615 3.77 16   9 42   4402 1387 176 543 1198 104 132
# 147: 299  30 165 658 1065  81 33  55 51 886 813 5.07  2   5 40   4328 1618 213 616  987 127 146
# 148: 297  36 199 642 1229 194 61  49 45 754 686 4.33  3   5 41   4282 1421 183 515 1125  98 135
# 149: 271  31 194 486 1282  91 30  67 41 822 772 4.82  1   7 34   4328 1450 193 627 1202  73 156
#         FP                    name                         park attendance BPF PPF teamIDBR
#   1: 0.870    Boston Red Stockings          South End Grounds I         NA 103  96      BOS
#   2: 0.829 Chicago White Stockings      Union Base-Ball Grounds         NA 104 102      CHI
#   3: 0.818  Cleveland Forest Citys National Association Grounds         NA  96 100      CLE
#   4: 0.803    Fort Wayne Kekiongas               Hamilton Field         NA 101 107      KEK
#   5: 0.868        New York Mutuals     Union Grounds (Brooklyn)         NA  93  92      NYU
#  ---
# 145: 0.978          Anaheim Angels   Edison International Field    2066982 102 103      ANA
# 146: 0.983    Arizona Diamondbacks            Bank One Ballpark    3019654 101 101      ARI
# 147: 0.979       Milwaukee Brewers               County Stadium    1701796  99  99      MIL
# 148: 0.983          Tampa Bay Rays              Tropicana Field    1874962  98  97      TBR
# 149: 0.988           Miami Marlins                 Marlins Park    1583014  93  93      MIA
#      teamIDlahman45 teamIDretro
#   1:            BS1         BS1
#   2:            CH1         CH1
#   3:            CL1         CL1
#   4:            FW1         FW1
#   5:            NY2         NY2
#  ---
# 145:            ANA         ANA
# 146:            ARI         ARI
# 147:            ML4         MIL
# 148:            TBA         TBA
# 149:            FLO         MIA

3.3 Grouped Regression

# Overall coefficient for comparison
overall_coef = Pitching[ , coef(lm(ERA ~ W))['W']]
# use the .N > 20 filter to exclude teams with few observations
Pitching[ , if (.N > 20L) .(w_coef = coef(lm(ERA ~ W))['W']), by = teamID
          ][ , hist(w_coef, 20L, las = 1L,
                    xlab = 'Fitted Coefficient on W',
                    ylab = 'Number of Teams', col = 'darkgreen',
                    main = 'Team-Level Distribution\nWin Coefficients on ERA')]
abline(v = overall_coef, lty = 2L, col = 'red')

Using .SD for Data Analysis

2020-07-22

1 What is `.SD`?

1.1 Loading and Previewing Lahman Data

2 `.SD` on Ungrouped Data

2.1 Column Subsetting: `.SDcols`

2.2 Column Type Conversion

2.3 Controlling a Model’s Right-Hand Side

2.4 Conditional Joins

3 Grouped `.SD` operations

3.1 Group Subsetting

3.2 Group Optima

3.3 Grouped Regression
