Error estimation

2020-02-05

This document mainly describes the function surveysd::calc.stError(), which generates point estimates and standard errors for user-supplied estimation functions.

Prerequisites

In order to use a dataset with calc.stError(), several weight columns have to be present. Each weight column corresponds to a bootstrap sample. In the following examples, we will use the data from demo.eusilc() and attach the bootstrap weights using draw.bootstrap() and recalib(). Please refer to the documentation of those functions for more detail.

library(surveysd)

set.seed(1234)
eusilc <- demo.eusilc(prettyNames = TRUE)
dat_boot <- draw.bootstrap(eusilc, REP = 10, hid = "hid", weights = "pWeight",
                           strata = "region", period = "year")
dat_boot_calib <- recalib(dat_boot, conP.var = "gender", conH.var = "region",
                          epsP = 1e-2, epsH = 2.5e-2, verbose = TRUE)
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps
dat_boot_calib[, onePerson := nrow(.SD) == 1, by = .(year, hid)]

## print part of the dataset
dat_boot_calib[1:5, .(year, povertyRisk, eqIncome, onePerson, pWeight, w1, w2, w3, w4, w5)]
year povertyRisk eqIncome onePerson pWeight w1 w2 w3 w4 w5
2010 FALSE 16090.69 FALSE 504.5696 1.459066 1006.555 1.455580 1.462730 1018.671
2010 FALSE 16090.69 FALSE 504.5696 1.459066 1006.555 1.455580 1.462730 1018.671
2010 FALSE 16090.69 FALSE 504.5696 1.459066 1006.555 1.455580 1.462730 1018.671
2011 FALSE 16090.69 FALSE 504.5696 1.380779 967.380 1.456154 1.516569 1004.087
2011 FALSE 16090.69 FALSE 504.5696 1.380779 967.380 1.456154 1.516569 1004.087

Estimator functions

The parameters fun and var in calc.stError() define the estimator to be used in the error analysis. There are two built-in estimator functions weightedSum() and weightedRatio() which can be used as follows.

povertyRate <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = weightedRatio)
totalIncome <- calc.stError(dat_boot_calib, var = "eqIncome", fun = weightedSum)

These calls calculate the ratio of persons at risk of poverty (in percent) and the total income. By default, the results are calculated separately for each reference period.
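To make the two estimators concrete, here is a minimal base R sketch of what a weighted sum and a weighted ratio (in percent) could look like. The names weighted_sum_sketch() and weighted_ratio_sketch() and the toy vectors are hypothetical stand-ins for illustration. They are not the package source, and details such as the handling of missing values are assumptions.

```r
# Hypothetical stand-ins for the built-in estimators (not the package source).
weighted_sum_sketch <- function(x, w) {
  # weighted total of x
  sum(x * w, na.rm = TRUE)
}

weighted_ratio_sketch <- function(x, w) {
  # share of the total weight carried by rows where x is TRUE, in percent;
  # rows with missing x are dropped (an assumption of this sketch)
  keep <- !is.na(x)
  100 * sum(w[keep & x]) / sum(w[keep])
}

# Toy data: four persons with a logical indicator, an income and a weight
at_risk <- c(TRUE, FALSE, FALSE, TRUE)
income  <- c(100, 200, 300, 400)
w       <- c(10, 20, 30, 40)

ratio_toy <- weighted_ratio_sketch(at_risk, w)  # 100 * (10 + 40) / 100
total_toy <- weighted_sum_sketch(income, w)     # 1000 + 4000 + 9000 + 16000
```

On the toy data the ratio is 50 percent and the weighted total is 30000.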

povertyRate$Estimates
year n N val_povertyRisk stE_povertyRisk
2010 14827 8182222 14.44422 0.5829182
2011 14827 8182222 14.77393 0.6761383
2012 14827 8182222 15.04515 0.4903267
2013 14827 8182222 14.89013 0.4751962
2014 14827 8182222 15.14556 0.5188642
2015 14827 8182222 15.53640 0.4798736
2016 14827 8182222 15.08315 0.3559527
2017 14827 8182222 15.42019 0.5582783
totalIncome$Estimates
year n N val_eqIncome stE_eqIncome
2010 14827 8182222 162750998071 959510036
2011 14827 8182222 161926931417 900201396
2012 14827 8182222 162576509628 1195965007
2013 14827 8182222 163199507862 1472304274
2014 14827 8182222 163986275009 1525837944
2015 14827 8182222 163416275447 1335217370
2016 14827 8182222 162706205137 1204524680
2017 14827 8182222 164314959107 1320123257

Columns that use the val_ prefix denote the point estimate belonging to the “main weight” of the dataset, which is pWeight in case of the dataset used here.

Columns with the stE_ prefix denote standard errors calculated from the bootstrap replicates. Each replicate estimate is obtained by applying the estimator with one of w1, w2, …, w10 in place of pWeight.

n denotes the number of observations for the year and N denotes the total weight of those persons.
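The relationship between the val_, stE_, n and N columns can be sketched on toy data in base R. The data frame toy, the helper ratio_pct() and the use of the sample standard deviation across replicate estimates are assumptions of this sketch. Please check the package documentation for the exact formula used by calc.stError().

```r
# Toy data: one period, four persons, a main weight and three replicate weights
toy <- data.frame(
  atRisk  = c(TRUE, FALSE, FALSE, TRUE),
  pWeight = c(10, 20, 30, 40),
  w1      = c(20, 20, 30, 30),
  w2      = c(10, 30, 30, 30),
  w3      = c(10, 10, 40, 40)
)

# Hypothetical estimator: weighted share of TRUE values, in percent
ratio_pct <- function(x, w) 100 * sum(w[x]) / sum(w)

# val_: the estimator applied with the main weight
val <- ratio_pct(toy$atRisk, toy$pWeight)

# one estimate per bootstrap weight column
reps <- sapply(c("w1", "w2", "w3"),
               function(col) ratio_pct(toy$atRisk, toy[[col]]))

# stE_: spread of the replicate estimates
# (assumed here to be the sample standard deviation)
stE <- sd(reps)

# n: number of observations, N: sum of the main weight
n <- nrow(toy)
N <- sum(toy$pWeight)
```

Here the point estimate is 50, the replicate estimates are 50, 40 and 50, and the resulting standard error is about 5.77.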

Custom estimators

To define a custom estimator function for use in fun, the function needs to take two arguments, x and w.

## [1] TRUE

The parameters x and w can be assumed to be vectors of equal length, with w being numeric and x being the column defined in the var argument. The function is called once for each period (in this case year) and for each weight column (in this case pWeight, w1, w2, …, w10).
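As an illustration of the two-argument contract described above, the following sketch defines a hypothetical custom estimator, myWeightedMean(), and checks it on toy vectors. The function name and the data are made up for this example.

```r
# Hypothetical custom estimator: weighted mean of x.
# x: values of the column named in `var`
# w: one weight column, same length as x
myWeightedMean <- function(x, w) {
  sum(x * w) / sum(w)
}

# Stand-alone check on toy vectors
x <- c(100, 200, 300)
w <- c(1, 1, 2)
mean_toy <- myWeightedMean(x, w)  # (100 + 200 + 600) / 4
```

A function of this shape could then be passed as fun = myWeightedMean, in the same way weightedSum is passed in the calls above.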

Multiple estimators

In case an estimator should be applied to several columns of the dataset, var can be set to a vector containing all necessary columns.

year n N val_povertyRisk stE_povertyRisk val_onePerson stE_onePerson
2010 14827 8182222 14.44422 0.5829182 14.85737 0.2568455
2011 14827 8182222 14.77393 0.6761383 14.85737 0.2774971
2012 14827 8182222 15.04515 0.4903267 14.85737 0.2905952
2013 14827 8182222 14.89013 0.4751962 14.85737 0.3452929
2014 14827 8182222 15.14556 0.5188642 14.85737 0.4386546
2015 14827 8182222 15.53640 0.4798736 14.85737 0.3803537
2016 14827 8182222 15.08315 0.3559527 14.85737 0.3099259
2017 14827 8182222 15.42019 0.5582783 14.85737 0.3025539

Here we see the share of persons at risk of poverty and the share of persons living in one-person households, both in percent.
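Conceptually, passing several columns means that the same estimator is evaluated once per column. A minimal base R sketch on toy data follows. The data frame toy and the helper ratio_pct() are hypothetical, and the val_ naming only mimics the output shown above.

```r
# Toy data with two logical indicator columns and a main weight
toy <- data.frame(
  povertyRisk = c(TRUE, FALSE, FALSE, TRUE),
  onePerson   = c(FALSE, FALSE, TRUE, FALSE),
  pWeight     = c(10, 20, 30, 40)
)

# Hypothetical estimator: weighted share of TRUE values, in percent
ratio_pct <- function(x, w) 100 * sum(w[x]) / sum(w)

# Apply the same estimator to every column listed in vars
vars <- c("povertyRisk", "onePerson")
vals <- sapply(vars, function(v) ratio_pct(toy[[v]], toy$pWeight))
names(vals) <- paste0("val_", vars)
```

On the toy data this yields 50 for val_povertyRisk and 30 for val_onePerson.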

Grouping

The group argument can be used to calculate estimators for different subsets of the data. This argument can take the grouping variable as a string that refers to a column name (usually a factor) in dat. If set, all estimators are split not only by the reference period but also by the grouping variable. For simplicity, only one reference period of the above data is used.

dat2 <- subset(dat_boot_calib, year == 2010)
for (att in c("period", "weights", "b.rep"))
  attr(dat2, att) <- attr(dat_boot_calib, att)

To calculate the ratio of persons at risk of poverty for each federal state of Austria, group = "region" can be used.

povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = "region")
povertyRates$Estimates
year n N region val_povertyRisk stE_povertyRisk
2010 549 260564 Burgenland 19.53984 1.7963347
2010 733 377355 Vorarlberg 16.53731 3.2774567
2010 924 535451 Salzburg 13.78734 2.2523362
2010 1078 563648 Carinthia 13.08627 1.7427986
2010 1317 701899 Tyrol 15.30819 1.3916058
2010 2295 1167045 Styria 14.37464 1.3485703
2010 2322 1598931 Vienna 17.23468 1.0419000
2010 2804 1555709 Lower Austria 13.84362 1.7019212
2010 2805 1421620 Upper Austria 10.88977 0.8701448
2010 14827 8182222 NA 14.44422 0.5829182

The last row with region = NA denotes the aggregate over all regions. Note that the columns N and n now show the weighted and unweighted number of persons in each region.
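The structure of the grouped output (one row per group plus an aggregate row with NA in the grouping column) can be reproduced on toy data in base R. The data frame toy and the helpers ratio_pct() and summarise_rows() are hypothetical and only mirror the layout of the table above, without standard errors.

```r
# Toy data: five persons in two regions
toy <- data.frame(
  region      = c("A", "A", "B", "B", "B"),
  povertyRisk = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  pWeight     = c(10, 30, 20, 20, 20)
)

# Hypothetical estimator: weighted share of TRUE values, in percent
ratio_pct <- function(x, w) 100 * sum(w[x]) / sum(w)

# One output row: unweighted count n, weighted count N and the point estimate
summarise_rows <- function(d) {
  data.frame(n = nrow(d), N = sum(d$pWeight),
             val_povertyRisk = ratio_pct(d$povertyRisk, d$pWeight))
}

# One row per region ...
by_region <- do.call(rbind, lapply(split(toy, toy$region), summarise_rows))
by_region <- cbind(region = rownames(by_region), by_region)

# ... plus the aggregate over all regions, marked with region = NA
overall <- cbind(region = NA_character_, summarise_rows(toy))
result  <- rbind(by_region, overall)
rownames(result) <- NULL
```

The result has three rows: regions A and B with N of 40 and 60, and the aggregate row with N of 100.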

Several grouping variables

If more than one grouping variable is used, there are several ways to call calc.stError(), depending on whether combinations of grouping levels should be considered or not. We will use the variables gender and region as our grouping variables and show three options for calling calc.stError().

Option 1: All regions and all genders

Calculate the point estimate and standard error for each region and each gender. The number of rows in the output is therefore

\[n_\text{periods}\cdot(n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9 + 2 + 1) = 12.\]

The last row is again the estimate for the whole period.

year n N gender region val_povertyRisk stE_povertyRisk
2010 549 260564 NA Burgenland 19.53984 1.7963347
2010 733 377355 NA Vorarlberg 16.53731 3.2774567
2010 924 535451 NA Salzburg 13.78734 2.2523362
2010 1078 563648 NA Carinthia 13.08627 1.7427986
2010 1317 701899 NA Tyrol 15.30819 1.3916058
2010 2295 1167045 NA Styria 14.37464 1.3485703
2010 2322 1598931 NA Vienna 17.23468 1.0419000
2010 2804 1555709 NA Lower Austria 13.84362 1.7019212
2010 2805 1421620 NA Upper Austria 10.88977 0.8701448
2010 7267 3979572 male NA 12.02660 0.6294860
2010 7560 4202650 female NA 16.73351 0.6290441
2010 14827 8182222 NA NA 14.44422 0.5829182

Option 2: All combinations of state and gender

Split the data by all combinations of the two grouping variables. This results in a larger output table with

\[n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + 1) = 1\cdot(9\cdot2 + 1)= 19.\]

year n N gender region val_povertyRisk stE_povertyRisk
2010 261 122741.8 male Burgenland 17.414524 2.2233926
2010 288 137822.2 female Burgenland 21.432598 2.0806293
2010 359 182732.9 male Vorarlberg 12.973259 3.0976410
2010 374 194622.1 female Vorarlberg 19.883637 3.7435376
2010 440 253143.7 male Salzburg 9.156964 1.8974748
2010 484 282307.3 female Salzburg 17.939382 2.5305822
2010 517 268581.4 male Carinthia 10.552148 2.0566143
2010 561 295066.6 female Carinthia 15.392924 1.9709443
2010 650 339566.5 male Tyrol 12.857542 1.0350617
2010 667 362332.5 female Tyrol 17.604861 2.2101768
2010 1128 571011.7 male Styria 11.671247 1.5095045
2010 1132 774405.4 male Vienna 15.590616 1.3336364
2010 1167 596033.3 female Styria 16.964539 1.3743809
2010 1190 824525.6 female Vienna 18.778813 0.9295877
2010 1363 684272.5 male Upper Austria 9.074690 1.1700859
2010 1387 772593.2 female Lower Austria 16.372949 1.8349672
2010 1417 783115.8 male Lower Austria 11.348283 1.6423009
2010 1442 737347.5 female Upper Austria 12.574205 0.7703255
2010 14827 8182222.0 NA NA 14.444218 0.5829182

Option 3: Combination of Option 1 and Option 2

In this case, the estimates and standard errors are calculated for

  • every gender,
  • every state and
  • every combination of state and gender.

The number of rows in the output is therefore

\[n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9\cdot2 + 9 + 2 + 1) = 30.\]

year n N gender region val_povertyRisk stE_povertyRisk
2010 261 122741.8 male Burgenland 17.414524 2.2233926
2010 288 137822.2 female Burgenland 21.432598 2.0806293
2010 359 182732.9 male Vorarlberg 12.973259 3.0976410
2010 374 194622.1 female Vorarlberg 19.883637 3.7435376
2010 440 253143.7 male Salzburg 9.156964 1.8974748
2010 484 282307.3 female Salzburg 17.939382 2.5305822
2010 517 268581.4 male Carinthia 10.552148 2.0566143
2010 549 260564.0 NA Burgenland 19.539836 1.7963347
2010 561 295066.6 female Carinthia 15.392924 1.9709443
2010 650 339566.5 male Tyrol 12.857542 1.0350617
2010 667 362332.5 female Tyrol 17.604861 2.2101768
2010 733 377355.0 NA Vorarlberg 16.537310 3.2774567
2010 924 535451.0 NA Salzburg 13.787343 2.2523362
2010 1078 563648.0 NA Carinthia 13.086268 1.7427986
2010 1128 571011.7 male Styria 11.671247 1.5095045
2010 1132 774405.4 male Vienna 15.590616 1.3336364
2010 1167 596033.3 female Styria 16.964539 1.3743809
2010 1190 824525.6 female Vienna 18.778813 0.9295877
2010 1317 701899.0 NA Tyrol 15.308191 1.3916058
2010 1363 684272.5 male Upper Austria 9.074690 1.1700859
2010 1387 772593.2 female Lower Austria 16.372949 1.8349672
2010 1417 783115.8 male Lower Austria 11.348283 1.6423009
2010 1442 737347.5 female Upper Austria 12.574205 0.7703255
2010 2295 1167045.0 NA Styria 14.374637 1.3485703
2010 2322 1598931.0 NA Vienna 17.234683 1.0419000
2010 2804 1555709.0 NA Lower Austria 13.843623 1.7019212
2010 2805 1421620.0 NA Upper Austria 10.889773 0.8701448
2010 7267 3979571.7 male NA 12.026600 0.6294860
2010 7560 4202650.3 female NA 16.733508 0.6290441
2010 14827 8182222.0 NA NA 14.444218 0.5829182
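The row counts of the three options can be checked with a short base R sketch. The list layout used to describe each option (grouping_options) and the helper count_rows() are conventions of this sketch only and are not taken from the package. The sketch also assumes that every combination of levels actually occurs in the data, as is the case in the tables above.

```r
# Number of levels per grouping variable in the example data
levels_by_var <- list(region = 9, gender = 2)

# Each option is described as a list of grouping sets
# (this layout is the sketch's own convention)
grouping_options <- list(
  option1 = list("region", "gender"),
  option2 = list(c("region", "gender")),
  option3 = list("region", "gender", c("region", "gender"))
)

count_rows <- function(sets, n_periods = 1) {
  # rows contributed by one grouping set: product of the level counts
  per_set <- sapply(sets, function(vars) prod(unlist(levels_by_var[vars])))
  # + 1 for the aggregate row of each period
  n_periods * (sum(per_set) + 1)
}

row_counts <- sapply(grouping_options, count_rows)
```

This gives 12, 19 and 30 rows for options 1, 2 and 3, matching the three formulas and the tables above.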