Error estimation

For the most part, this document will present the functionalities of the function surveysd::calc.stError() which generates point estimates and standard errors for user-supplied estimation functions.

Prerequisites

In order to use a dataset with calc.stError(), several weight columns have to be present. Each weight column corresponds to a bootstrap sample. In the following examples, we will use the data from demo.eusilc() and attach the bootstrap weights using draw.bootstrap() and recalib(). Please refer to the documentation of those functions for more detail.

library(surveysd)

set.seed(1234)
eusilc <- demo.eusilc(prettyNames = TRUE)
dat_boot <- draw.bootstrap(eusilc, REP = 10, hid = "hid", weights = "pWeight",
                           strata = "region", period = "year")
dat_boot_calib <- recalib(dat_boot, conP.var = "gender", conH.var = "region",
                          epsP = 1e-2, epsH = 2.5e-2, verbose = TRUE)

## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps

dat_boot_calib[, onePerson := nrow(.SD) == 1, by = .(year, hid)]

## print part of the dataset
dat_boot_calib[1:5, .(year, povertyRisk, eqIncome, onePerson, pWeight, w1, w2, w3, w4, w5)]

year	povertyRisk	eqIncome	onePerson	pWeight	w1	w2	w3	w4	w5
2010	FALSE	16090.69	FALSE	504.5696	1.459066	1006.555	1.455580	1.462730	1018.671
2010	FALSE	16090.69	FALSE	504.5696	1.459066	1006.555	1.455580	1.462730	1018.671
2010	FALSE	16090.69	FALSE	504.5696	1.459066	1006.555	1.455580	1.462730	1018.671
2011	FALSE	16090.69	FALSE	504.5696	1.380779	967.380	1.456154	1.516569	1004.087
2011	FALSE	16090.69	FALSE	504.5696	1.380779	967.380	1.456154	1.516569	1004.087

Estimator functions

The parameters fun and var in calc.stError() define the estimator to be used in the error analysis. There are two built-in estimator functions weightedSum() and weightedRatio() which can be used as follows.

povertyRate <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = weightedRatio)
totalIncome <- calc.stError(dat_boot_calib, var = "eqIncome", fun = weightedSum)

Those functions calculate the ratio of persons at risk of povery (in percent) and the total income. By default, the results are calculated seperately for each reference period.

povertyRate$Estimates

year	n	N	val_povertyRisk	stE_povertyRisk
2010	14827	8182222	14.44422	0.5829182
2011	14827	8182222	14.77393	0.6761383
2012	14827	8182222	15.04515	0.4903267
2013	14827	8182222	14.89013	0.4751962
2014	14827	8182222	15.14556	0.5188642
2015	14827	8182222	15.53640	0.4798736
2016	14827	8182222	15.08315	0.3559527
2017	14827	8182222	15.42019	0.5582783

totalIncome$Estimates

year	n	N	val_eqIncome	stE_eqIncome
2010	14827	8182222	162750998071	959510036
2011	14827	8182222	161926931417	900201396
2012	14827	8182222	162576509628	1195965007
2013	14827	8182222	163199507862	1472304274
2014	14827	8182222	163986275009	1525837944
2015	14827	8182222	163416275447	1335217370
2016	14827	8182222	162706205137	1204524680
2017	14827	8182222	164314959107	1320123257

Columns that use the val_ prefix denote the point estimate belonging to the “main weight” of the dataset, which is pWeight in case of the dataset used here.

Columns with the stE_ prefix denote standard errors calculated with bootstrap replicates. The replicates result in using w1, w2, …, w10 instead of pWeight when applying the estimator.

n denotes the number of observations for the year and N denotes the total weight of those persons.

Custom estimators

In order to define a custom estimator function to be used in fun, the function needs to have two arguments like the example below.

## define custom estimator
myWeightedSum <- function(x, w) {
  sum(x*w)
}

## check if results are equal to the one using `suveysd::weightedSum()`
totalIncome2 <- calc.stError(dat_boot_calib, var = "eqIncome", fun = myWeightedSum)
all.equal(totalIncome$Estimates, totalIncome2$Estimates)

## [1] TRUE

The parameters x and w can be assumed to be vectors with equal length with w being numeric and x being the column defined in the var argument. It will be called once for each period (in this case year) and for each weight column (in this case pWeight, w1, w2, …, w10).

Multiple estimators

In case an estimator should be applied to several columns of the dataset, var can be set to a vector containing all necessary columns.

multipleRates <- calc.stError(dat_boot_calib, var = c("povertyRisk", "onePerson"), fun = weightedRatio)
multipleRates$Estimates

year	n	N	val_povertyRisk	stE_povertyRisk	val_onePerson	stE_onePerson
2010	14827	8182222	14.44422	0.5829182	14.85737	0.2568455
2011	14827	8182222	14.77393	0.6761383	14.85737	0.2774971
2012	14827	8182222	15.04515	0.4903267	14.85737	0.2905952
2013	14827	8182222	14.89013	0.4751962	14.85737	0.3452929
2014	14827	8182222	15.14556	0.5188642	14.85737	0.4386546
2015	14827	8182222	15.53640	0.4798736	14.85737	0.3803537
2016	14827	8182222	15.08315	0.3559527	14.85737	0.3099259
2017	14827	8182222	15.42019	0.5582783	14.85737	0.3025539

Here we see the relative number of persons at risk of poverty and the relative number of one-person households.

Grouping

The groups argument can be used to calculate estimators for different subsets of the data. This argument can take the grouping variable as a string that refers to a column name (usually a factor) in dat. If set, all estimators are not only split by the reference period but also by the grouping variable. For simplicity, only one reference period of the above data is used.

dat2 <- subset(dat_boot_calib, year == 2010)
for (att  in c("period", "weights", "b.rep"))
  attr(dat2, att) <- attr(dat_boot_calib, att)

To calculate the ratio of persons at risk of poverty for each federal state of austria, group = "region" can be used.

povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = "region")
povertyRates$Estimates

year	n	N	region	val_povertyRisk	stE_povertyRisk
2010	549	260564	Burgenland	19.53984	1.7963347
2010	733	377355	Vorarlberg	16.53731	3.2774567
2010	924	535451	Salzburg	13.78734	2.2523362
2010	1078	563648	Carinthia	13.08627	1.7427986
2010	1317	701899	Tyrol	15.30819	1.3916058
2010	2295	1167045	Styria	14.37464	1.3485703
2010	2322	1598931	Vienna	17.23468	1.0419000
2010	2804	1555709	Lower Austria	13.84362	1.7019212
2010	2805	1421620	Upper Austria	10.88977	0.8701448
2010	14827	8182222	NA	14.44422	0.5829182

The last column with region = NA denotes the aggregate over all regions. Note that the columns N and n now show the weighted and unweighted number of persons in each region.

Several grouping variables

In case more than one grouping variable is used, there are several options of calling calc.stError() depending on whether combinations of grouping levels should be regarded or not. We will consider the variables gender and region as our grouping variables and show three options on how calc.stError() can be called.

Option 1: All regions and all genders

Calculate the point estimate and standard error for each region and each gender. The number of rows in the output is therefore

\[n_\text{periods}\cdot(n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9 + 2 + 1) = 12.\]

The last row is again the estimate for the whole period.

povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, 
                             group = c("gender", "region"))
povertyRates$Estimates

year	n	N	gender	region	val_povertyRisk	stE_povertyRisk
2010	549	260564	NA	Burgenland	19.53984	1.7963347
2010	733	377355	NA	Vorarlberg	16.53731	3.2774567
2010	924	535451	NA	Salzburg	13.78734	2.2523362
2010	1078	563648	NA	Carinthia	13.08627	1.7427986
2010	1317	701899	NA	Tyrol	15.30819	1.3916058
2010	2295	1167045	NA	Styria	14.37464	1.3485703
2010	2322	1598931	NA	Vienna	17.23468	1.0419000
2010	2804	1555709	NA	Lower Austria	13.84362	1.7019212
2010	2805	1421620	NA	Upper Austria	10.88977	0.8701448
2010	7267	3979572	male	NA	12.02660	0.6294860
2010	7560	4202650	female	NA	16.73351	0.6290441
2010	14827	8182222	NA	NA	14.44422	0.5829182

Option 2: All combinations of `state` and `gender`

Split the data by all cobinations of the two grouping variables. This will result in a larger output-table of the size

\[n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + 1) = 1\cdot(9\cdot2 + 1)= 19.\]

povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, 
                             group = list(c("gender", "region")))
povertyRates$Estimates

year	n	N	gender	region	val_povertyRisk	stE_povertyRisk
2010	261	122741.8	male	Burgenland	17.414524	2.2233926
2010	288	137822.2	female	Burgenland	21.432598	2.0806293
2010	359	182732.9	male	Vorarlberg	12.973259	3.0976410
2010	374	194622.1	female	Vorarlberg	19.883637	3.7435376
2010	440	253143.7	male	Salzburg	9.156964	1.8974748
2010	484	282307.3	female	Salzburg	17.939382	2.5305822
2010	517	268581.4	male	Carinthia	10.552148	2.0566143
2010	561	295066.6	female	Carinthia	15.392924	1.9709443
2010	650	339566.5	male	Tyrol	12.857542	1.0350617
2010	667	362332.5	female	Tyrol	17.604861	2.2101768
2010	1128	571011.7	male	Styria	11.671247	1.5095045
2010	1132	774405.4	male	Vienna	15.590616	1.3336364
2010	1167	596033.3	female	Styria	16.964539	1.3743809
2010	1190	824525.6	female	Vienna	18.778813	0.9295877
2010	1363	684272.5	male	Upper Austria	9.074690	1.1700859
2010	1387	772593.2	female	Lower Austria	16.372949	1.8349672
2010	1417	783115.8	male	Lower Austria	11.348283	1.6423009
2010	1442	737347.5	female	Upper Austria	12.574205	0.7703255
2010	14827	8182222.0	NA	NA	14.444218	0.5829182

Option 3: Cobination of Option 1 and Option 2

In this case, the estimates and standard errors are calculated for

every gender,
every state and
every combination of state and gender.

The number of rows in the output is therefore

\[n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9\cdot2 + 9 + 2 + 1) = 30.\]

povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, 
                             group = list("gender", "region", c("gender", "region")))
povertyRates$Estimates

year	n	N	gender	region	val_povertyRisk	stE_povertyRisk
2010	261	122741.8	male	Burgenland	17.414524	2.2233926
2010	288	137822.2	female	Burgenland	21.432598	2.0806293
2010	359	182732.9	male	Vorarlberg	12.973259	3.0976410
2010	374	194622.1	female	Vorarlberg	19.883637	3.7435376
2010	440	253143.7	male	Salzburg	9.156964	1.8974748
2010	484	282307.3	female	Salzburg	17.939382	2.5305822
2010	517	268581.4	male	Carinthia	10.552148	2.0566143
2010	549	260564.0	NA	Burgenland	19.539836	1.7963347
2010	561	295066.6	female	Carinthia	15.392924	1.9709443
2010	650	339566.5	male	Tyrol	12.857542	1.0350617
2010	667	362332.5	female	Tyrol	17.604861	2.2101768
2010	733	377355.0	NA	Vorarlberg	16.537310	3.2774567
2010	924	535451.0	NA	Salzburg	13.787343	2.2523362
2010	1078	563648.0	NA	Carinthia	13.086268	1.7427986
2010	1128	571011.7	male	Styria	11.671247	1.5095045
2010	1132	774405.4	male	Vienna	15.590616	1.3336364
2010	1167	596033.3	female	Styria	16.964539	1.3743809
2010	1190	824525.6	female	Vienna	18.778813	0.9295877
2010	1317	701899.0	NA	Tyrol	15.308191	1.3916058
2010	1363	684272.5	male	Upper Austria	9.074690	1.1700859
2010	1387	772593.2	female	Lower Austria	16.372949	1.8349672
2010	1417	783115.8	male	Lower Austria	11.348283	1.6423009
2010	1442	737347.5	female	Upper Austria	12.574205	0.7703255
2010	2295	1167045.0	NA	Styria	14.374637	1.3485703
2010	2322	1598931.0	NA	Vienna	17.234683	1.0419000
2010	2804	1555709.0	NA	Lower Austria	13.843623	1.7019212
2010	2805	1421620.0	NA	Upper Austria	10.889773	0.8701448
2010	7267	3979571.7	male	NA	12.026600	0.6294860
2010	7560	4202650.3	female	NA	16.733508	0.6290441
2010	14827	8182222.0	NA	NA	14.444218	0.5829182