Analyzing network scale-up data using the networkreporting package

Introduction

The networkreporting package has several tools for analyzing survey data that have been collected using the network scale-up method.

This introduction will assume that you already have the networkreporting package installed. If you don't, please refer to the introductory vignette (“getting started”) for instructions on how to do this.

Review of the network scale-up method

For the purposes of this vignette, we'll assume that you have conducted a survey using network scale-up questions in order to estimate the size of a hidden population. Analytically, using the scale-up estimator involves two steps:

step 1: estimating the size of the survey respondents' personal networks (their degrees)
step 2: estimating the size of the hidden population by combining the estimated network sizes (from step 1) with the number of connections to the hidden population

We'll quickly review each of these steps, and then we'll show how to use the package to carry the estimation out.

Step 1: estimating network sizes

Here, we will use the known population estimator for respondents' degrees (Killworth et al., 1998; Feehan and Salganik, 2016). In order to estimate the degree of the \(i\) th survey respondent, we use

\[ \begin{align} \label{eqn:kpdegree} \hat{d_i} = \sum_{j=1}^{K} y_{ij} \times \frac{N}{\sum_{j=1}^{K} N_j}, \end{align} \]

where \(N\) is the total size of the population, \(N_j\) is the size of the \(j\) th population of known size, and \(y_{ij}\) is the number of connections that survey respondent \(i\) reports between herself and members of the \(j\) th population of known size.

Step 2: estimating hidden population sizes

Once we have the estimates of the respondents' degrees, we use them to produce an estimate for the size of the hidden population:

\[ \begin{align} \label{eqn:nsum} \hat{N}_h = \frac{ \sum_{i \in s} y_{ih} }{ \sum_{i \in s} \hat{d_i} }, \end{align} \]

where \(N_h\) is the size of the population of interest (which we want to estimate), \(s\) is the set of respondents in our sample, and \(\hat{d_i}\) is the estimate of the size of respondent \(i\)'s degree, obtained using the known population method.

Preparing data

In order to use the package, we will assume that you start with two datasets: the first is a survey containing information collected from respondents about their personal networks; the second is information about the sizes of several populations.

The example data for this vignette are provided with the networkreporting package, and can be loaded by typing

library(networkreporting)
library(surveybootstrap)

## column names for connections to hidden population numbers
hidden.q <- c("sex.workers", "msm", "idu", "clients")

## column names for connections to groups of known size
hm.q <- c("widower", "nurse.or.doctor", "male.community.health", "teacher", 
          "woman.smoke", "priest", "civil.servant", "woman.gave.birth", 
          "muslim", "incarcerated", "judge", "man.divorced", "treatedfortb", 
          "nsengimana", "murekatete", "twahirwa", "mukandekezi", "nsabimana", 
          "mukamana", "ndayambaje", "nyiraneza", "bizimana", "nyirahabimana", 
          "ndagijimana", "mukandayisenga", "died")

## size of the entire population
tot.pop.size <- 10718378

The example data include two datasets: one has all of the responses from a network scale-up survey, and the other has the known population sizes for use with the known population estimator.

Preparing the known population data

The demo known population data are in example.knownpop.dat:

example.knownpop.dat

##               known.popn   size
## 1                widower  36147
## 2        nurse.or.doctor   7807
## 3  male.community.health  22000
## 4                teacher  47745
## 5            woman.smoke 119438
## 6                 priest   1004
## 7       woman.gave.birth 256164
## 8                 muslim 195449
## 9           incarcerated  68000
## 10          man.divorced  50698
## 11            nsengimana  32528
## 12            murekatete  30531
## 13              twahirwa  10420
## 14           mukandekezi  10520
## 15             nsabimana  48560
## 16              mukamana  51449
## 17            ndayambaje  22724
## 18             nyiraneza  21705
## 19              bizimana  38497
## 20         nyirahabimana  42727
## 21           ndagijimana  37375
## 22        mukandayisenga  35055

example.knownpop.dat is very simple: one column has a name for each known population, and the other has its toal size. We expect that users will typically start with a small dataset like this one. When using the networkreporting package, it is more useful to have a vector whose entries are known population sizes and whose names are the known population names. The df.to.kpvec function makes it easy for us to create it:

kp.vec <- df.to.kpvec(example.knownpop.dat, kp.var="known.popn", kp.value="size")

kp.vec

##               widower       nurse.or.doctor male.community.health 
##                 36147                  7807                 22000 
##               teacher           woman.smoke                priest 
##                 47745                119438                  1004 
##      woman.gave.birth                muslim          incarcerated 
##                256164                195449                 68000 
##          man.divorced            nsengimana            murekatete 
##                 50698                 32528                 30531 
##              twahirwa           mukandekezi             nsabimana 
##                 10420                 10520                 48560 
##              mukamana            ndayambaje             nyiraneza 
##                 51449                 22724                 21705 
##              bizimana         nyirahabimana           ndagijimana 
##                 38497                 42727                 37375 
##        mukandayisenga 
##                 35055

Finally, we also need to know the total size of the population we are making estimates about. In this case, let's assume that we're working in a country of 10 million people:

# total size of the population
tot.pop.size <- 10e6

Preparing the survey data

Now let's take a look at the demo survey dataset, which is called example.survey:

head(example.survey)

##      id cluster      region indweight  sex age.cat widower nurse.or.doctor
## 1 1.1.1       1 Kigali City  0.330602 Male [25,35)       3               2
## 2 1.1.2       1 Kigali City  0.330602 Male [25,35)       0               2
## 3 1.1.3       1 Kigali City  0.330602 Male [25,35)       2               8
## 4 1.1.4       1 Kigali City  0.330602 Male [25,35)       0               1
## 5 1.1.5       1 Kigali City  0.330602 Male [25,35)       0               0
## 6 1.1.6       1 Kigali City  0.330602 Male [25,35)       7               4
##   male.community.health teacher woman.smoke priest civil.servant
## 1                     1       5           1      0             5
## 2                     1       5           0      0             5
## 3                     0       3           0      0            50
## 4                     0       0           0      0             5
## 5                     0       0           0      0             5
## 6                     0       8           2      0             6
##   woman.gave.birth muslim incarcerated judge man.divorced treatedfortb
## 1                3      4            2     3            2            0
## 2                3      0            2     3            1            0
## 3                4      3            0     0            2            0
## 4                0      0            0     0            0            0
## 5                0      0            0     0            1            0
## 6                3      4            3     0            1            0
##   nsengimana murekatete twahirwa mukandekezi nsabimana mukamana ndayambaje
## 1          0          0        2           1         2        3          1
## 2          3          2        0           0         1        0          0
## 3          0          0        0           1         2        0          0
## 4          1          0        0           0         0        0          0
## 5          0          0        0           0         0        0          0
## 6          1          1        0           0         0        0          0
##   nyiraneza bizimana nyirahabimana ndagijimana mukandayisenga died
## 1         0        2             0           1              0    0
## 2         0        0             0           0              0    1
## 3         0        0             0           0              0    2
## 4         0        0             0           0              0    0
## 5         0        0             0           0              0    0
## 6         0        0             0           0              0    4
##   sex.workers msm idu clients
## 1           0   0   0       2
## 2           0   0   0       1
## 3           0   0   0       0
## 4           0   0   0       0
## 5           0   0   0       0
## 6           0   0   0      10

The columns fall into a few categories:

an id variable for each respondent: id
information related to the sampling design of the survey: cluster, region, and indweight.
demographic characteristics of the respondents: sex and age.cat
responses to questiona bout populations whose total size is known: widower, …, mukandayisenga
questions about hidden populations: died, …, clients

This is the general form that your survey dataset should have.

Topcoding

Many network scale-up studies have topcoded the responses to the aggregate relational data questions. This means that researchers considered any responses above a certain value, called the topcode, to be implausible. Before proceeding with the analysis, researchers substitute the maximum plausible value in for the implausible ones. For example, in many studies, researchers replaced responses with the value 31 or higher with the value 30 before conducting their analysis (see Zheng, Salganik, and Gelman 2006).

We won't discuss whether or not this is advisable here, but this is currently a common practice in scale-up studies. If you wish to follow it, you can use the topcode.data function. For example, let's topcode the responses to the questions about populations of known size to the value 30. First, we'll examine the distribution of the responses before topcoding:

## make a vector with the list of known population names from
## our dataset of known population totals
known.popn.vars <- paste(example.knownpop.dat$known.popn)

## before topcoding: max. response for several popns is > 30
summary(example.survey[,known.popn.vars])

##     widower        nurse.or.doctor   male.community.health
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.000       
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.000       
##  Median : 0.0000   Median : 0.0000   Median : 0.000       
##  Mean   : 0.6101   Mean   : 0.5112   Mean   : 0.724       
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 1.000       
##  Max.   :95.0000   Max.   :40.0000   Max.   :95.000       
##                                                           
##     teacher        woman.smoke         priest        woman.gave.birth
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.000  
##  Median : 0.000   Median : 0.000   Median : 0.0000   Median : 1.000  
##  Mean   : 1.356   Mean   : 1.022   Mean   : 0.1484   Mean   : 1.885  
##  3rd Qu.: 1.000   3rd Qu.: 1.000   3rd Qu.: 0.0000   3rd Qu.: 3.000  
##  Max.   :95.000   Max.   :95.000   Max.   :25.0000   Max.   :30.000  
##                                                                      
##      muslim        incarcerated      man.divorced       nsengimana    
##  Min.   : 0.000   Min.   : 0.0000   Min.   : 0.0000   Min.   :0.0000  
##  1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:0.0000  
##  Median : 0.000   Median : 0.0000   Median : 0.0000   Median :0.0000  
##  Mean   : 2.094   Mean   : 0.4348   Mean   : 0.3367   Mean   :0.3603  
##  3rd Qu.: 1.000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.:0.0000  
##  Max.   :95.000   Max.   :95.0000   Max.   :20.0000   Max.   :8.0000  
##  NA's   :1                                                            
##    murekatete         twahirwa        mukandekezi      nsabimana      
##  Min.   : 0.0000   Min.   : 0.0000   Min.   :0.000   Min.   : 0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:0.000   1st Qu.: 0.0000  
##  Median : 0.0000   Median : 0.0000   Median :0.000   Median : 0.0000  
##  Mean   : 0.3425   Mean   : 0.2394   Mean   :0.165   Mean   : 0.4705  
##  3rd Qu.: 1.0000   3rd Qu.: 0.0000   3rd Qu.:0.000   3rd Qu.: 1.0000  
##  Max.   :12.0000   Max.   :10.0000   Max.   :7.000   Max.   :20.0000  
##                                                                       
##     mukamana         ndayambaje        nyiraneza          bizimana      
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000  
##  Mean   : 0.4144   Mean   : 0.3296   Mean   : 0.2685   Mean   : 0.4331  
##  3rd Qu.: 1.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 1.0000  
##  Max.   :15.0000   Max.   :30.0000   Max.   :10.0000   Max.   :12.0000  
##                                                                         
##  nyirahabimana     ndagijimana      mukandayisenga   
##  Min.   : 0.000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median : 0.000   Median : 0.0000   Median : 0.0000  
##  Mean   : 0.261   Mean   : 0.3279   Mean   : 0.2577  
##  3rd Qu.: 0.000   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
##  Max.   :17.000   Max.   :10.0000   Max.   :20.0000  
##

Several populations, including widower, male.community.health, teacher, woman.smoke, muslim, and incarcerated have maximum values that are very high. (It turns out that 95 is the highest value that could be recorded during the interviews; if respondents said that they were connected to more than 95 people in the group, the interviewers wrote 95 down.)

Now we use the topcode.data function to topcode all of the responses at 30:

example.survey <- topcode.data(example.survey,
                               vars=known.popn.vars,
                               max=30)

## after topcoding: max. response for all popns is 30
summary(example.survey[,known.popn.vars])

##     widower        nurse.or.doctor   male.community.health
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.000       
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.000       
##  Median : 0.0000   Median : 0.0000   Median : 0.000       
##  Mean   : 0.5831   Mean   : 0.5062   Mean   : 0.653       
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 1.000       
##  Max.   :30.0000   Max.   :30.0000   Max.   :30.000       
##                                                           
##     teacher        woman.smoke          priest        woman.gave.birth
##  Min.   : 0.000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.000  
##  Median : 0.000   Median : 0.0000   Median : 0.0000   Median : 1.000  
##  Mean   : 1.216   Mean   : 0.9638   Mean   : 0.1484   Mean   : 1.885  
##  3rd Qu.: 1.000   3rd Qu.: 1.0000   3rd Qu.: 0.0000   3rd Qu.: 3.000  
##  Max.   :30.000   Max.   :30.0000   Max.   :25.0000   Max.   :30.000  
##                                                                       
##      muslim        incarcerated      man.divorced       nsengimana    
##  Min.   : 0.000   Min.   : 0.0000   Min.   : 0.0000   Min.   :0.0000  
##  1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:0.0000  
##  Median : 0.000   Median : 0.0000   Median : 0.0000   Median :0.0000  
##  Mean   : 1.468   Mean   : 0.3807   Mean   : 0.3367   Mean   :0.3603  
##  3rd Qu.: 1.000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.:0.0000  
##  Max.   :30.000   Max.   :30.0000   Max.   :20.0000   Max.   :8.0000  
##  NA's   :1                                                            
##    murekatete         twahirwa        mukandekezi      nsabimana      
##  Min.   : 0.0000   Min.   : 0.0000   Min.   :0.000   Min.   : 0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:0.000   1st Qu.: 0.0000  
##  Median : 0.0000   Median : 0.0000   Median :0.000   Median : 0.0000  
##  Mean   : 0.3425   Mean   : 0.2394   Mean   :0.165   Mean   : 0.4705  
##  3rd Qu.: 1.0000   3rd Qu.: 0.0000   3rd Qu.:0.000   3rd Qu.: 1.0000  
##  Max.   :12.0000   Max.   :10.0000   Max.   :7.000   Max.   :20.0000  
##                                                                       
##     mukamana         ndayambaje        nyiraneza          bizimana      
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000  
##  Mean   : 0.4144   Mean   : 0.3296   Mean   : 0.2685   Mean   : 0.4331  
##  3rd Qu.: 1.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 1.0000  
##  Max.   :15.0000   Max.   :30.0000   Max.   :10.0000   Max.   :12.0000  
##                                                                         
##  nyirahabimana     ndagijimana      mukandayisenga   
##  Min.   : 0.000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median : 0.000   Median : 0.0000   Median : 0.0000  
##  Mean   : 0.261   Mean   : 0.3279   Mean   : 0.2577  
##  3rd Qu.: 0.000   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
##  Max.   :17.000   Max.   :10.0000   Max.   :20.0000  
##

If you look at the help page for topcode.data, you'll see that it can also handle situations where the variables can take on special codes for missing values, refusals, and so forth.

Estimating network sizes

Now that we have finished preparing the data, we turn to esimating the sizes of each respondent's personal network. To do this using the known population estimator, we use the kp.degree.estimator function:

d.hat <- kp.degree.estimator(survey.data=example.survey,
                             known.popns=kp.vec,
                             total.popn.size=tot.pop.size,
                             missing="complete.obs")

## Warning in kp.degree.estimator(survey.data = example.survey, known.popns =
## kp.vec, : kp.degree.estimator will be deprecated soon!

summary(d.hat)

##        V1         
##  Min.   :   0.00  
##  1st Qu.:  25.28  
##  Median :  58.99  
##  Mean   : 101.22  
##  3rd Qu.: 126.42  
##  Max.   :1904.69

We can examine the results with a histogram

library(ggplot2) # we'll use qplot from ggplot2 for plots
theme_set(theme_minimal())

qplot(d.hat, binwidth=25)

plot of chunk unnamed-chunk-10

Now let's append the degree estimates to the survey reports dataframe:

example.survey$d.hat <- d.hat

Missing data in known population questions

TODO

Estimating hidden population size

Now that you have estimated degrees, you can use them to produce estimates of the size of the hidden population. Here, we'll take the example of injecting drug users, idu

idu.est <- nsum.estimator(survey.data=example.survey,
                          d.hat.vals=d.hat,
                          total.popn.size=tot.pop.size,
                          y.vals="idu",
                          missing="complete.obs")

Note that we had to specify that we should use only rows in our dataset with no missing values through the missing = "complete.obs" option, and also that we had to pass in the total population size using the total.popn.size option. The resulting estimate is

idu.est

## $estimate
## [1] 1067.656
## 
## $tot.connections
## [1] 26
## 
## $sum.d.hat
## [1] 243524.2

This returns the estimate, and also the numerator and denominator used to compute it.

Variance estimation

In order to estimate the sampling uncertainty of our estimated totals, we can use the rescaled bootstrap technique; see Feehan and Salganik 2016 for more about the rescaled boostrap and how it can be applied to the network scale-up method. In order to use the rescaled boostrap, you need to be able to specify the sampling design of your study. In particular, you need to be able to describe the stratifcation (if any) and the primary sampling units used in the study.

idu.est <- bootstrap.estimates(## this describes the sampling design of the
                               ## survey; here, the PSUs are given by the
                               ## variable cluster, and the strata are given
                               ## by the variable region
                               survey.design = ~ cluster + strata(region),
                               ## the number of bootstrap resamples to obtain
                               ## (NOTE: in practice, you should use more than 100.
                               ##  this keeps building the package relatively fast)
                               num.reps=100,
                               ## this is the name of the function
                               ## we want to use to produce an estimate
                               ## from each bootstrapped dataset
                               estimator.fn="nsum.estimator",
                               ## these are the sampling weights
                               weights="indweight",
                               ## this is the name of the type of bootstrap
                               ## we wish to use
                               bootstrap.fn="rescaled.bootstrap.sample",
                               ## our dataset
                               survey.data=example.survey,
                               ## other parameters we need to pass
                               ## to the nsum.estimator function
                               d.hat.vals=d.hat,
                               total.popn.size=tot.pop.size,
                               y.vals="idu",
                               missing="complete.obs")

By default, bootstrap.estimates produces a list with num.reps entries; each entry is the result of calling the estimator function on one bootstrap resample.

Next, you can write a bit of code that will help us put all of these results together, for plotting and summarizing

library(plyr)
## combine the estimates together in one data frame
## (bootstrap.estimates gives us a list)
all.idu.estimates <- ldply(idu.est,
                           function(x) { data.frame(estimate=x$estimate) })

We can examine the summarized results with a histogram or with summarize.

## look at a histogram of the results
qplot(all.idu.estimates$estimate, binwidth=50)

plot of chunk unnamed-chunk-16

## summarize the results
summary(all.idu.estimates$estimate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   155.4   395.5   542.5   593.5   715.8  1615.9

To produce 95% intervals using the percentile method you can do something like this

quantile(all.idu.estimates$estimate, probs=c(0.025, 0.975))

##      2.5%     97.5% 
##  190.3353 1199.1265

Internal valididty checks

If you want to run internal validation checks (see e.g. Salganik et al., 2011, Fig 3), you can use the nsum.internal.validation function. We specify that we wish to use only complete observations (ie, we will remove rows that have any missing values from our calculations).

iv.result <- nsum.internal.validation(survey.data=example.survey,
                                      known.popns=kp.vec,
                                      missing="complete.obs",
                                      killworth.se=TRUE,
                                      total.popn.size=tot.pop.size,
                                      kp.method=TRUE,
                                      return.plot=TRUE)

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

## Warning in kp.degree.estimator(survey.data = survey.data, known.popns =
## kp.minus, : kp.degree.estimator will be deprecated soon!

Now iv.result is a list that has a summary of the results in the entry results

iv.result$results

##                     name nsum.holdout.est known.size d.hat.sum
## 1                widower         58707.75      36147  238980.4
## 2        nurse.or.doctor         51873.04       7807  234804.1
## 3  male.community.health         66955.19      22000  234634.5
## 4                teacher        128261.75      47745  228049.3
## 5            woman.smoke         93113.92     119438  249049.8
## 6                 priest         14830.56       1004  240719.2
## 7       woman.gave.birth        173202.90     256164  261831.6
## 8                 muslim        137934.15     195449  255929.3
## 9           incarcerated         36619.52      68000  250139.8
## 10          man.divorced         32758.68      50698  247262.7
## 11            nsengimana         35697.29      32528  242875.6
## 12            murekatete         33933.49      30531  242828.0
## 13              twahirwa         23921.82      10420  240784.3
## 14           mukandekezi         16382.83      10520  242326.9
## 15             nsabimana         46399.42      48560  243968.6
## 16              mukamana         40565.23      51449  245777.0
## 17            ndayambaje         32841.14      22724  241465.4
## 18             nyiraneza         26637.40      21705  242516.1
## 19              bizimana         42948.88      38497  242614.0
## 20         nyirahabimana         25411.65      42727  247130.7
## 21           ndagijimana         32259.55      37375  244578.7
## 22        mukandayisenga         25249.07      35055  245553.6
##    killworth.se killworth.se.wgtdenom        err    abserr      sqerr
## 1     1562.7448             1562.7448  22560.754 22560.754  508987620
## 2     1482.4796             1482.4796  44066.040 44066.040 1941815910
## 3     1683.5939             1683.5939  44955.188 44955.188 2020968903
## 4     2356.3024             2356.3024  80516.747 80516.747 6482946587
## 5     1924.5661             1924.5661 -26324.077 26324.077  692957033
## 6      784.3342              784.3342  13826.559 13826.559  191173736
## 7     2549.6027             2549.6027 -82961.098 82961.098 6882543711
## 8     2305.5174             2305.5174 -57514.848 57514.848 3307957792
## 9     1207.7257             1207.7257 -31380.477 31380.477  984734336
## 10    1149.1360             1149.1360 -17939.316 17939.316  321819057
## 11    1210.1778             1210.1778   3169.293  3169.293   10044418
## 12    1180.1217             1180.1217   3402.489  3402.489   11576935
## 13     995.5497              995.5497  13501.822 13501.822  182299200
## 14     821.5568              821.5568   5862.829  5862.829   34372766
## 15    1375.8765             1375.8765  -2160.579  2160.579    4668100
## 16    1282.1053             1282.1053 -10883.772 10883.772  118456503
## 17    1164.3067             1164.3067  10117.144 10117.144  102356610
## 18    1046.6378             1046.6378   4932.402  4932.402   24328592
## 19    1327.6492             1327.6492   4451.879  4451.879   19819224
## 20    1012.7461             1012.7461 -17315.346 17315.346  299821223
## 21    1146.6165             1146.6165  -5115.449  5115.449   26167818
## 22    1012.7460             1012.7460  -9805.932  9805.932   96156302
##         relerr
## 1   0.62413904
## 2   5.64442684
## 3   2.04341762
## 4   1.68639119
## 5   0.22039951
## 6  13.77147320
## 7   0.32385932
## 8   0.29427036
## 9   0.46147760
## 10  0.35384662
## 11  0.09743276
## 12  0.11144376
## 13  1.29576028
## 14  0.55730315
## 15  0.04449297
## 16  0.21154488
## 17  0.44521846
## 18  0.22724728
## 19  0.11564223
## 20  0.40525538
## 21  0.13686820
## 22  0.27972991

Since we passed the argument return.plot=TRUE to the function, we also get a plot:

print(iv.result$plot)

plot of chunk unnamed-chunk-20

This plot is a ggplot2 object, so we can customize it if we want. As a very simple example, we can change the title:

print(iv.result$plot + ggtitle("internal validation checks"))

plot of chunk unnamed-chunk-21

The ggplot2 website has more information on modifying ggplot2 objects.

Attaching known population totals to the dataframe

Several of the functions we demonstrated above required us to pass in the vector containing the known population sizes and also the size of the total population. We can avoid this step by attaching these two pieces of information to the survey dataframe using the add.kp function:

example.survey <- add.kp(example.survey, kp.vec, tot.pop.size)

d.hat.new <- kp.degree.estimator(survey.data=example.survey,
                                 # we don't need this anymore, since we
                                 # them to survey.data's attributes using add.kp
                                 #known.popns=kp.vec,
                                 #total.popn.size=tot.pop.size,
                                 missing="complete.obs")

## Warning in kp.degree.estimator(survey.data = example.survey, missing =
## "complete.obs"): kp.degree.estimator will be deprecated soon!

summary(d.hat.new)

##        V1         
##  Min.   :   0.00  
##  1st Qu.:  25.28  
##  Median :  58.99  
##  Mean   : 101.22  
##  3rd Qu.: 126.42  
##  Max.   :1904.69

This is exactly the same result we obtained before.