Introduction to danstat

Introduction

This vignette introduces the danstat package by means of a simple use case to illustrate the data discovery process. The package is designed to give easy and intuitive access to the Danmarks Statistik (Statistics Denmark) Statistikbank API which allows access to all published data in StatBank. As a public institution, Statistics Denmark provides open access to the data, so the API doesn’t require authentication and has no rate limits. The data is free to re-use and reproduce even for commercial use, however, source accreditation to Statistics Denmark should be given. See terms for the Use and reuse of data.

library(danstat)
library(purrr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(kableExtra)
#> 
#> Attaching package: 'kableExtra'
#> The following object is masked from 'package:dplyr':
#> 
#>     group_rows

Exploring a subject

In this example we would like to examine alcohol related incidents and the consumption of alcohol in Denmark. An inital call to get_subjects() returns the following data frame:

get_subjects()
#>    id                               description active hasSubjects subjects
#> 1  02                  Population and elections   TRUE        TRUE     NULL
#> 2  05                         Living conditions   TRUE        TRUE     NULL
#> 3  03                   Education and knowledge   TRUE        TRUE     NULL
#> 4  18               Culture and National Church   TRUE        TRUE     NULL
#> 5  04                 Labour, income and wealth   TRUE        TRUE     NULL
#> 6  06                    Prices and consumption   TRUE        TRUE     NULL
#> 7  14 National accounts and government finances   TRUE        TRUE     NULL
#> 8  16                   Money and credit market   TRUE        TRUE     NULL
#> 9  13                          External economy   TRUE        TRUE     NULL
#> 10 07                Business sector in general   TRUE        TRUE     NULL
#> 11 11                          Business sectors   TRUE        TRUE     NULL
#> 12 01         Geography, environment and energy   TRUE        TRUE     NULL
#> 13 19                                     Other   TRUE        TRUE     NULL

The subjects of interest for us are “Living conditions” and “Prices and consumption”. We look into more details into these to subjects:

subj <- get_subjects(subjects = c("05","06"))
subsubjects <- subj$subjects %>% bind_rows()
subsubjects
#>      id                       description active hasSubjects subjects
#> 1  2585                   Gender equality   TRUE        TRUE     NULL
#> 2  2557                   Quality of life   TRUE        TRUE     NULL
#> 3  2477                Welfare indicators   TRUE        TRUE     NULL
#> 4  2415                           Housing   TRUE        TRUE     NULL
#> 5  2412                            Health   TRUE        TRUE     NULL
#> 6  2555                         Childcare   TRUE        TRUE     NULL
#> 7  2403 Persons receiving public benefits   TRUE        TRUE     NULL
#> 8  2404                 Social conditions   TRUE        TRUE     NULL
#> 9  2413                 Traffic accidents   TRUE        TRUE     NULL
#> 10 2414                 Criminal offences   TRUE        TRUE     NULL
#> 11 2436              Consumer price index   TRUE        TRUE     NULL
#> 12 2437                   Business prices   TRUE        TRUE     NULL
#> 13 2438                       Consumption   TRUE        TRUE     NULL
#> 14 2439                     Real property   TRUE        TRUE     NULL
#> 15 2440                              Cars   TRUE        TRUE     NULL
#> 16 2476                      Cost indices   TRUE        TRUE     NULL

Getting table information and variables

Now we would like to see what data tables are available for subjects “Traffic accidents” and “Consumption”:

tables <- get_tables(subjects = c("2413", "2438")) 
tables %>% 
  select(id, text, variables) %>% 
  kable()
id text variables
UHELD3 Road traffic accidents c(“type of accident”, “accident situation”, “urban area”, “speed limit”, “time”)
UHELD4 Road traffic accidents c(“type of accident”, “means of transport”, “hour”, “day of the week”, “month”, “time”)
UHELD5 Road traffic accidents c(“type of accident”, “accident situation”, “means of transport”, “other units involved”, “time”)
UHELD6 Road traffic accidents c(“type of accident”, “accident situation”, “means of transport”, “number of transport units involved”, “time”)
UHELDK7 Road traffic accidents c(“type of accident”, “region”, “urban area”, “accident situation”, “time”)
UHELD8 Injured and killed in road traffic accidents c(“type of accident”, “casualty”, “means of transport”, “sex”, “age”, “type of injury”, “time”)
UHELD9 Injured and killed in road traffic accidents c(“type of accident”, “casualty”, “means of transport”, “type of person”, “urban area”, “type of road”, “time”)
UHELD10 Injured and killed in road traffic accidents c(“type of accident”, “accident situation”, “casualty”, “means of transport”, “other units involved”, “time”)
UHELD11 Injured and killed in road traffic accidents c(“type of accident”, “casualty”, “means of transport”, “position”, “type of person”, “influenced by alcohol”, “time”)
UHELDK1 Injured and killed in road traffic accidents c(“region”, “casualty”, “motor vehicles involved”, “age”, “sex”, “time”)
UHELDK2 Injured and killed in alcohol accidents c(“region”, “casualty”, “age”, “sex”, “time”)
UHELD12 Drivers and pedestrians c(“means of transport”, “sex”, “age”, “age of driving license”, “level of alcohol in blood”, “time”)
MOERKE Injured in road traffic accidents reported by the police and casualty wards c(“reporter”, “accident situation”, “means of transport”, “sex”, “age”, “type of injury”, “time”)
BANE91 Fatalities and injuries in railway traffic accidents c(“railway system”, “category of person”, “casualty”, “time”)
BANE92 Fatalities and injuries in railway traffic accidents c(“railway system”, “type of accident”, “casualty”, “time”)
SKIB92 Accidents at sea involving Danish vessels c(“type of accident”, “extent”, “time”)
SKIB93 Accidents and fatalities onboard Danish vessels Danish ships c(“type of vessel”, “accidents”, “time”)
SKIB94 Accidents at sea in Danish sea territory c(“type of accident”, “waters”, “extent”, “time”)
FU01 Household budget suvevey c(“background information”, “group of households”, “time”)
FU02 Household average consumption c(“group of consumption”, “price unit”, “time”)
FU03 Consumption c(“group of consumption”, “households”, “price unit”, “time”)
FU04 Consumption c(“group of consumption”, “socioeconomic status”, “price unit”, “time”)
FU05 Consumption c(“group of consumption”, “total income”, “price unit”, “time”)
FU06 Consumption c(“group of consumption”, “type of dwelling”, “price unit”, “time”)
FU07 Consumption c(“group of consumption”, “region”, “price unit”, “time”)
FU08 Consumption c(“group of consumption”, “age”, “price unit”, “time”)
FU09 Average household income and consumption c(“type of income”, “time”)
ALKO2 Consumption and sales of alcohol and tobacco, subject to excises duties, by pop. c(“type”, “time”)
ALKO3 Consumption and sales of alcohol, subject to excises duties c(“type”, “time”)
ALKO4 Consumption and sales of alcohol and tobacco, subject to excises duties c(“type”, “time”)
VARFORBR The families possession of home appliances c(“type of consumption”, “time”)
FORV1 Consumer confidence indicator c(“indicator”, “time”)

If we would like to examine number of traffic accidents by hour of day (which is an interesing question in itself) we can choose table “UHELD4”. To extract sales and consumption of alcohol, we will use table “ALKO3”. We can now look into the variable codes and values for these 2 tables and determine if we should filter any values.

vars_acc <- get_table_metadata(table_id = "uheld4", variables_only = TRUE)
vars_alco <- get_table_metadata(table_id = "alko3", variables_only = TRUE)

vars_acc %>% 
  select(id, text)
#>         id               text
#> 1   UHELDA   type of accident
#> 2 TRANSMID means of transport
#> 3     KLOK               hour
#> 4      UGE    day of the week
#> 5      MND              month
#> 6      Tid               time

vars_alco %>% 
  select(id, text)
#>     id text
#> 1 TYPE type
#> 2  Tid time

Let us look a bit into the values for the “UHELDA” (type of accident) variable from table “UHELD4” and the “TYPE” variable from table “ALKO3” to determine what the possible values of these are.

vars_acc$values[1] %>% 
  kable()
id text
0000 All accidents
1000 Accidents involving persons influenced by alcohol
2000 Others accidents

vars_alco$values[1] %>% 
  kable()
id text
09 Sales of lager equivalents, total excl. tax free beer (index 2000=100) (2000-)
010 Sales of wine, total (indeks 2000=100)
011 Sales of spirits, total (indeks 2000=100)
0111 Sales of alcopops, total (indeks 2006=100)(2006-)
055 Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-
060 Consumption of wine, total (index 2000=100) (2000-)
065 Consumption of spirits, total (index 2000=100) (2000-)
070 Consumption of alcopops, total (index 2006=100) (2006-)

To compare the diurnal patterns of alcohol-related accidents vs. other accidents we will select values c(1000, 2000) from the “UHELDA” variable and values c("09", "055") (sales and consumption of lager equivalents) from the “TYPE” variable. As we would like data for all time periods (“Tid”) we will leave the values for that variable as NA.

Pulling data from the API

We can now pull the needed data from the API - first the accident numbers:

variable_codes <- vars_acc$id[c(1, 3, 6)] # UHELDA, KLOK and Tid
variable_values <- list(c(1000, 2000), NA, NA) # all values for KLOK and Tid

# Construct the variable_input as a list of code-values pairs
variable_input <- purrr::map2(.x = variable_codes, .y = variable_values, .f = ~list(code = .x, values = .y))

# Get data 
accidents <- get_data("uheld4", variables = variable_input)
head(accidents) %>% kable()
UHELDA KLOK TID INDHOLD
Accidents involving persons influenced by alcohol 00-01 1997 94
Accidents involving persons influenced by alcohol 00-01 1998 85
Accidents involving persons influenced by alcohol 00-01 1999 81
Accidents involving persons influenced by alcohol 00-01 2000 87
Accidents involving persons influenced by alcohol 00-01 2001 77
Accidents involving persons influenced by alcohol 00-01 2002 91

and then the alcohol sales and consumption values:

variable_codes <- vars_alco$id 
variable_values <- list(c("055", "09"), NA) # All values for Tid

# Construct the variable_input as a list of code-values pairs
variable_input <- purrr::map2(.x = variable_codes, .y = variable_values, .f = ~list(code = .x, values = .y))

# Get data 
alcohol <- get_data("alko3", variables = variable_input)
alcohol %>% 
  filter(INDHOLD != "..") %>% # the API returns ".." as missing values
  head() %>% 
  kable()
TYPE TID INDHOLD
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- 2000 100
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- 2001 97
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- 2002 96
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- 2003 98
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- 2004 93
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- 2005 93

Analysis 1: Duirnal patterns of traffic accidents

An interesting analysis would be to see at which time of day most accidents occur:

accidents_by_hour <- accidents %>%
    filter(KLOK != "Not stated") %>%
    group_by(UHELDA, KLOK) %>%
    summarise(mean_accidents = mean(INDHOLD, na.rm = TRUE))

accidents_by_hour %>%
    ggplot(aes(x = KLOK, y = mean_accidents, color = UHELDA, group = UHELDA)) +
    geom_line() +
    geom_point() +
    theme_bw() + 
  theme(legend.position="top") +
  labs(x = "Time of day", y = "Average annual accidents")

A pattern is quite clear: non-alcohol related accidents peak during commute hours in the morning and the afternoon, while alcohol related accidents are highest and almost equally likely between 6pm and 3am. It is important to note here that the values on the y-axis are total annual accidents for the average year between 1997 and 2018. So a value of e.g. 50, means that of all accidents in an average year 50, occurred in a given hour (summed over all days in the year) - so these are not average hourly accidents.