Introduction to danstat

Introduction

This vignette introduces the danstat package by means of a simple use case to illustrate the data discovery process. The package is designed to give easy and intuitive access to the Danmarks Statistik (Statistics Denmark) Statistikbank API which allows access to all published data in StatBank. As a public institution, Statistics Denmark provides open access to the data, so the API doesn’t require authentication and has no rate limits. The data is free to re-use and reproduce even for commercial use, however, source accreditation to Statistics Denmark should be given. See terms for the Use and reuse of data.

library(danstat)
library(purrr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(kableExtra)
#> 
#> Attaching package: 'kableExtra'
#> The following object is masked from 'package:dplyr':
#> 
#>     group_rows

Exploring a subject

In this example we would like to examine alcohol related incidents and the consumption of alcohol in Denmark. An inital call to get_subjects() returns the following data frame:

get_subjects()
#>    id                               description active hasSubjects subjects
#> 1  02                  Population and elections   TRUE        TRUE     NULL
#> 2  05                         Living conditions   TRUE        TRUE     NULL
#> 3  03                   Education and knowledge   TRUE        TRUE     NULL
#> 4  18               Culture and National Church   TRUE        TRUE     NULL
#> 5  04                 Labour, income and wealth   TRUE        TRUE     NULL
#> 6  06                    Prices and consumption   TRUE        TRUE     NULL
#> 7  14 National accounts and government finances   TRUE        TRUE     NULL
#> 8  16                   Money and credit market   TRUE        TRUE     NULL
#> 9  13                          External economy   TRUE        TRUE     NULL
#> 10 07                Business sector in general   TRUE        TRUE     NULL
#> 11 11                          Business sectors   TRUE        TRUE     NULL
#> 12 01         Geography, environment and energy   TRUE        TRUE     NULL
#> 13 19                                     Other   TRUE        TRUE     NULL

The subjects of interest for us are “Living conditions” and “Prices and consumption”. We look into more details into these to subjects:

subj <- get_subjects(subjects = c("05","06"))
subsubjects <- subj$subjects %>% bind_rows()
subsubjects
#>      id                       description active hasSubjects subjects
#> 1  2585                   Gender equality   TRUE        TRUE     NULL
#> 2  2557                   Quality of life   TRUE        TRUE     NULL
#> 3  2477                Welfare indicators   TRUE        TRUE     NULL
#> 4  2415                           Housing   TRUE        TRUE     NULL
#> 5  2412                            Health   TRUE        TRUE     NULL
#> 6  2555                         Childcare   TRUE        TRUE     NULL
#> 7  2403 Persons receiving public benefits   TRUE        TRUE     NULL
#> 8  2404                 Social conditions   TRUE        TRUE     NULL
#> 9  2413                 Traffic accidents   TRUE        TRUE     NULL
#> 10 2414                 Criminal offences   TRUE        TRUE     NULL
#> 11 2436              Consumer price index   TRUE        TRUE     NULL
#> 12 2437                   Business prices   TRUE        TRUE     NULL
#> 13 2438                       Consumption   TRUE        TRUE     NULL
#> 14 2439                     Real property   TRUE        TRUE     NULL
#> 15 2440                              Cars   TRUE        TRUE     NULL
#> 16 2476                      Cost indices   TRUE        TRUE     NULL

Getting table information and variables

Now we would like to see what data tables are available for subjects “Traffic accidents” and “Consumption”:

tables <- get_tables(subjects = c("2413", "2438")) 
tables %>% 
  select(id, text, variables) %>% 
  kable()

id	text	variables
UHELD3	Road traffic accidents	c(“type of accident”, “accident situation”, “urban area”, “speed limit”, “time”)
UHELD4	Road traffic accidents	c(“type of accident”, “means of transport”, “hour”, “day of the week”, “month”, “time”)
UHELD5	Road traffic accidents	c(“type of accident”, “accident situation”, “means of transport”, “other units involved”, “time”)
UHELD6	Road traffic accidents	c(“type of accident”, “accident situation”, “means of transport”, “number of transport units involved”, “time”)
UHELDK7	Road traffic accidents	c(“type of accident”, “region”, “urban area”, “accident situation”, “time”)
UHELD8	Injured and killed in road traffic accidents	c(“type of accident”, “casualty”, “means of transport”, “sex”, “age”, “type of injury”, “time”)
UHELD9	Injured and killed in road traffic accidents	c(“type of accident”, “casualty”, “means of transport”, “type of person”, “urban area”, “type of road”, “time”)
UHELD10	Injured and killed in road traffic accidents	c(“type of accident”, “accident situation”, “casualty”, “means of transport”, “other units involved”, “time”)
UHELD11	Injured and killed in road traffic accidents	c(“type of accident”, “casualty”, “means of transport”, “position”, “type of person”, “influenced by alcohol”, “time”)
UHELDK1	Injured and killed in road traffic accidents	c(“region”, “casualty”, “motor vehicles involved”, “age”, “sex”, “time”)
UHELDK2	Injured and killed in alcohol accidents	c(“region”, “casualty”, “age”, “sex”, “time”)
UHELD12	Drivers and pedestrians	c(“means of transport”, “sex”, “age”, “age of driving license”, “level of alcohol in blood”, “time”)
MOERKE	Injured in road traffic accidents reported by the police and casualty wards	c(“reporter”, “accident situation”, “means of transport”, “sex”, “age”, “type of injury”, “time”)
BANE91	Fatalities and injuries in railway traffic accidents	c(“railway system”, “category of person”, “casualty”, “time”)
BANE92	Fatalities and injuries in railway traffic accidents	c(“railway system”, “type of accident”, “casualty”, “time”)
SKIB92	Accidents at sea involving Danish vessels	c(“type of accident”, “extent”, “time”)
SKIB93	Accidents and fatalities onboard Danish vessels Danish ships	c(“type of vessel”, “accidents”, “time”)
SKIB94	Accidents at sea in Danish sea territory	c(“type of accident”, “waters”, “extent”, “time”)
FU01	Household budget suvevey	c(“background information”, “group of households”, “time”)
FU02	Household average consumption	c(“group of consumption”, “price unit”, “time”)
FU03	Consumption	c(“group of consumption”, “households”, “price unit”, “time”)
FU04	Consumption	c(“group of consumption”, “socioeconomic status”, “price unit”, “time”)
FU05	Consumption	c(“group of consumption”, “total income”, “price unit”, “time”)
FU06	Consumption	c(“group of consumption”, “type of dwelling”, “price unit”, “time”)
FU07	Consumption	c(“group of consumption”, “region”, “price unit”, “time”)
FU08	Consumption	c(“group of consumption”, “age”, “price unit”, “time”)
FU09	Average household income and consumption	c(“type of income”, “time”)
ALKO2	Consumption and sales of alcohol and tobacco, subject to excises duties, by pop.	c(“type”, “time”)
ALKO3	Consumption and sales of alcohol, subject to excises duties	c(“type”, “time”)
ALKO4	Consumption and sales of alcohol and tobacco, subject to excises duties	c(“type”, “time”)
VARFORBR	The families possession of home appliances	c(“type of consumption”, “time”)
FORV1	Consumer confidence indicator	c(“indicator”, “time”)

If we would like to examine number of traffic accidents by hour of day (which is an interesing question in itself) we can choose table “UHELD4”. To extract sales and consumption of alcohol, we will use table “ALKO3”. We can now look into the variable codes and values for these 2 tables and determine if we should filter any values.

vars_acc <- get_table_metadata(table_id = "uheld4", variables_only = TRUE)
vars_alco <- get_table_metadata(table_id = "alko3", variables_only = TRUE)

vars_acc %>% 
  select(id, text)
#>         id               text
#> 1   UHELDA   type of accident
#> 2 TRANSMID means of transport
#> 3     KLOK               hour
#> 4      UGE    day of the week
#> 5      MND              month
#> 6      Tid               time

vars_alco %>% 
  select(id, text)
#>     id text
#> 1 TYPE type
#> 2  Tid time

Let us look a bit into the values for the “UHELDA” (type of accident) variable from table “UHELD4” and the “TYPE” variable from table “ALKO3” to determine what the possible values of these are.

vars_acc$values[1] %>% 
  kable()

id	text
0000	All accidents
1000	Accidents involving persons influenced by alcohol
2000	Others accidents


vars_alco$values[1] %>% 
  kable()

id	text
09	Sales of lager equivalents, total excl. tax free beer (index 2000=100) (2000-)
010	Sales of wine, total (indeks 2000=100)
011	Sales of spirits, total (indeks 2000=100)
0111	Sales of alcopops, total (indeks 2006=100)(2006-)
055	Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-
060	Consumption of wine, total (index 2000=100) (2000-)
065	Consumption of spirits, total (index 2000=100) (2000-)
070	Consumption of alcopops, total (index 2006=100) (2006-)

To compare the diurnal patterns of alcohol-related accidents vs. other accidents we will select values c(1000, 2000) from the “UHELDA” variable and values c("09", "055") (sales and consumption of lager equivalents) from the “TYPE” variable. As we would like data for all time periods (“Tid”) we will leave the values for that variable as NA.

Pulling data from the API

We can now pull the needed data from the API - first the accident numbers:

variable_codes <- vars_acc$id[c(1, 3, 6)] # UHELDA, KLOK and Tid
variable_values <- list(c(1000, 2000), NA, NA) # all values for KLOK and Tid

# Construct the variable_input as a list of code-values pairs
variable_input <- purrr::map2(.x = variable_codes, .y = variable_values, .f = ~list(code = .x, values = .y))

# Get data 
accidents <- get_data("uheld4", variables = variable_input)
head(accidents) %>% kable()

UHELDA	KLOK	TID	INDHOLD
Accidents involving persons influenced by alcohol	00-01	1997	94
Accidents involving persons influenced by alcohol	00-01	1998	85
Accidents involving persons influenced by alcohol	00-01	1999	81
Accidents involving persons influenced by alcohol	00-01	2000	87
Accidents involving persons influenced by alcohol	00-01	2001	77
Accidents involving persons influenced by alcohol	00-01	2002	91

and then the alcohol sales and consumption values:

variable_codes <- vars_alco$id 
variable_values <- list(c("055", "09"), NA) # All values for Tid

# Construct the variable_input as a list of code-values pairs
variable_input <- purrr::map2(.x = variable_codes, .y = variable_values, .f = ~list(code = .x, values = .y))

# Get data 
alcohol <- get_data("alko3", variables = variable_input)
alcohol %>% 
  filter(INDHOLD != "..") %>% # the API returns ".." as missing values
  head() %>% 
  kable()

TYPE	TID	INDHOLD
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-	2000	100
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-	2001	97
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-	2002	96
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-	2003	98
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-	2004	93
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-	2005	93

Analysis 1: Duirnal patterns of traffic accidents

An interesting analysis would be to see at which time of day most accidents occur:

accidents_by_hour <- accidents %>%
    filter(KLOK != "Not stated") %>%
    group_by(UHELDA, KLOK) %>%
    summarise(mean_accidents = mean(INDHOLD, na.rm = TRUE))

accidents_by_hour %>%
    ggplot(aes(x = KLOK, y = mean_accidents, color = UHELDA, group = UHELDA)) +
    geom_line() +
    geom_point() +
    theme_bw() + 
  theme(legend.position="top") +
  labs(x = "Time of day", y = "Average annual accidents")

A pattern is quite clear: non-alcohol related accidents peak during commute hours in the morning and the afternoon, while alcohol related accidents are highest and almost equally likely between 6pm and 3am. It is important to note here that the values on the y-axis are total annual accidents for the average year between 1997 and 2018. So a value of e.g. 50, means that of all accidents in an average year 50, occurred in a given hour (summed over all days in the year) - so these are not average hourly accidents.

Analysis 2: Alcohol sales and alcohol-related accidents

First, we summarize the accidents dataset and check the length of the series in both datasets

accidents_by_year <- accidents %>% 
  group_by(UHELDA, TID) %>% 
  summarize(INDHOLD = sum(INDHOLD)) %>% 
  ungroup()

accidents_by_year %>%
  group_by(UHELDA) %>% 
  summarise(min(TID), max(TID))
#> # A tibble: 2 x 3
#>   UHELDA                                            `min(TID)` `max(TID)`
#>   <chr>                                                  <dbl>      <dbl>
#> 1 Accidents involving persons influenced by alcohol       1997       2018
#> 2 Others accidents                                        1997       2018

alcohol_by_year <- alcohol %>% 
  filter(INDHOLD != "..") %>% # the API returns ".." as missing values
  mutate(INDHOLD = as.numeric(INDHOLD))

alcohol_by_year %>%
  group_by(TYPE) %>% 
  summarise(min(TID), max(TID))
#> # A tibble: 2 x 3
#>   TYPE                                                     `min(TID)` `max(TID)`
#>   <chr>                                                         <dbl>      <dbl>
#> 1 Consumption of lager equivalents, total, excl. tax free…       2000       2010
#> 2 Sales of lager equivalents, total excl. tax free beer (…       1955       2018

Since alcohol consumption is only available up to 2010, we will use alcohol sales as a proxy. Also, as the alcohol consumption is indexed with year 2000 as index 100, we transoform the accidents values in the same way.


alcohol_data <- alcohol_by_year %>% 
  filter(between(TID, 1997, 2018),
         grepl("sales", TYPE, ignore.case = TRUE)) %>% 
  select(year = TID,
         alcohol_sales = INDHOLD)

accidents_data <- accidents_by_year %>% 
  filter(grepl("alcohol", UHELDA, ignore.case = TRUE)) %>% 
  select(year = TID,
         alcohol_accidents = INDHOLD)

accidents_2000 <- accidents_data %>% 
  filter(year == 2000) %>% 
  pull(alcohol_accidents)
  
accidents_data$alcohol_accidents <- round((accidents_data$alcohol_accidents/accidents_2000)*100)

data <- inner_join(alcohol_data, accidents_data, by = "year") %>% 
  mutate(risk_index = round(alcohol_accidents/alcohol_sales*100))

data %>% 
  ggplot(aes(x=year, y=risk_index)) +
  geom_col()

As this vignette does not by any means aspire to the standards of rigourous academic research, we can only say here that there seems to be an indication that the risk of alcohol-related accidents has dropped substantially since 2000. Sales of alcohol might be a poor proxy for consumption, patterns of cross-border purchase of alcohol may have changed over the past 20 years, cars have definitely become safer over time, etc. Disentangling these effects is well beyond the scope of this article.