This vignette introduces the danstat
package by means of a simple use case to illustrate the data discovery process. The package is designed to give easy and intuitive access to the Danmarks Statistik (Statistics Denmark) Statistikbank API which allows access to all published data in StatBank. As a public institution, Statistics Denmark provides open access to the data, so the API doesn’t require authentication and has no rate limits. The data is free to re-use and reproduce even for commercial use, however, source accreditation to Statistics Denmark should be given. See terms for the Use and reuse of data.
library(danstat)
library(purrr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(kableExtra)
#>
#> Attaching package: 'kableExtra'
#> The following object is masked from 'package:dplyr':
#>
#> group_rows
In this example we would like to examine alcohol related incidents and the consumption of alcohol in Denmark. An inital call to get_subjects()
returns the following data frame:
get_subjects()
#> id description active hasSubjects subjects
#> 1 02 Population and elections TRUE TRUE NULL
#> 2 05 Living conditions TRUE TRUE NULL
#> 3 03 Education and knowledge TRUE TRUE NULL
#> 4 18 Culture and National Church TRUE TRUE NULL
#> 5 04 Labour, income and wealth TRUE TRUE NULL
#> 6 06 Prices and consumption TRUE TRUE NULL
#> 7 14 National accounts and government finances TRUE TRUE NULL
#> 8 16 Money and credit market TRUE TRUE NULL
#> 9 13 External economy TRUE TRUE NULL
#> 10 07 Business sector in general TRUE TRUE NULL
#> 11 11 Business sectors TRUE TRUE NULL
#> 12 01 Geography, environment and energy TRUE TRUE NULL
#> 13 19 Other TRUE TRUE NULL
The subjects of interest for us are “Living conditions” and “Prices and consumption”. We look into more details into these to subjects:
subj <- get_subjects(subjects = c("05","06"))
subsubjects <- subj$subjects %>% bind_rows()
subsubjects
#> id description active hasSubjects subjects
#> 1 2585 Gender equality TRUE TRUE NULL
#> 2 2557 Quality of life TRUE TRUE NULL
#> 3 2477 Welfare indicators TRUE TRUE NULL
#> 4 2415 Housing TRUE TRUE NULL
#> 5 2412 Health TRUE TRUE NULL
#> 6 2555 Childcare TRUE TRUE NULL
#> 7 2403 Persons receiving public benefits TRUE TRUE NULL
#> 8 2404 Social conditions TRUE TRUE NULL
#> 9 2413 Traffic accidents TRUE TRUE NULL
#> 10 2414 Criminal offences TRUE TRUE NULL
#> 11 2436 Consumer price index TRUE TRUE NULL
#> 12 2437 Business prices TRUE TRUE NULL
#> 13 2438 Consumption TRUE TRUE NULL
#> 14 2439 Real property TRUE TRUE NULL
#> 15 2440 Cars TRUE TRUE NULL
#> 16 2476 Cost indices TRUE TRUE NULL
Now we would like to see what data tables are available for subjects “Traffic accidents” and “Consumption”:
tables <- get_tables(subjects = c("2413", "2438"))
tables %>%
select(id, text, variables) %>%
kable()
id | text | variables |
---|---|---|
UHELD3 | Road traffic accidents | c(“type of accident”, “accident situation”, “urban area”, “speed limit”, “time”) |
UHELD4 | Road traffic accidents | c(“type of accident”, “means of transport”, “hour”, “day of the week”, “month”, “time”) |
UHELD5 | Road traffic accidents | c(“type of accident”, “accident situation”, “means of transport”, “other units involved”, “time”) |
UHELD6 | Road traffic accidents | c(“type of accident”, “accident situation”, “means of transport”, “number of transport units involved”, “time”) |
UHELDK7 | Road traffic accidents | c(“type of accident”, “region”, “urban area”, “accident situation”, “time”) |
UHELD8 | Injured and killed in road traffic accidents | c(“type of accident”, “casualty”, “means of transport”, “sex”, “age”, “type of injury”, “time”) |
UHELD9 | Injured and killed in road traffic accidents | c(“type of accident”, “casualty”, “means of transport”, “type of person”, “urban area”, “type of road”, “time”) |
UHELD10 | Injured and killed in road traffic accidents | c(“type of accident”, “accident situation”, “casualty”, “means of transport”, “other units involved”, “time”) |
UHELD11 | Injured and killed in road traffic accidents | c(“type of accident”, “casualty”, “means of transport”, “position”, “type of person”, “influenced by alcohol”, “time”) |
UHELDK1 | Injured and killed in road traffic accidents | c(“region”, “casualty”, “motor vehicles involved”, “age”, “sex”, “time”) |
UHELDK2 | Injured and killed in alcohol accidents | c(“region”, “casualty”, “age”, “sex”, “time”) |
UHELD12 | Drivers and pedestrians | c(“means of transport”, “sex”, “age”, “age of driving license”, “level of alcohol in blood”, “time”) |
MOERKE | Injured in road traffic accidents reported by the police and casualty wards | c(“reporter”, “accident situation”, “means of transport”, “sex”, “age”, “type of injury”, “time”) |
BANE91 | Fatalities and injuries in railway traffic accidents | c(“railway system”, “category of person”, “casualty”, “time”) |
BANE92 | Fatalities and injuries in railway traffic accidents | c(“railway system”, “type of accident”, “casualty”, “time”) |
SKIB92 | Accidents at sea involving Danish vessels | c(“type of accident”, “extent”, “time”) |
SKIB93 | Accidents and fatalities onboard Danish vessels Danish ships | c(“type of vessel”, “accidents”, “time”) |
SKIB94 | Accidents at sea in Danish sea territory | c(“type of accident”, “waters”, “extent”, “time”) |
FU01 | Household budget suvevey | c(“background information”, “group of households”, “time”) |
FU02 | Household average consumption | c(“group of consumption”, “price unit”, “time”) |
FU03 | Consumption | c(“group of consumption”, “households”, “price unit”, “time”) |
FU04 | Consumption | c(“group of consumption”, “socioeconomic status”, “price unit”, “time”) |
FU05 | Consumption | c(“group of consumption”, “total income”, “price unit”, “time”) |
FU06 | Consumption | c(“group of consumption”, “type of dwelling”, “price unit”, “time”) |
FU07 | Consumption | c(“group of consumption”, “region”, “price unit”, “time”) |
FU08 | Consumption | c(“group of consumption”, “age”, “price unit”, “time”) |
FU09 | Average household income and consumption | c(“type of income”, “time”) |
ALKO2 | Consumption and sales of alcohol and tobacco, subject to excises duties, by pop. | c(“type”, “time”) |
ALKO3 | Consumption and sales of alcohol, subject to excises duties | c(“type”, “time”) |
ALKO4 | Consumption and sales of alcohol and tobacco, subject to excises duties | c(“type”, “time”) |
VARFORBR | The families possession of home appliances | c(“type of consumption”, “time”) |
FORV1 | Consumer confidence indicator | c(“indicator”, “time”) |
If we would like to examine number of traffic accidents by hour of day (which is an interesing question in itself) we can choose table “UHELD4”. To extract sales and consumption of alcohol, we will use table “ALKO3”. We can now look into the variable codes and values for these 2 tables and determine if we should filter any values.
vars_acc <- get_table_metadata(table_id = "uheld4", variables_only = TRUE)
vars_alco <- get_table_metadata(table_id = "alko3", variables_only = TRUE)
vars_acc %>%
select(id, text)
#> id text
#> 1 UHELDA type of accident
#> 2 TRANSMID means of transport
#> 3 KLOK hour
#> 4 UGE day of the week
#> 5 MND month
#> 6 Tid time
vars_alco %>%
select(id, text)
#> id text
#> 1 TYPE type
#> 2 Tid time
Let us look a bit into the values for the “UHELDA” (type of accident) variable from table “UHELD4” and the “TYPE” variable from table “ALKO3” to determine what the possible values of these are.
|
|
To compare the diurnal patterns of alcohol-related accidents vs. other accidents we will select values c(1000, 2000)
from the “UHELDA” variable and values c("09", "055")
(sales and consumption of lager equivalents) from the “TYPE” variable. As we would like data for all time periods (“Tid”) we will leave the values for that variable as NA
.
We can now pull the needed data from the API - first the accident numbers:
variable_codes <- vars_acc$id[c(1, 3, 6)] # UHELDA, KLOK and Tid
variable_values <- list(c(1000, 2000), NA, NA) # all values for KLOK and Tid
# Construct the variable_input as a list of code-values pairs
variable_input <- purrr::map2(.x = variable_codes, .y = variable_values, .f = ~list(code = .x, values = .y))
# Get data
accidents <- get_data("uheld4", variables = variable_input)
head(accidents) %>% kable()
UHELDA | KLOK | TID | INDHOLD |
---|---|---|---|
Accidents involving persons influenced by alcohol | 00-01 | 1997 | 94 |
Accidents involving persons influenced by alcohol | 00-01 | 1998 | 85 |
Accidents involving persons influenced by alcohol | 00-01 | 1999 | 81 |
Accidents involving persons influenced by alcohol | 00-01 | 2000 | 87 |
Accidents involving persons influenced by alcohol | 00-01 | 2001 | 77 |
Accidents involving persons influenced by alcohol | 00-01 | 2002 | 91 |
and then the alcohol sales and consumption values:
variable_codes <- vars_alco$id
variable_values <- list(c("055", "09"), NA) # All values for Tid
# Construct the variable_input as a list of code-values pairs
variable_input <- purrr::map2(.x = variable_codes, .y = variable_values, .f = ~list(code = .x, values = .y))
# Get data
alcohol <- get_data("alko3", variables = variable_input)
alcohol %>%
filter(INDHOLD != "..") %>% # the API returns ".." as missing values
head() %>%
kable()
TYPE | TID | INDHOLD |
---|---|---|
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2000 | 100 |
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2001 | 97 |
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2002 | 96 |
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2003 | 98 |
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2004 | 93 |
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2005 | 93 |
An interesting analysis would be to see at which time of day most accidents occur:
accidents_by_hour <- accidents %>%
filter(KLOK != "Not stated") %>%
group_by(UHELDA, KLOK) %>%
summarise(mean_accidents = mean(INDHOLD, na.rm = TRUE))
accidents_by_hour %>%
ggplot(aes(x = KLOK, y = mean_accidents, color = UHELDA, group = UHELDA)) +
geom_line() +
geom_point() +
theme_bw() +
theme(legend.position="top") +
labs(x = "Time of day", y = "Average annual accidents")
A pattern is quite clear: non-alcohol related accidents peak during commute hours in the morning and the afternoon, while alcohol related accidents are highest and almost equally likely between 6pm and 3am. It is important to note here that the values on the y-axis are total annual accidents for the average year between 1997 and 2018. So a value of e.g. 50, means that of all accidents in an average year 50, occurred in a given hour (summed over all days in the year) - so these are not average hourly accidents.