This vignette introduces the danstat
package by means of a simple use case to illustrate the data discovery process. The package is designed to give easy and intuitive access to the Danmarks Statistik (Statistics Denmark) Statistikbank API which allows access to all published data in StatBank. As a public institution, Statistics Denmark provides open access to the data, so the API doesn’t require authentication and has no rate limits. The data is free to re-use and reproduce even for commercial use, however, source accreditation to Statistics Denmark should be given. See terms for the Use and reuse of data.
library(danstat)
library(purrr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(kableExtra)
#>
#> Attaching package: 'kableExtra'
#> The following object is masked from 'package:dplyr':
#>
#> group_rows
In this example we would like to examine alcohol related incidents and the consumption of alcohol in Denmark. An inital call to get_subjects()
returns the following data frame:
get_subjects()
#> id description active hasSubjects subjects
#> 1 1 People TRUE TRUE NULL
#> 2 2 Labour and income TRUE TRUE NULL
#> 3 3 Economy TRUE TRUE NULL
#> 4 4 Social conditions TRUE TRUE NULL
#> 5 5 Education and research TRUE TRUE NULL
#> 6 6 Business TRUE TRUE NULL
#> 7 7 Transport TRUE TRUE NULL
#> 8 8 Culture and leisure TRUE TRUE NULL
#> 9 9 Environment and energy TRUE TRUE NULL
#> 10 19 Other TRUE TRUE NULL
The subjects of interest for us are “Business” and “Transport”. We look into more details into these to subjects:
<- get_subjects(subjects = c("6","7"))
subj <- subj$subjects %>% bind_rows()
subsubjects
subsubjects#> id description active hasSubjects subjects
#> 1 3456 Structure of business TRUE TRUE NULL
#> 2 3457 Business economy TRUE TRUE NULL
#> 3 3458 International enterprises TRUE TRUE NULL
#> 4 3460 Agriculture, horticulture and forestry TRUE TRUE NULL
#> 5 3461 Fishery and aquaculture TRUE TRUE NULL
#> 6 3462 Manufacturing industries TRUE TRUE NULL
#> 7 3463 Construction TRUE TRUE NULL
#> 8 3465 Distributive trades TRUE TRUE NULL
#> 9 3466 Accommodation and travel TRUE TRUE NULL
#> 10 3450 Financial sector TRUE TRUE NULL
#> 11 3467 Service sector TRUE TRUE NULL
#> 12 3459 Business tendency surveys TRUE TRUE NULL
#> 13 3440 Means of transport TRUE TRUE NULL
#> 14 3469 Infrastructure TRUE TRUE NULL
#> 15 3115 Passenger transport TRUE TRUE NULL
#> 16 3464 Goods transport TRUE TRUE NULL
#> 17 3413 Traffic accidents TRUE TRUE NULL
Now we would like to see what data tables are available for subjects “Distributive trades” and “Traffic accidents”:
<- get_tables(subjects = c("3465", "3413"))
tables %>%
tables select(id, text, variables) %>%
kable()
id | text | variables |
---|---|---|
DETA151 | Retail Trade Index | industry (DB07), time |
DETA152 | Retail Trade Index | commodity group, index type , time |
GTIN | Retail of groceries calculated on the basis of transaction data | commodity group, unit , time |
ALKO4 | Consumption and sales of alcohol and tobacco, subject to excises duties | type, time |
ALKO3 | Consumption and sales of alcohol, subject to excises duties | type, time |
ALKO2 | Consumption and sales of alcohol and tobacco, subject to excises duties, by pop. | type, time |
UHELD3 | Road traffic accidents | type of accident , accident situation, urban area , speed limit , time |
UHELD4 | Road traffic accidents | type of accident , means of transport, hour , day of the week , month , time |
UHELD5 | Road traffic accidents | type of accident , accident situation , means of transport , other units involved, time |
UHELD6 | Road traffic accidents | type of accident , accident situation , means of transport , number of transport units involved, time |
UHELDK7 | Road traffic accidents | type of accident , region , urban area , accident situation, time |
UHELD8 | Injured and killed in road traffic accidents | type of accident , casualty , means of transport, sex , age , type of injury , time |
UHELD9 | Injured and killed in road traffic accidents | type of accident , casualty , means of transport, type of person , urban area , type of road , time |
UHELD10 | Injured and killed in road traffic accidents | type of accident , accident situation , casualty , means of transport , other units involved, time |
UHELD11 | Injured and killed in road traffic accidents | type of accident , casualty , means of transport , position , type of person , influenced by alcohol, time |
UHELDK1 | Injured and killed in road traffic accidents | region , casualty , motor vehicles involved, age , sex , time |
UHELDK2 | Injured and killed in alcohol accidents | region , casualty, age , sex , time |
UHELD12 | Drivers and pedestrians | means of transport , sex , age , age of driving license , level of alcohol in blood, time |
MOERKE | Injured in road traffic accidents reported by the police and casualty wards | reporter , accident situation, means of transport, sex , age , type of injury , time |
MOERKE1 | Injured in road traffic accidents reported by casualty wards | reporter , accident situation , means of transport , sex , age , casualty (classified by diagnosis), time |
BANE91 | Fatalities and injuries in railway traffic accidents | railway system , category of person, casualty , time |
BANE92 | Fatalities and injuries in railway traffic accidents | railway system , type of accident, casualty , time |
SKIB95 | Accidents at sea involving danish vessels | type of accident, type of vessel , scope , time |
SKIB96 | Accidents at sea in Danish sea territory | type of accident , type of vessel , scope , land of registration, time |
If we would like to examine number of traffic accidents by hour of day (which is an interesting question in itself) we can choose table “UHELD4”. To extract sales and consumption of alcohol, we will use table “ALKO3”. We can now look into the variable codes and values for these 2 tables and determine if we should filter any values.
<- get_table_metadata(table_id = "uheld4", variables_only = TRUE)
vars_acc <- get_table_metadata(table_id = "alko3", variables_only = TRUE)
vars_alco
%>%
vars_acc select(id, text)
#> id text
#> 1 UHELDA type of accident
#> 2 TRANSMID means of transport
#> 3 KLOK hour
#> 4 UGE day of the week
#> 5 MND month
#> 6 Tid time
%>%
vars_alco select(id, text)
#> id text
#> 1 TYPE type
#> 2 Tid time
Let us look a bit into the values for the “UHELDA” (type of accident) variable from table “UHELD4” and the “TYPE” variable from table “ALKO3” to determine what the possible values of these are.
$values[1] %>%
vars_acckable()
|
$values[1] %>%
vars_alcokable()
|
To compare the diurnal patterns of alcohol-related accidents vs. other accidents we will select values c(1000, 2000)
from the “UHELDA” variable and values c("09", "055")
(sales and consumption of lager equivalents) from the “TYPE” variable. As we would like data for all time periods (“Tid”) we will leave the values for that variable as NA
.
We can now pull the needed data from the API - first the accident numbers:
<- vars_acc$id[c(1, 3, 6)] # UHELDA, KLOK and Tid
variable_codes <- list(c(1000, 2000), NA, NA) # all values for KLOK and Tid
variable_values
# Construct the variable_input as a list of code-values pairs
<- purrr::map2(.x = variable_codes, .y = variable_values, .f = ~list(code = .x, values = .y))
variable_input
# Get data
<- get_data("uheld4", variables = variable_input)
accidents #> Rows: 1200 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (2): UHELDA, KLOK
#> dbl (2): TID, INDHOLD
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(accidents) %>% kable()
UHELDA | KLOK | TID | INDHOLD |
---|---|---|---|
Accidents involving persons influenced by alcohol | 00-01 | 1997 | 94 |
Accidents involving persons influenced by alcohol | 00-01 | 1998 | 85 |
Accidents involving persons influenced by alcohol | 00-01 | 1999 | 81 |
Accidents involving persons influenced by alcohol | 00-01 | 2000 | 87 |
Accidents involving persons influenced by alcohol | 00-01 | 2001 | 77 |
Accidents involving persons influenced by alcohol | 00-01 | 2002 | 91 |
and then the alcohol sales and consumption values:
<- vars_alco$id
variable_codes <- list(c("055", "09"), NA) # All values for Tid
variable_values
# Construct the variable_input as a list of code-values pairs
<- purrr::map2(.x = variable_codes, .y = variable_values, .f = ~list(code = .x, values = .y))
variable_input
# Get data
<- get_data("alko3", variables = variable_input)
alcohol #> Rows: 132 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (2): TYPE, INDHOLD
#> dbl (1): TID
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
%>%
alcohol filter(INDHOLD != "..") %>% # the API returns ".." as missing values
head() %>%
kable()
TYPE | TID | INDHOLD |
---|---|---|
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2000 | 100 |
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2001 | 97 |
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2002 | 96 |
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2003 | 98 |
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2004 | 93 |
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2005 | 93 |
An interesting analysis would be to see at which time of day most accidents occur:
<- accidents %>%
accidents_by_hour filter(KLOK != "Not stated") %>%
group_by(UHELDA, KLOK) %>%
summarise(mean_accidents = mean(INDHOLD, na.rm = TRUE))
#> `summarise()` has grouped output by 'UHELDA'. You can override using the
#> `.groups` argument.
%>%
accidents_by_hour ggplot(aes(x = KLOK, y = mean_accidents, color = UHELDA, group = UHELDA)) +
geom_line() +
geom_point() +
theme_bw() +
theme(legend.position="top") +
labs(x = "Time of day", y = "Average annual accidents")
A pattern is quite clear: non-alcohol related accidents peak during commute hours in the morning and the afternoon, while alcohol related accidents are highest and almost equally likely between 6pm and 3am. It is important to note here that the values on the y-axis are total annual accidents for the average year between 1997 and 2018. So a value of e.g. 50, means that of all accidents in an average year 50 occurred in a given hour (summed over all days in the year) - so these are not average hourly accidents.