Introduction to cattonum

Case study

We’ll demonstrate how to use cattonum by predicting departure delays (dep_delay) in the nycflights13::flights dataset using random forests built with ranger.

library(nycflights13)
library(ranger)
library(cattonum)
#> cattonum is seeking a new maintainer; please respond if interested: https://github.com/bfgray3/cattonum/issues/40
suppressPackageStartupMessages(library(dplyr))

set.seed(4444)

data(flights)
str(flights)
#> tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
#>  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
#>  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
#>  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
#>  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
#>  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
#>  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
#>  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
#>  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
#>  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
#>  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
#>  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
#>  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
#>  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
#>  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
#>  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
#>  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
#>  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

There are a lot of flights here and we don’t want our model training to take forever, so let’s only take a subset of these observations. To simplify our analysis, we’ll analyze only the three airlines with the most flights, consider only flights that were delayed in taking off, and remove some features.

airlines_to_keep <- flights %>%
                      count(carrier) %>%
                      top_n(3, n) %>%
                      pull(carrier)

flights <- flights %>%
             filter(carrier %in% airlines_to_keep, dep_delay > 0) %>%
             select(-c(year, dep_time, sched_dep_time, arr_time, sched_arr_time,
                       arr_delay, flight, tailnum, time_hour))

To get more out of our time features, we’ll apply a quick cyclical transformation: the scheduled minute of the day is mapped to its cosine and sine, so that times just before and just after midnight end up close together. Also, month and day are currently integers, but they are really categorical, so we turn them into characters (factors would work equally well for cattonum). For simplicity, and to maintain focus on cattonum, we simply drop observations with missing values.

tot_mins <- 24 * 60

flights <- flights %>%
             mutate(min_of_day = 60 * hour + minute,
                    cos_min_of_day = cos(2 * pi * min_of_day / tot_mins),
                    sin_min_of_day = sin(2 * pi * min_of_day / tot_mins)) %>%
             select(-c(min_of_day, hour, minute)) %>%
             mutate(month = as.character(month),
                    day = as.character(day)) %>%
             filter(complete.cases(.))
str(flights)
#> tibble [71,473 × 10] (S3: tbl_df/tbl/data.frame)
#>  $ month         : chr [1:71473] "1" "1" "1" "1" ...
#>  $ day           : chr [1:71473] "1" "1" "1" "1" ...
#>  $ dep_delay     : num [1:71473] 2 4 1 11 3 24 8 1 1 1 ...
#>  $ carrier       : chr [1:71473] "UA" "UA" "B6" "UA" ...
#>  $ origin        : chr [1:71473] "EWR" "LGA" "EWR" "JFK" ...
#>  $ dest          : chr [1:71473] "IAH" "IAH" "PBI" "SFO" ...
#>  $ air_time      : num [1:71473] 227 227 147 366 175 52 151 243 380 188 ...
#>  $ distance      : num [1:71473] 1400 1416 1023 2586 1074 ...
#>  $ cos_min_of_day: num [1:71473] 1.95e-01 1.35e-01 6.12e-17 6.12e-17 -4.36e-02 ...
#>  $ sin_min_of_day: num [1:71473] 1.95e-01 1.35e-01 6.12e-17 6.12e-17 -4.36e-02 ...
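
The cyclical transformation relies on using the cosine and the sine together. The following is a minimal base R sketch of that idea, not part of the cattonum workflow; the helper name encode_time and the example times are assumptions made for illustration.

```r
# Sketch: cyclical encoding of the minute of day (base R only).
tot_mins <- 24 * 60

# Hypothetical helper: map minutes after midnight to a point on the unit circle.
encode_time <- function(min_of_day) {
  angle <- 2 * pi * min_of_day / tot_mins
  cbind(cos = cos(angle), sin = sin(angle))
}

# 23:59, 00:01, and 12:00 expressed as minutes after midnight.
pts <- encode_time(c(late = 23 * 60 + 59, early = 1, noon = 12 * 60))

# Euclidean distances between the encoded points: 23:59 and 00:01 are
# neighbours on the circle, while both are far from noon.
round(as.matrix(dist(pts)), 3)

# One component alone is ambiguous: 06:00 and 18:00 share (almost exactly)
# the same cosine and are only told apart by the sign of the sine.
round(encode_time(c(six_am = 6 * 60, six_pm = 18 * 60)), 3)
```

With the raw minute of day, 23:59 and 00:01 would sit at opposite ends of the scale even though they are two minutes apart.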

Now we turn to encoding our categorical features. We’ll compare three approaches: label encoding, mean encoding, and a mix of frequency, label, and mean encoding. First, note a few key facts about the catto_* functions.

They have three or four main parameters, depending on the function: train, ..., response, and test. train holds the training data, ... is where the columns to be encoded are specified (if not supplied, all character and factor columns are encoded), response is the name of the response column for catto_loo and catto_mean, and test (if supplied) holds the test data.
The encoded data will be returned in either a data.frame or a tibble, whichever was passed.
If test is not supplied, the functions return a data.frame or tibble, as described above. Otherwise, they return a length-two list holding the relevant encoded datasets with names train and test.
They are designed to work in dplyr-style pipelines using %>% from magrittr.
Features can be specified in many different ways, as in dplyr. For example, the following are all equivalent for a data frame named dat with columns x1 and x2.

catto_label(dat)
catto_label(dat, x1, x2)
catto_label(dat, c(x1, x2))
catto_label(dat, c("x1", "x2"))
catto_label(dat, one_of(c("x1", "x2"))) # one_of is exported by dplyr
catto_label(dat, one_of("x1", "x2"))
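
The train/test return convention described above can be illustrated with a small base R sketch. The helper mean_encode_split is hypothetical and is not cattonum's implementation; it only mimics the documented shape of the result (a single data frame without test, a list with names train and test otherwise) under the assumption that test rows are encoded with statistics learned on train.

```r
# Hypothetical helper, not part of cattonum: mean-encode one column,
# learning the level means on `train` and reusing them on `test`.
mean_encode_split <- function(train, col, response, test = NULL) {
  # Named numeric vector: one mean of the response per level of `col`.
  level_means <- vapply(split(train[[response]], train[[col]]), mean, numeric(1))
  encode <- function(dat) {
    dat[[col]] <- unname(level_means[as.character(dat[[col]])])
    dat
  }
  if (is.null(test)) {
    return(encode(train))
  }
  list(train = encode(train), test = encode(test))
}

train <- data.frame(x1 = c("a", "b", "a", "b"), y = c(1, 2, 3, 6),
                    stringsAsFactors = FALSE)
test <- data.frame(x1 = c("b", "a"), y = c(0, 0),
                   stringsAsFactors = FALSE)

enc <- mean_encode_split(train, "x1", "y", test = test)
names(enc)     # the two encoded datasets
enc$train$x1   # level means computed from train
enc$test$x1    # the same train means, looked up for the test rows
```

In this sketch the test response is never used, so the encoded test set cannot leak information about its own outcomes.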

Here we make the encoded datasets.

label_encoded <- flights %>%
                   catto_label()
str(label_encoded)
#> tibble [71,473 × 10] (S3: cattonum_df/tbl_df/tbl/data.frame)
#>  $ month         : int [1:71473] 5 5 5 5 5 5 5 5 5 5 ...
#>  $ day           : int [1:71473] 20 20 20 20 20 20 20 20 20 20 ...
#>  $ dep_delay     : num [1:71473] 2 4 1 11 3 24 8 1 1 1 ...
#>  $ carrier       : int [1:71473] 3 3 1 3 1 2 3 3 3 3 ...
#>  $ origin        : int [1:71473] 3 1 3 2 2 3 3 1 3 3 ...
#>  $ dest          : int [1:71473] 92 92 91 96 83 88 97 89 47 87 ...
#>  $ air_time      : num [1:71473] 227 227 147 366 175 52 151 243 380 188 ...
#>  $ distance      : num [1:71473] 1400 1416 1023 2586 1074 ...
#>  $ cos_min_of_day: num [1:71473] 1.95e-01 1.35e-01 6.12e-17 6.12e-17 -4.36e-02 ...
#>  $ sin_min_of_day: num [1:71473] 1.95e-01 1.35e-01 6.12e-17 6.12e-17 -4.36e-02 ...

mean_encoded <- flights %>%
                  catto_mean(response = dep_delay)
str(mean_encoded)
#> tibble [71,473 × 10] (S3: cattonum_df/tbl_df/tbl/data.frame)
#>  $ month         : num [1:71473] 35.2 35.2 35.2 35.2 35.2 ...
#>  $ day           : num [1:71473] 42.4 42.4 42.4 42.4 42.4 ...
#>  $ dep_delay     : num [1:71473] 2 4 1 11 3 24 8 1 1 1 ...
#>  $ carrier       : num [1:71473] 29.8 29.8 39.7 29.8 39.7 ...
#>  $ origin        : num [1:71473] 38.5 47.4 38.5 37.6 37.6 ...
#>  $ dest          : num [1:71473] 27.6 27.6 35.3 30.7 29.3 ...
#>  $ air_time      : num [1:71473] 227 227 147 366 175 52 151 243 380 188 ...
#>  $ distance      : num [1:71473] 1400 1416 1023 2586 1074 ...
#>  $ cos_min_of_day: num [1:71473] 1.95e-01 1.35e-01 6.12e-17 6.12e-17 -4.36e-02 ...
#>  $ sin_min_of_day: num [1:71473] 1.95e-01 1.35e-01 6.12e-17 6.12e-17 -4.36e-02 ...

mix_encoded <- flights %>%
                 catto_freq(dest) %>%
                 catto_label(origin) %>%
                 catto_mean(response = dep_delay)
str(mix_encoded)
#> tibble [71,473 × 10] (S3: cattonum_df/cattonum_df/cattonum_df/tbl_df/tbl/data.frame)
#>  $ month         : num [1:71473] 35.2 35.2 35.2 35.2 35.2 ...
#>  $ day           : num [1:71473] 42.4 42.4 42.4 42.4 42.4 ...
#>  $ dep_delay     : num [1:71473] 2 4 1 11 3 24 8 1 1 1 ...
#>  $ carrier       : num [1:71473] 29.8 29.8 39.7 29.8 39.7 ...
#>  $ origin        : int [1:71473] 3 1 3 2 2 3 3 1 3 3 ...
#>  $ dest          : int [1:71473] 3142 3142 2340 3547 1129 1767 3999 1996 336 1562 ...
#>  $ air_time      : num [1:71473] 227 227 147 366 175 52 151 243 380 188 ...
#>  $ distance      : num [1:71473] 1400 1416 1023 2586 1074 ...
#>  $ cos_min_of_day: num [1:71473] 1.95e-01 1.35e-01 6.12e-17 6.12e-17 -4.36e-02 ...
#>  $ sin_min_of_day: num [1:71473] 1.95e-01 1.35e-01 6.12e-17 6.12e-17 -4.36e-02 ...
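
To make concrete what the three kinds of encoding used above compute, here is a base R sketch on a toy data frame. It is an illustration of the general techniques rather than cattonum's code; in particular, the integer assigned to each level by the label encoding here follows alphabetical order, which may differ from the ordering cattonum uses.

```r
# Toy data (made up for illustration).
toy <- data.frame(
  origin = c("EWR", "JFK", "EWR", "LGA", "EWR", "JFK"),
  dep_delay = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)

# Label encoding: one integer per level (alphabetical order in this sketch).
label <- as.integer(factor(toy$origin))

# Frequency encoding: the number of rows that share the level.
freq <- as.integer(table(toy$origin)[toy$origin])

# Mean encoding: the mean of the response within the level.
mean_enc <- ave(toy$dep_delay, toy$origin)

data.frame(toy, label, freq, mean_enc)
```

Label encoding carries no information about the response, frequency encoding reflects how common a level is, and mean encoding ties each level directly to the response.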

Now we can finally build the models. We define a short function get_oob_error that builds an untuned random forest and returns the out-of-bag error.

encodings <- list(label = label_encoded,
                  mean = mean_encoded,
                  mix = mix_encoded)

get_oob_error <- function(dat) {
  rf <- ranger(data = as.data.frame(dat), # ranger can't handle tibbles
               num.trees = 100,
               dependent.variable.name = "dep_delay")
  rf$prediction.error
}

lapply(encodings, get_oob_error)
#> $label
#> [1] 2433.914
#> 
#> $mean
#> [1] 2421.435
#> 
#> $mix
#> [1] 2415.151

The mixed encoding gives us the lowest OOB error, followed by mean encoding and then label encoding, although the three errors differ by less than one percent. Simply comparing the OOB error of untuned random forests of 100 trees is not really a fair comparison, but it demonstrates the basic features of cattonum.
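
For regression forests, ranger reports prediction.error as the out-of-bag mean squared error, so the numbers above are in squared minutes. As a rough aid to interpretation, the sketch below takes the three values copied from the output and converts them to root mean squared errors; the variable names are assumptions made for illustration.

```r
# OOB mean squared errors copied from the output above (minutes squared).
oob_mse <- c(label = 2433.914, mean = 2421.435, mix = 2415.151)

# Root mean squared error, in minutes of departure delay.
oob_rmse <- sqrt(oob_mse)
round(oob_rmse, 2)

# Relative gap between the worst and the best encoding, in percent.
gap_pct <- 100 * (max(oob_mse) - min(oob_mse)) / max(oob_mse)
round(gap_pct, 2)
```

All three forests are off by roughly 49 minutes on this scale, which puts the small differences between encodings in perspective.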
