`vtreat` can now effectively prepare data for multi-class classification or multinomial modeling.
The two functions needed (`mkCrossFrameMExperiment()` and the S3 method `prepare.multinomial_plan()`) are now part of `vtreat`.
Let’s work a specific example: trying to model multi-class `y` as a function of `x1`, `x2`, and `x3`.
library("vtreat")
## Loading required package: wrapr
# create example data
set.seed(326346)
sym_bonuses <- rnorm(3)
names(sym_bonuses) <- c("a", "b", "c")
sym_bonuses3 <- rnorm(3)
names(sym_bonuses3) <- as.character(seq_len(length(sym_bonuses3)))
n_row <- 1000
d <- data.frame(
  x1 = rnorm(n_row),
  x2 = sample(names(sym_bonuses), n_row, replace = TRUE),
  x3 = sample(names(sym_bonuses3), n_row, replace = TRUE),
  y = "NoInfo",
  stringsAsFactors = FALSE)
d$y[sym_bonuses[d$x2] >
      pmax(d$x1, sym_bonuses3[d$x3], runif(n_row))] <- "Large1"
d$y[sym_bonuses3[d$x3] >
      pmax(sym_bonuses[d$x2], d$x1, runif(n_row))] <- "Large2"
knitr::kable(head(d))
x1 | x2 | x3 | y |
---|---|---|---|
0.8178292 | a | 2 | NoInfo |
0.5867139 | c | 1 | NoInfo |
-0.6711920 | a | 3 | Large2 |
0.1033166 | a | 2 | Large1 |
-0.3182176 | c | 3 | Large2 |
-0.5914308 | c | 2 | NoInfo |
We define the problem controls and use `mkCrossFrameMExperiment()` to build both a cross-frame and a treatment plan.
# define problem
vars <- c("x1", "x2", "x3")
y_name <- "y"
# build the multi-class cross frame and treatments
cfe_m <- mkCrossFrameMExperiment(d, vars, y_name)
The cross-frame is the safest data to train on (unless you have set aside a separate data split for the treatment design step). It uses cross-validation to reduce nested model bias. Some notes on this issue are available here, and here.
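To see why the cross-frame matters, here is a minimal base-R sketch of nested model bias. It is not vtreat's implementation: it uses a plain per-level mean of a numeric outcome as a stand-in for an impact code, and the names `naive_code`, `cross_code`, and `fold`, as well as the choice of 5 folds, are assumptions made for illustration.

```r
set.seed(2024)
n <- 1000

# a high-cardinality categorical variable with NO real relation to y
x <- sample(paste0("lev", 1:500), n, replace = TRUE)
y <- rnorm(n)

# naive code: per-level mean of y, fit on the same rows it then scores
naive_code <- ave(y, x, FUN = mean)

# cross-validated code: each row is scored by a code fit on the other folds only
k <- 5  # assumed number of folds
fold <- sample(rep(seq_len(k), length.out = n))
cross_code <- numeric(n)
for (f in seq_len(k)) {
  in_fold <- fold == f
  # per-level means estimated without the rows in fold f
  level_means <- vapply(split(y[!in_fold], x[!in_fold]), mean, numeric(1))
  coded <- unname(level_means[x[in_fold]])
  # levels not seen outside fold f fall back to the out-of-fold grand mean
  coded[is.na(coded)] <- mean(y[!in_fold])
  cross_code[in_fold] <- coded
}

cor_naive <- cor(naive_code, y)
cor_cross <- cor(cross_code, y)
print(c(naive = cor_naive, cross = cor_cross))
```

Because `x` is pure noise here, any sizable correlation between `naive_code` and `y` is an artifact of scoring rows with a code that was fit on those same rows. The out-of-fold code should show a correlation near zero, which is the honest answer.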
# look at the data we would train models on
str(cfe_m$cross_frame)
## 'data.frame': 1000 obs. of 16 variables:
## $ x1 : num 0.818 0.587 -0.671 0.103 -0.318 ...
## $ x2_catP : num 0.333 0.334 0.333 0.333 0.334 0.334 0.333 0.333 0.333 0.333 ...
## $ x3_catP : num 0.35 0.321 0.329 0.35 0.329 0.35 0.321 0.321 0.321 0.35 ...
## $ x2_lev_x_a : num 1 0 1 1 0 0 0 0 1 1 ...
## $ x2_lev_x_b : num 0 0 0 0 0 0 1 1 0 0 ...
## $ x2_lev_x_c : num 0 1 0 0 1 1 0 0 0 0 ...
## $ x3_lev_x_1 : num 0 1 0 0 0 0 1 1 1 0 ...
## $ x3_lev_x_2 : num 1 0 0 1 0 1 0 0 0 1 ...
## $ x3_lev_x_3 : num 0 0 1 0 1 0 0 0 0 0 ...
## $ Large1_x2_catB: num 1.23 -10.72 1.15 1.16 -10.53 ...
## $ Large1_x3_catB: num 0.7025 0.0903 -10.4833 0.6238 -10.529 ...
## $ Large2_x2_catB: num 0.17979 0.19661 -0.00379 -0.09818 0.00627 ...
## $ Large2_x3_catB: num -13.12 -13.05 4.49 -4.03 4.71 ...
## $ NoInfo_x2_catB: num -0.48752 -0.00254 -0.27947 -0.26155 0.15195 ...
## $ NoInfo_x3_catB: num 2.05 2.43 -4.34 1.79 -4.55 ...
## $ y : chr "NoInfo" "NoInfo" "Large2" "Large1" ...
`prepare()` can apply the designed treatments to new data. Here we are simulating new data by re-using our design data.
# pretend original data is new data to be treated
# NA out top row to show processing
for(vi in vars) {
  d[[vi]][[1]] <- NA
}
str(prepare(cfe_m$treat_m, d))
## 'data.frame': 1000 obs. of 16 variables:
## $ x1 : num 0.0205 0.5867 -0.6712 0.1033 -0.3182 ...
## $ x2_catP : num 0.0005 0.334 0.333 0.333 0.334 0.334 0.333 0.333 0.333 0.333 ...
## $ x3_catP : num 0.0005 0.321 0.329 0.35 0.329 0.35 0.321 0.321 0.321 0.35 ...
## $ x2_lev_x_a : num 0 0 1 1 0 0 0 0 1 1 ...
## $ x2_lev_x_b : num 0 0 0 0 0 0 1 1 0 0 ...
## $ x2_lev_x_c : num 0 1 0 0 1 1 0 0 0 0 ...
## $ x3_lev_x_1 : num 0 1 0 0 0 0 1 1 1 0 ...
## $ x3_lev_x_2 : num 0 0 0 1 0 1 0 0 0 1 ...
## $ x3_lev_x_3 : num 0 0 1 0 1 0 0 0 0 0 ...
## $ Large1_x2_catB: num 0 -10.58 1.18 1.18 -10.58 ...
## $ Large1_x3_catB: num 0 0.284 -10.584 0.529 -10.584 ...
## $ Large2_x2_catB: num 0 0.1 0.0242 0.0242 0.1 ...
## $ Large2_x3_catB: num 0 -13.08 4.72 -4.43 4.72 ...
## $ NoInfo_x2_catB: num 0 0.0685 -0.3392 -0.3392 0.0685 ...
## $ NoInfo_x3_catB: num 0 2.39 -4.55 2.05 -4.55 ...
## $ y : chr "NoInfo" "NoInfo" "Large2" "Large1" ...
Obvious issues include computing variable importance, and the blow-up and co-dependency of the produced columns. We leave these for the next modeling step to deal with (this is our philosophy with most issues that involve joint distributions of variables).
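As a small illustration of the co-dependency issue, the base-R sketch below builds one indicator column per level of a toy categorical variable (similar in spirit to the `x2_lev_x_*` columns, but produced with `model.matrix()` rather than vtreat) and checks the rank of the resulting design matrix. The toy vector and the names `ind` and `design` are assumptions for illustration.

```r
# toy categorical variable with the same three levels as x2
x2 <- factor(c("a", "c", "a", "a", "c", "b"))

# one indicator column per level (no reference level dropped)
ind <- model.matrix(~ 0 + x2)

# on rows with a known level, the indicators always sum to 1
row_sums <- rowSums(ind)

# adding an intercept column, as most linear models do, makes the columns linearly dependent
design <- cbind(intercept = 1, ind)
rank_design <- qr(design)$rank

print(ncol(design))   # number of columns
print(rank_design)    # numerical rank, one less than the number of columns
```

A downstream modeling step has to cope with this, for example through regularization or by dropping redundant columns.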
We also have per-outcome variable importance.
knitr::kable(
  cfe_m$score_frame[,
                    c("varName", "rsq", "sig", "outcome_level"),
                    drop = FALSE])
varName | rsq | sig | outcome_level |
---|---|---|---|
x1 | 0.0427675 | 0.0002015 | Large1 |
x2_catP | 0.0979334 | 0.0000000 | Large1 |
x2_lev_x_a | 0.2681130 | 0.0000000 | Large1 |
x2_lev_x_b | 0.0975700 | 0.0000000 | Large1 |
x2_lev_x_c | 0.0979334 | 0.0000000 | Large1 |
x3_catP | 0.0125618 | 0.0439536 | Large1 |
x3_lev_x_1 | 0.0053772 | 0.1874933 | Large1 |
x3_lev_x_2 | 0.0266092 | 0.0033678 | Large1 |
x3_lev_x_3 | 0.0961219 | 0.0000000 | Large1 |
x1 | 0.0003984 | 0.4784542 | Large2 |
x2_catP | 0.0008969 | 0.2875322 | Large2 |
x2_lev_x_a | 0.0000512 | 0.7994128 | Large2 |
x2_lev_x_b | 0.0013961 | 0.1845435 | Large2 |
x2_lev_x_c | 0.0008969 | 0.2875322 | Large2 |
x3_catP | 0.0574052 | 0.0000000 | Large2 |
x3_lev_x_1 | 0.2546121 | 0.0000000 | Large2 |
x3_lev_x_2 | 0.2659830 | 0.0000000 | Large2 |
x3_lev_x_3 | 0.9308590 | 0.0000000 | Large2 |
x1 | 0.0035420 | 0.0312177 | NoInfo |
x2_catP | 0.0004091 | 0.4641054 | NoInfo |
x2_lev_x_a | 0.0108027 | 0.0001684 | NoInfo |
x2_lev_x_b | 0.0072297 | 0.0020855 | NoInfo |
x2_lev_x_c | 0.0004091 | 0.4641054 | NoInfo |
x3_catP | 0.0416046 | 0.0000000 | NoInfo |
x3_lev_x_1 | 0.1848006 | 0.0000000 | NoInfo |
x3_lev_x_2 | 0.1796720 | 0.0000000 | NoInfo |
x3_lev_x_3 | 0.7228777 | 0.0000000 | NoInfo |
Large1_x2_catB | 0.2679354 | 0.0000000 | Large1 |
Large1_x3_catB | 0.0835409 | 0.0000002 | Large1 |
Large2_x2_catB | 0.0002176 | 0.6004146 | Large2 |
Large2_x3_catB | 0.9064823 | 0.0000000 | Large2 |
NoInfo_x2_catB | 0.0080585 | 0.0011565 | NoInfo |
NoInfo_x3_catB | 0.7143906 | 0.0000000 | NoInfo |
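One possible use of the per-outcome scores is to pick candidate variables separately for each outcome level. The sketch below copies a few rows from the table above into a small data frame and keeps the variables whose `sig` falls below a cutoff. The cutoff of 0.01 and the names `threshold`, `keep`, and `selected` are hypothetical choices for illustration, not vtreat defaults.

```r
# a few rows copied from the score frame shown above
score_frame <- data.frame(
  varName = c("x1", "x2_lev_x_a", "x3_lev_x_1",
              "x1", "x2_lev_x_a", "x3_lev_x_3"),
  sig = c(0.0002015, 0.0000000, 0.1874933,
          0.4784542, 0.7994128, 0.0000000),
  outcome_level = c("Large1", "Large1", "Large1",
                    "Large2", "Large2", "Large2"),
  stringsAsFactors = FALSE)

# hypothetical significance cutoff
threshold <- 0.01

# keep rows that pass the cutoff, then list variable names per outcome level
keep <- score_frame[score_frame$sig < threshold, , drop = FALSE]
selected <- split(keep$varName, keep$outcome_level)
print(selected)
```

On the real `cfe_m$score_frame` the same two lines would apply unchanged. A sensible cutoff depends on how many candidate variables are being screened.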
One can relate these per-target and per-treatment performances back to original columns by aggregating.
tapply(cfe_m$score_frame$rsq,
       cfe_m$score_frame$origName,
       max)
## x1 x2 x3
## 0.04276746 0.26811298 0.93085900
tapply(cfe_m$score_frame$sig,
       cfe_m$score_frame$origName,
       min)
## x1 x2 x3
## 2.015164e-04 1.315559e-20 2.777723e-257