Abstract
The packagetidyRules
is meant to extract parsable rules from model objects to a tibble/data.frame
format. Package supports the following models - C5.0
, rpart
and cubist
. The output rules are parsable by R, python (pandas query) and SQL (with WHERE clause).
library("tidyrules")
library("dplyr")
library("C50")
library("pander")
# build model
c5_model <- C5.0(Species ~ ., data = iris, rules = TRUE)
# extract rules in a tidy tibble
tidy_rules <- tidyRules(c5_model)
# View tidy_rules
tidy_rules %>%
select(-c(rule_number,trial_number)) %>%
pandoc.table()
##
## -----------------------------------------------------------------------
## id LHS RHS support confidence lift
## ---- ----------------------- ------------ --------- ------------ ------
## 1 Petal.Length <= 1.9 setosa 50 0.9808 2.9
##
## 2 Petal.Length > 1.9 & versicolor 48 0.96 2.9
## Petal.Length <= 4.9 &
## Petal.Width <= 1.7
##
## 3 Petal.Width > 1.7 virginica 46 0.9583 2.9
##
## 4 Petal.Length > 4.9 virginica 46 0.9375 2.8
## -----------------------------------------------------------------------
Filter rules based on RHS
or support
or confidence
or lift
:
# Example 1, filter rules based on support
tidy_rules %>%
filter(support >= 48) %>%
select(LHS, RHS)
## # A tibble: 2 x 2
## LHS RHS
## <chr> <chr>
## 1 Petal.Length <= 1.9 setosa
## 2 Petal.Length > 1.9 & Petal.Length <= 4.9 & Petal.Width <= 1.7 versicolor
# Example 2, filter rules based on RHS
tidy_rules %>%
filter(RHS == "virginica") %>%
select(LHS, support, confidence, lift)
## # A tibble: 2 x 4
## LHS support confidence lift
## <chr> <int> <dbl> <dbl>
## 1 Petal.Width > 1.7 46 0.958 2.9
## 2 Petal.Length > 4.9 46 0.938 2.8
Use a tidyrule
in a filter()
function :
iris %>%
filter(eval(parse(text = tidy_rules[3,"LHS"]))) %>% # filter using a C5 rule
count(Species)
## # A tibble: 2 x 2
## Species n
## <fct> <int>
## 1 versicolor 1
## 2 virginica 45
tidyrules
C5.0
In this example we use attrition
data from rsample
package. This illustration shows how to extract rules from C5.0
model and applying filter()
based on tidyrules.
# loading packages
library("tidyrules")
library("C50")
library("dplyr")
# attrition data load
data("attrition", package = "modeldata")
attrition <- as_tibble(attrition)
glimpse(attrition)
## Rows: 1,470
## Columns: 31
## $ Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, …
## $ Attrition <fct> Yes, No, Yes, No, No, No, No, No, No, No, No…
## $ BusinessTravel <fct> Travel_Rarely, Travel_Frequently, Travel_Rar…
## $ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358…
## $ Department <fct> Sales, Research_Development, Research_Develo…
## $ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26,…
## $ Education <ord> College, Below_College, College, Master, Bel…
## $ EducationField <fct> Life_Sciences, Life_Sciences, Other, Life_Sc…
## $ EnvironmentSatisfaction <ord> Medium, High, Very_High, Very_High, Low, Ver…
## $ Gender <fct> Female, Male, Male, Female, Male, Male, Fema…
## $ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, …
## $ JobInvolvement <ord> High, Medium, Medium, High, High, High, Very…
## $ JobLevel <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1,…
## $ JobRole <fct> Sales_Executive, Research_Scientist, Laborat…
## $ JobSatisfaction <ord> Very_High, Medium, High, High, Medium, Very_…
## $ MaritalStatus <fct> Single, Married, Single, Married, Married, S…
## $ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 26…
## $ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 11864, 996…
## $ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5,…
## $ OverTime <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, …
## $ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, …
## $ PerformanceRating <ord> Excellent, Outstanding, Excellent, Excellent…
## $ RelationshipSatisfaction <ord> Low, Very_High, Medium, High, Very_High, Hig…
## $ StockOptionLevel <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0,…
## $ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, …
## $ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4,…
## $ WorkLifeBalance <ord> Bad, Better, Better, Better, Better, Good, G…
## $ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4…
## $ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2,…
## $ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0,…
## $ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3,…
As you could see, there are 31 variables and 1470 observations are present this data-set. Here our aim is to predict Attrition using rest of the variables. Let us build a C5.0
model first.
# our C5 model
c5_att <- C5.0(Attrition ~ ., data = attrition, rules = TRUE)
# sample rules from C5
c5_att$output %>%
stringr::str_sub(start = 194L
, end = 578L) %>%
writeLines()
##
## Rule 1: (521/30, lift 1.1)
## EnvironmentSatisfaction in [Medium-Very_High]
## JobInvolvement in [Medium-Very_High]
## OverTime = No
## TrainingTimesLastYear > 1
## WorkLifeBalance in [Better-Best]
## -> class No [0.941]
##
## Rule 2: (195/14, lift 1.1)
## JobRole = Research_Scientist
## OverTime = No
## -> class No [0.924]
##
## Rule 3: (1347/183, lift 1.0)
## TotalWorkingYears > 2
## -> class No [0.864]
We get nice and human readable rules. Now problem with C5.0
summary is, you can only read and get a feel of how your predictions made based on rules. But here comes the hard part, imagine if you want to explore further about your data and you want to dig deeper, if you want to know rules which are throwing high lift and confidence, or you may be interested in rules which covers major sub-population. If in case your model is giving too many rules then that is the hardest part to go through each and every rules and identifying best rules out of the summary.
What if we have all the rules in a tidy table format so that we could easily use them on the data. Let’s get it done using tidyRules
.
# Extract rules to a tidy tibble
tr_att <- tidyRules(c5_att)
tr_att
## # A tibble: 24 x 8
## id LHS RHS support confidence lift rule_number trial_number
## <int> <chr> <chr> <int> <dbl> <dbl> <int> <int>
## 1 1 EnvironmentSat… No 521 0.941 1.1 1 1
## 2 2 JobRole == 'Re… No 195 0.924 1.1 2 1
## 3 3 TotalWorkingYe… No 1347 0.864 1 3 1
## 4 4 JobLevel <= 1 … Yes 16 0.944 5.9 4 1
## 5 5 DailyRate <= 7… Yes 13 0.933 5.8 5 1
## 6 6 EnvironmentSat… Yes 9 0.909 5.6 6 1
## 7 7 EnvironmentSat… Yes 9 0.909 5.6 7 1
## 8 8 JobRole == 'La… Yes 6 0.875 5.4 8 1
## 9 9 JobRole %in% c… Yes 14 0.875 5.4 9 1
## 10 10 Department == … Yes 13 0.867 5.4 10 1
## # … with 14 more rows
tidyRules
important columns to notice :
LHS
: Rules.RHS
: Predicted Class.support
: Number of observation covered by the rule.confidence
: Prediction accuracy for respective class. (laplace correction is implemented by default)lift
: The result of dividing the rule’s estimated accuracy by the relative frequency of the predicted class in the training set.Let’s have a look at first five rules
tr_att %>%
head(5) %>%
select(LHS,RHS) %>%
pandoc.table(split.cells = 60)
##
## -------------------------------------------------------------------
## LHS RHS
## ------------------------------------------------------------- -----
## EnvironmentSatisfaction %in% c('Medium', 'High', No
## 'Very_High') & JobInvolvement %in% c('Medium', 'High',
## 'Very_High') & OverTime == 'No' & TrainingTimesLastYear > 1
## & WorkLifeBalance %in% c('Better', 'Best')
##
## JobRole == 'Research_Scientist' & OverTime == 'No' No
##
## TotalWorkingYears > 2 No
##
## JobLevel <= 1 & MonthlyIncome <= 2468 & OverTime == 'Yes' & Yes
## TotalWorkingYears > 2 & YearsAtCompany <= 3
##
## DailyRate <= 722 & JobLevel <= 1 & MonthlyIncome <= 2468 & Yes
## OverTime == 'Yes' & TotalWorkingYears > 2
## -------------------------------------------------------------------
Now, all the rules are in tibble
(a tidy form of dataframe
) format. Let us look at rules which favors only Attrition is equal to “No” and arrange by support.
rules_example_1 <- tr_att %>%
filter(RHS == "No") %>%
arrange(desc(support))
rules_example_1
## # A tibble: 3 x 8
## id LHS RHS support confidence lift rule_number trial_number
## <int> <chr> <chr> <int> <dbl> <dbl> <int> <int>
## 1 3 TotalWorkingYea… No 1347 0.864 1 3 1
## 2 1 EnvironmentSati… No 521 0.941 1.1 1 1
## 3 2 JobRole == 'Res… No 195 0.924 1.1 2 1
filter()
function.Let’s use a rule within a filter()
. Say, one need to pick a rule which has largest support
for predicted Attrition “Yes”.
# filter a rule with conditions
large_support_rule <- tr_att %>%
filter(RHS == "Yes") %>%
top_n(1, wt = support) %>%
pull(LHS)
# parseable rule
parseable_rule <- parse(text = large_support_rule)
# apply filter on data frame using parseable rule
attrition %>%
filter(eval(parseable_rule))
## # A tibble: 57 x 31
## Age Attrition BusinessTravel DailyRate Department DistanceFromHome
## <int> <fct> <fct> <int> <fct> <int>
## 1 41 Yes Travel_Rarely 1102 Sales 1
## 2 46 No Travel_Freque… 1211 Sales 5
## 3 48 Yes Travel_Rarely 626 Research_… 1
## 4 50 No Travel_Rarely 989 Research_… 7
## 5 34 Yes Travel_Freque… 658 Research_… 7
## 6 31 Yes Travel_Rarely 249 Sales 6
## 7 31 Yes Travel_Rarely 542 Sales 20
## 8 41 Yes Travel_Rarely 1356 Sales 20
## 9 37 No Travel_Rarely 290 Research_… 21
## 10 52 No Non-Travel 771 Sales 2
## # … with 47 more rows, and 25 more variables: Education <ord>,
## # EducationField <fct>, EnvironmentSatisfaction <ord>, Gender <fct>,
## # HourlyRate <int>, JobInvolvement <ord>, JobLevel <int>, JobRole <fct>,
## # JobSatisfaction <ord>, MaritalStatus <fct>, MonthlyIncome <int>,
## # MonthlyRate <int>, NumCompaniesWorked <int>, OverTime <fct>,
## # PercentSalaryHike <int>, PerformanceRating <ord>,
## # RelationshipSatisfaction <ord>, StockOptionLevel <int>,
## # TotalWorkingYears <int>, TrainingTimesLastYear <int>,
## # WorkLifeBalance <ord>, YearsAtCompany <int>, YearsInCurrentRole <int>,
## # YearsSinceLastPromotion <int>, YearsWithCurrManager <int>
tr_att_python <- tidyRules(c5_att, language = "python")
tr_att_sql <- tidyRules(c5_att, language = "sql")
head(tr_att_python$LHS)
## [1] "EnvironmentSatisfaction in ['Medium', 'High', 'Very_High'] and JobInvolvement in ['Medium', 'High', 'Very_High'] and OverTime == 'No' and TrainingTimesLastYear > 1 and WorkLifeBalance in ['Better', 'Best']"
## [2] "JobRole == 'Research_Scientist' and OverTime == 'No'"
## [3] "TotalWorkingYears > 2"
## [4] "JobLevel <= 1 and MonthlyIncome <= 2468 and OverTime == 'Yes' and TotalWorkingYears > 2 and YearsAtCompany <= 3"
## [5] "DailyRate <= 722 and JobLevel <= 1 and MonthlyIncome <= 2468 and OverTime == 'Yes' and TotalWorkingYears > 2"
## [6] "EnvironmentSatisfaction in ['Low', 'Medium'] and MaritalStatus in ['Divorced', 'Married'] and NumCompaniesWorked > 4 and OverTime == 'Yes' and PerformanceRating == 'Excellent' and RelationshipSatisfaction in ['Low', 'Medium', 'High']"
head(tr_att_sql$LHS)
## [1] "EnvironmentSatisfaction IN ('Medium', 'High', 'Very_High') AND JobInvolvement IN ('Medium', 'High', 'Very_High') AND OverTime = 'No' AND TrainingTimesLastYear > 1 AND WorkLifeBalance IN ('Better', 'Best')"
## [2] "JobRole = 'Research_Scientist' AND OverTime = 'No'"
## [3] "TotalWorkingYears > 2"
## [4] "JobLevel <= 1 AND MonthlyIncome <= 2468 AND OverTime = 'Yes' AND TotalWorkingYears > 2 AND YearsAtCompany <= 3"
## [5] "DailyRate <= 722 AND JobLevel <= 1 AND MonthlyIncome <= 2468 AND OverTime = 'Yes' AND TotalWorkingYears > 2"
## [6] "EnvironmentSatisfaction IN ('Low', 'Medium') AND MaritalStatus IN ('Divorced', 'Married') AND NumCompaniesWorked > 4 AND OverTime = 'Yes' AND PerformanceRating = 'Excellent' AND RelationshipSatisfaction IN ('Low', 'Medium', 'High')"
rpart
In this example we will be using BreastCancer
data from mlbench
package.
library("tidyrules")
library("dplyr")
library("rpart")
# BreastCancer
data(BreastCancer, package = "mlbench")
bc_train <- BreastCancer %>%
select(-Id) %>%
mutate_if(is.ordered, function(x) x <- factor(x,ordered = F))
rpart_bc <- rpart(Class ~ ., data = bc_train)
NOTE : Do not forget to convert all ordered
features to factor
type before training the model.
One could visualize rpart decision tree using prp
function from rpart.plot
package.
library("rpart.plot")
prp(rpart_bc)
The above tree visual is really nice to get a hang of how decision tree is splitting at each node. But, if you want to pick a terminal node it is really boring and hard since one has to enter the respective filter manually (imagine a situation if you have hundreds of features and a huge tree!!). To get-ride of this problem one could use tidyrules to make life easier.
Let’s extract rules from rpart
object and use those rules further more to extract terminal nodes.
# tidyrule extract
rules_bc <- tidyRules(rpart_bc)
rules_bc
## # A tibble: 7 x 6
## id LHS RHS support confidence lift
## <int> <chr> <chr> <int> <dbl> <dbl>
## 1 1 Cell.size %in% c('1', '2') & Normal.nuc… benign 421 0.986 1.50
## 2 2 Cell.size %in% c('1', '2') & Normal.nuc… malig… 8 0.8 2.32
## 3 3 Cell.size %in% c('3', '4', '5', '6', '7… benign 16 0.944 1.44
## 4 4 Cell.size %in% c('3', '4', '5', '6', '7… malig… 7 0.667 1.93
## 5 5 Cell.size %in% c('3', '4', '5', '6', '7… benign 15 0.647 0.988
## 6 6 Cell.size %in% c('3', '4', '5', '6', '7… malig… 61 0.841 2.44
## 7 7 Cell.size %in% c('3', '4', '5', '6', '7… malig… 171 0.971 2.82
# filter the data using a rule
bc_train %>%
filter(eval(parse(text = rules_bc[5,"LHS"]))) %>%
as_tibble()
## # A tibble: 15 x 10
## Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## <fct> <fct> <fct> <fct> <fct> <fct>
## 1 7 4 6 4 6 1
## 2 6 3 4 1 5 2
## 3 1 3 3 2 2 1
## 4 8 3 3 1 2 2
## 5 8 4 6 3 3 1
## 6 3 4 4 10 5 1
## 7 5 3 4 1 4 1
## 8 5 3 3 2 3 1
## 9 5 3 3 1 2 1
## 10 4 3 3 1 2 1
## 11 5 3 6 1 2 1
## 12 10 9 7 3 4 2
## 13 5 4 6 8 4 1
## 14 6 3 3 3 3 2
## 15 5 4 5 1 8 1
## # … with 4 more variables: Bl.cromatin <fct>, Normal.nucleoli <fct>,
## # Mitoses <fct>, Class <fct>
Cubist
In this example, rules extraction from a regression model (a cubist
model) has been illustrated below. We will be using AmesHousing
dataset for the example.
library("tidyrules")
library("dplyr")
library("Cubist")
# ames housing data set
ames <- AmesHousing::make_ames()
cubist_ames <- cubist(x = ames[, setdiff(colnames(ames), c("Sale_Price"))],
y = log10(ames[["Sale_Price"]]),
committees = 3
)
# rule extract
rules_ames <- tidyRules(cubist_ames)
rules_ames
## # A tibble: 43 x 9
## id LHS RHS support mean min max error committee
## <int> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 Overall_Qual … (-202.496569… 23 4.74 4.11 4.98 0.134 1
## 2 2 Overall_Qual … (-194.85918)… 125 4.94 4.54 5.17 0.0679 1
## 3 3 Overall_Qual … (12.372018) … 99 5.04 4.75 5.29 0.0649 1
## 4 4 Overall_Qual … (-47.97341) … 672 5.14 4.82 5.59 0.0338 1
## 5 5 MS_SubClass %… (-53.11988) … 358 5.17 4.79 5.38 0.0309 1
## 6 6 MS_SubClass %… (1.127729) +… 85 5.17 4.80 5.54 0.0632 1
## 7 7 MS_SubClass %… (10.129513) … 287 5.18 4.78 5.49 0.0388 1
## 8 8 MS_SubClass %… (5.675156) +… 748 5.21 4.80 5.59 0.0315 1
## 9 9 MS_SubClass %… (1.563904) +… 26 5.23 5 5.48 0.0748 1
## 10 10 Overall_Qual … (1.075936) +… 71 5.26 4.96 5.58 0.0681 1
## # … with 33 more rows
Notice that, for cubist
rules we have mean
, min
, max
and error
instead of confidence
and lift
. Here mean
, min
and max
are calculated based on predicted values with respect to a rule.