This document introduces the DriveML package and shows how it helps you build machine learning binary classification models effortlessly and in a short period of time.
DriveML provides a series of functions such as autoDataprep, autoMAR, and autoMLmodel. DriveML automates several complex machine learning tasks such as exploratory data analysis, data pre-processing, feature engineering, model training, model validation, model tuning, and model selection. The package automates these steps on any input dataset for a machine learning classification problem.
Additionally, we provide the SmartEDA function for exploratory data analysis, which generates an automated EDA report in HTML format to help you understand the distributions in the data. Please note that there are dependencies on other R packages such as mlr, caret, data.table, and ggplot2 for some specific tasks.
To summarize, the DriveML package helps you obtain a complete machine learning classification model just by running a function, instead of writing lengthy R code.
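The whole workflow can be sketched end to end. The snippet below is a minimal illustration using the heart dataset shipped with DriveML and mostly default arguments; the exact arguments of autoMLReport are an assumption here, so check its help page before running:

```r
library(DriveML)

# Example dataset shipped with DriveML
data(heart)

# 1. Automated data preparation (imputation, dummy variables, outlier flags)
prep <- autoDataprep(data = heart, target = "target_var")

# 2. Automated model training, tuning and validation
fit <- autoMLmodel(train = heart, test = NULL,
                   target = "target_var", testSplit = 0.2)

# 3. Automated HTML report of the model outcome
# (argument names are illustrative; see ?autoMLReport)
autoMLReport(fit)
```

The three calls correspond to the three unique DriveML functions described below.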
Algorithm: Missing at random features
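Conceptually, missing-at-random features are detected by asking whether the missingness of a variable can be predicted from the other variables. The sketch below is illustrative only (the helper name and flag naming are assumptions, not DriveML internals, and it assumes the other columns are complete): for each variable with missing values, build a 0/1 missingness indicator, fit a GLM predicting it from the remaining features, and keep the indicator as a new feature when the model's AUC exceeds the aucv cutoff.

```r
# Illustrative sketch of MAR-flag construction (not DriveML's internal code)
mar_flags <- function(df, aucv = 0.9) {
  flags <- list()
  for (v in names(df)) {
    if (!anyNA(df[[v]])) next
    y <- as.integer(is.na(df[[v]]))        # missingness indicator
    x <- df[setdiff(names(df), v)]         # assumed complete for this sketch
    fit <- glm(y ~ ., data = cbind(y = y, x), family = binomial)
    p <- fitted(fit)
    # AUC via the rank (Mann-Whitney) statistic
    auc <- (mean(rank(p)[y == 1]) - (sum(y) + 1) / 2) / sum(y == 0)
    if (!is.na(auc) && auc > aucv) flags[[paste0("Y_", v)]] <- y
  }
  as.data.frame(flags)
}
```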
The DriveML R package has three unique functions:
+ autoDataprep - function to generate novel features based on a functional understanding of the dataset
+ autoMLmodel - function to develop baseline machine learning models using regression and tree-based classification techniques
+ autoMLReport - function to print the machine learning model outcome in HTML format
Example dataset: UCI Heart Disease. This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.
Data Source https://archive.ics.uci.edu/ml/datasets/Heart+Disease
Install the package “DriveML” to get the example data set.
library("DriveML")
library("SmartEDA")
## Load sample dataset from the DriveML package
data(heart)
More detailed attribute information is available on the DriveML help page.
For exploratory data analysis we use the SmartEDA package.
Understanding the dimensions of the dataset, variable names, overall missing summary, and data types of each variable
# Overview of the data - Type = 1
ExpData(data=heart,type=1)
# Structure of the data - Type = 2
ExpData(data=heart,type=2)
Descriptions | Value |
---|---|
Sample size (nrow) | 303 |
No. of variables (ncol) | 14 |
No. of numeric/integer variables | 14 |
No. of factor variables | 0 |
No. of text variables | 0 |
No. of logical variables | 0 |
No. of identifier variables | 0 |
No. of date variables | 0 |
No. of zero variance variables (uniform) | 0 |
%. of variables having complete cases | 100% (14) |
%. of variables having >0% and <50% missing cases | 0% (0) |
%. of variables having >=50% and <90% missing cases | 0% (0) |
%. of variables having >=90% missing cases | 0% (0) |
Index | Variable_Name | Variable_Type | Sample_n | Missing_Count | Per_of_Missing | No_of_distinct_values |
---|---|---|---|---|---|---|
1 | age | integer | 303 | 0 | 0 | 41 |
2 | sex | integer | 303 | 0 | 0 | 2 |
3 | cp | integer | 303 | 0 | 0 | 4 |
4 | trestbps | integer | 303 | 0 | 0 | 49 |
5 | chol | integer | 303 | 0 | 0 | 152 |
6 | fbs | integer | 303 | 0 | 0 | 2 |
7 | restecg | integer | 303 | 0 | 0 | 3 |
8 | thalach | integer | 303 | 0 | 0 | 91 |
9 | exang | integer | 303 | 0 | 0 | 2 |
10 | oldpeak | numeric | 303 | 0 | 0 | 40 |
11 | slope | integer | 303 | 0 | 0 | 3 |
12 | ca | integer | 303 | 0 | 0 | 5 |
13 | thal | integer | 303 | 0 | 0 | 4 |
14 | target_var | integer | 303 | 0 | 0 | 2 |
ExpNumStat(heart,by="GA",gp="target_var",Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2)
Box plots of all numerical variables versus the categorical dependent variable, i.e. a bivariate comparison of each numerical attribute by each class of the target variable
plot4 <- ExpNumViz(heart, target = "target_var", type = 1, nlim = 3, fname = NULL, Page = c(2, 2), sample = 8)
plot4[[1]]
Cross tabulation with target_var variable
Custom tables between all categorical independent variables and the target variable
ExpCTable(heart, Target = "target_var", margin = 1, clim = 10, nlim = 3, round = 2, bin = NULL, per = F)
VARIABLE | CATEGORY | target_var:0 | target_var:1 | TOTAL |
---|---|---|---|---|
sex | 0 | 24 | 72 | 96 |
sex | 1 | 114 | 93 | 207 |
sex | TOTAL | 138 | 165 | 303 |
fbs | 0 | 116 | 142 | 258 |
fbs | 1 | 22 | 23 | 45 |
fbs | TOTAL | 138 | 165 | 303 |
restecg | 0 | 79 | 68 | 147 |
restecg | 1 | 56 | 96 | 152 |
restecg | 2 | 3 | 1 | 4 |
restecg | TOTAL | 138 | 165 | 303 |
exang | 0 | 62 | 142 | 204 |
exang | 1 | 76 | 23 | 99 |
exang | TOTAL | 138 | 165 | 303 |
slope | 0 | 12 | 9 | 21 |
slope | 1 | 91 | 49 | 140 |
slope | 2 | 35 | 107 | 142 |
slope | TOTAL | 138 | 165 | 303 |
target_var | 0 | 138 | 0 | 138 |
target_var | 1 | 0 | 165 | 165 |
target_var | TOTAL | 138 | 165 | 303 |
Stacked bar plot with vertical or horizontal bars for all categorical variables
plot5 <- ExpCatViz(heart, target = "target_var", fname = NULL, clim = 5, col = c("slateblue4", "slateblue1"), margin = 2, Page = c(2, 1), sample = 2)
plot5[[1]]
ExpOutliers(heart, varlist = c("oldpeak","trestbps","chol"), method = "boxplot", treatment = "mean", capping = c(0.1, 0.9))
Category | oldpeak | trestbps | chol |
---|---|---|---|
Lower cap : 0.1 | 0 | 110 | 188 |
Upper cap : 0.9 | 2.8 | 152 | 308.8 |
Lower bound | -2.4 | 90 | 115.75 |
Upper bound | 4 | 170 | 369.75 |
Num of outliers | 5 | 9 | 5 |
Lower outlier case | |||
Upper outlier case | 102,205,222,251,292 | 9,102,111,204,224,242,249,261,267 | 29,86,97,221,247 |
Mean before | 1.04 | 131.62 | 246.26 |
Mean after | 0.97 | 130.1 | 243.04 |
Median before | 0.8 | 130 | 240 |
Median after | 0.65 | 130 | 240 |
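The lower and upper bounds reported above follow the standard boxplot rule (Q1 - 1.5*IQR and Q3 + 1.5*IQR). As a quick illustration (this snippet is not part of DriveML, and assumes ExpOutliers uses the same quantile definition as base R), the bounds for chol can be checked by hand:

```r
# Reproduce the boxplot outlier bounds for one variable
q <- quantile(heart$chol, c(0.25, 0.75))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr   # cf. 115.75 in the table above
upper <- q[2] + 1.5 * iqr   # cf. 369.75 in the table above
sum(heart$chol < lower | heart$chol > upper)   # count of outlier cases
```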
autoDataprep
Data preparation using DriveML autoDataprep function with default options
dateprep <- autoDataprep(data = heart,
                         target = 'target_var',
                         missimpute = 'default',
                         auto_mar = FALSE,
                         mar_object = NULL,
                         dummyvar = TRUE,
                         char_var_limit = 15,
                         aucv = 0.002,
                         corr = 0.98,
                         outlier_flag = TRUE,
                         uid = NULL,
                         onlykeep = NULL,
                         drop = NULL)
train_data <- dateprep$master_data
We can specify different types of missing-value imputation using the mlr::impute function:
myimpute <- list(classes = list(factor = imputeMode(),
                                integer = imputeMean(),
                                numeric = imputeMedian(),
                                character = imputeMode()))
dateprep <- autoDataprep(data = heart,
                         target = 'target_var',
                         missimpute = myimpute,
                         auto_mar = FALSE,
                         mar_object = NULL,
                         dummyvar = TRUE,
                         char_var_limit = 15,
                         aucv = 0.002,
                         corr = 0.98,
                         outlier_flag = TRUE,
                         uid = NULL,
                         onlykeep = NULL,
                         drop = NULL)
train_data <- dateprep$master_data
Adding Missing at Random features using the autoMAR function
marobj <- autoMAR(heart, aucv = 0.9, strataname = NULL, stratasize = NULL, mar_method = "glm")
## less than or equal to one missing value coloumn found in the dataframe
dateprep <- autoDataprep(data = heart,
                         target = 'target_var',
                         missimpute = myimpute,
                         auto_mar = TRUE,
                         mar_object = marobj,
                         dummyvar = TRUE,
                         char_var_limit = 15,
                         aucv = 0.002,
                         corr = 0.98,
                         outlier_flag = TRUE,
                         uid = NULL,
                         onlykeep = NULL,
                         drop = NULL)
train_data <- dateprep$master_data
autoMLmodel
Automated training, tuning, and validation of machine learning models. This function includes the following binary classification techniques:
+ Logistic regression - logreg
+ Regularised regression - glmnet
+ Extreme gradient boosting - xgboost
+ Random forest - randomForest
+ Random forest - ranger
+ Decision tree - rpart
mymodel <- autoMLmodel(train = heart,
                       test = NULL,
                       target = 'target_var',
                       testSplit = 0.2,
                       tuneIters = 100,
                       tuneType = "random",
                       models = "all",
                       varImp = 10,
                       liftGroup = 50,
                       maxObs = 4000,
                       uid = NULL,
                       htmlreport = FALSE,
                       seed = 1991)
Model performance
Model | Fitting time | Scoring time | Train AUC | Test AUC | Accuracy | Precision | Recall | F1_score |
---|---|---|---|---|---|---|---|---|
glmnet | 2.165 secs | 0.007 secs | 0.928 | 0.908 | 0.820 | 0.824 | 0.848 | 0.836 |
logreg | 2.011 secs | 0.004 secs | 0.929 | 0.906 | 0.820 | 0.824 | 0.848 | 0.836 |
randomForest | 2.257 secs | 0.011 secs | 1.000 | 0.874 | 0.770 | 0.771 | 0.818 | 0.794 |
ranger | 2.312 secs | 0.046 secs | 1.000 | 0.896 | 0.787 | 0.778 | 0.848 | 0.812 |
xgboost | 2.927 secs | 0.005 secs | 1.000 | 0.874 | 0.770 | 0.757 | 0.848 | 0.800 |
rpart | 1.922 secs | 0.004 secs | 0.927 | 0.814 | 0.738 | 0.730 | 0.818 | 0.771 |
Random forest model: Receiver Operating Characteristic (ROC) curves and variable importance
Training dataset ROC
TrainROC <- mymodel$trainedModels$randomForest$modelPlots$TrainROC
TrainROC
Test dataset ROC
TestROC <- mymodel$trainedModels$randomForest$modelPlots$TestROC
TestROC
Variable importance
VarImp <- mymodel$trainedModels$randomForest$modelPlots$VarImp
VarImp
## [[1]]
Threshold
Threshold <- mymodel$trainedModels$randomForest$modelPlots$Threshold
Threshold