The creditmodel package provides an efficient suite of R tools for credit modeling, analysis, and visualization. It contains infrastructure functionality for data exploration and preparation, missing-value treatment, outlier treatment, variable derivation, variable selection, dimensionality reduction, grid search for hyperparameters, data mining and visualization, model evaluation, and strategy analysis. With creditmodel, reliable predictive models (such as xgboost or a scorecard) and data analyses can be built on a standard laptop within minutes. This introductory vignette takes a brief look at the training_model module of the package.
When I first wrote the creditmodel package, its primary purpose was to provide a tool that makes developing binary classification models (machine-learning models as well as credit scorecards) simpler and faster. I therefore wrote the package to build models automatically. However, as the package grew in functionality, this choice became increasingly problematic.
Importantly, the creditmodel package now provides a set of complementary tools with different missions.
Now, let's start with quick modeling.
## -- Building --------------------------------------------------- UCICreditCard --
## -- Creating the model output file path -----------------------------------------
## -- Seting model output file path:
## * model : C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/model
## * data : C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/data
## * variable : C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/variable
## * performance: C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/performance
## * predict : C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/predict
## -- Checking datasets and target ------------------------------------------------
## -- Cleansing & Prepocessing data -----------------------------------------------
## -- Cleansing data
## -- Checking data and target format...
## -- Replacing null or blank or miss_values with NA
## -- Formating time variables
## -- Deleting low variance variables
## -- Processing NAs & special value rate is more than 0.98
## -- Transfering character variables which are actually numerical to numeric
## -- Removing duplicated observations
## -- Merging categories...
## -- Saving data_cleansing to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/data/data_cleansing.csv
## -- Logarithmic transformation
## -- Following variables are log transformed:
## * LIMIT_BAL -> LIMIT_BAL_log
## * PAY_0 -> PAY_0_log
## * PAY_2 -> PAY_2_log
## * PAY_AMT1 -> PAY_AMT1_log
## * PAY_AMT2 -> PAY_AMT2_log
## * PAY_AMT3 -> PAY_AMT3_log
## * PAY_AMT4 -> PAY_AMT4_log
## * PAY_AMT5 -> PAY_AMT5_log
## * PAY_AMT6 -> PAY_AMT6_log
## -- Spliting train & test -------------------------------------------------------
## -- train_test_split:
## * Total: 30000 (100%)
## * Train: 20874 (70%)
## * Test : 9126 (30%)
## -- Processing outliers using Kmeans and LOF
## * LIMIT_BAL_log 0% no_outlier
## * AGE 0% no_outlier
## * PAY_0_log 0% no_outlier
## * PAY_2_log 0% no_outlier
## * PAY_3 0% no_outlier
## * PAY_4 0% no_outlier
## * PAY_5 0% no_outlier
## * PAY_6 0% no_outlier
## * BILL_AMT1 0% no_outlier
## * BILL_AMT2 0% no_outlier
## * BILL_AMT3 0% no_outlier
## * BILL_AMT4 0% no_outlier
## * BILL_AMT5 0% no_outlier
## * BILL_AMT6 0% no_outlier
## * PAY_AMT1_log 0% no_outlier
## * PAY_AMT2_log 0% no_outlier
## * PAY_AMT3_log 0% no_outlier
## * PAY_AMT4_log 0% no_outlier
## * PAY_AMT5_log 0% no_outlier
## * PAY_AMT6_log 0% no_outlier
## -- Saving data_outlier_proc to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/data/data_outlier_proc.csv
## -- Filtering features ----------------------------------------------------------
## -- Feature filtering by IV
## -- Feature filtering by Correlation
## -- Saving feature_filter to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/variable/feature_filter.csv
## -- Saving feature_filter_table to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/variable/feature_filter_table.csv
## -- Training logistic regression model/scorecard --------------------------------
## -- Searching optimal binning & feature selection parameters --------------------
## [1] train_ks:0.4133 test_ks:0.3861 psi:0
## * tree_control:{ p:0.02, cp:0.00000001, xval:5, maxdepth:10 }
## * bins_control:{ bins_num:10, bins_pct:0.03, b_chi:0.01, b_odds:0.1, b_psi:0.02, b_or:0.05, mono:0.4, odds_psi:0.2, kc:1 }
## * thresholds:{ cor_p:0.8, iv_i:0.02, psi_i:0.1, cos_i:0.5 }
## [2] train_ks:0.4239 test_ks:0.3898 psi:0
## * tree_control:{ p:0.02, cp:0.00001, xval:5, maxdepth:15 }
## * bins_control:{ bins_num:10, bins_pct:0.03, b_chi:0.01, b_odds:0.1, b_psi:0.06, b_or:0.2, mono:0.5, odds_psi:0.1, kc:1 }
## * thresholds:{ cor_p:0.8, iv_i:0.02, psi_i:0.1, cos_i:0.5 }
## -- [best iter] -----------------------------------------------------------------
## [2] train_ks:0.4239 test_ks:0.3898 psi:0
## * tree_control:{ p:0.02, cp:0.00001, xval:5, maxdepth:15 }
## * bins_control:{ bins_num:10, bins_pct:0.03, b_chi:0.01, b_odds:0.1, b_psi:0.06, b_or:0.2, mono:0.5, odds_psi:0.1, kc:1 }
## * thresholds:{ cor_p:0.8, iv_i:0.02, psi_i:0.1, cos_i:0.5 }
## -- Constrained optimal binning of varibles -------------------------------------
## -- Getting optimal binning breaks
## * PAY_0_log: -0.5,0.346573590279972,Inf
## * PAY_2_log: -0.5,0.346573590279972,Inf
## * PAY_3: -1,1,Inf
## * PAY_4: -1,0,Inf
## * PAY_5: -1,1,Inf
## * PAY_AMT1_log: 3.06778244554087,7.38056741199529,7.49526408891739,7.60115239706291,8.49719452409339,Inf
## * PAY_AMT2_log: 4.87899962289376,8.51228114200585,8.69959807459985,Inf
## -- Saving breaks_list.breaks_list to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/variable/LR/breaks_list.breaks_list.csv
## -- Filtering variables by IV & PSI ---------------------------------------------
## -- Selecting variables by PSI & IV
## -- Calculating PSI
## --PAY_0_log
## * PSI: 0 --> Very stable
## --PAY_2_log
## * PSI: 0 --> Very stable
## --PAY_3
## * PSI: 0 --> Very stable
## --PAY_4
## * PSI: 0 --> Very stable
## --PAY_5
## * PSI: 0 --> Very stable
## --PAY_AMT1_log
## * PSI: 0 --> Very stable
## --PAY_AMT2_log
## * PSI: 0 --> Very stable
## -- Calculating IV
## --PAY_0_log
## * IV: 0.692 --> Very Strong
## --PAY_2_log
## * IV: 0.538 --> Very Strong
## --PAY_3
## * IV: 0.405 --> Very Strong
## --PAY_4
## * IV: 0.352 --> Very Strong
## --PAY_5
## * IV: 0.314 --> Very Strong
## --PAY_AMT1_log
## * IV: 0.184 --> Strong
## --PAY_AMT2_log
## * IV: 0.148 --> Strong
## -- Saving feature.IV_PSI to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/variable/LR/feature.IV_PSI.csv
## -- Saving feature.PSI to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/variable/LR/feature.PSI.csv
## -- Saving feature.IV to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/variable/LR/feature.IV.csv
## -- Saving LR.IV_PSI_features to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/variable/LR/LR.IV_PSI_features.csv
## -- Transforming WOE ------------------------------------------------------------
## -- Transforming variables to woe
## -- Saving lr_train.dat.woe to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/data/LR/lr_train.dat.woe.csv
## -- Filtering variables by correlation ------------------------------------------
## -- Processing bins table
## * PAY_0_log IV: 0.692 PSI: 0
## * PAY_2_log IV: 0.537 PSI: 0
## * PAY_3 IV: 0.406 PSI: 0
## * PAY_4 IV: 0.352 PSI: 0
## * PAY_5 IV: 0.314 PSI: 0
## * PAY_AMT1_log IV: 0.184 PSI: 0
## * PAY_AMT2_log IV: 0.148 PSI: 0
## -- Filtering variables by LASSO ------------------------------------------------
## Saving 8 x 5 in image
## -- Saving lr_premodel_features to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/variable/LR/lr_premodel_features.csv
## -- Start training lr model -----------------------------------------------------
## -- Saving lr_model_features to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/variable/LR/lr_model_features.csv
## -- Saving UCICreditCard.lr_coef to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/performance/LR/UCICreditCard.lr_coef.csv
## -- Generating standard socrecard -----------------------------------------------
## -- Using scorecard to predict the train and test
## -- Saving lr_train_score to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/predict/LR/lr_train_score.csv
## -- Saving lr_test_score to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/predict/LR/lr_test_score.csv
## -- Saving lr_train_prob to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/predict/LR/lr_train_prob.csv
## -- Saving lr_test_prob to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/predict/LR/lr_test_prob.csv
## -- Producing plots that characterize performance of scorecard
## Saving 12 x 5 in image
## -- Saving UCICreditCard.LR.performance_table to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/performance/LR/UCICreditCard.LR.performance_table.csv
## -- Saving LR.params to:
## * C:\Users\HANSEN\AppData\Local\Temp\Rtmp02ToD9/UCICreditCard/performance/LR/LR.params.csv
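The grid-search lines in the log above report `train_ks`, `test_ks` and `psi` for each parameter set. As a rough illustration of what those figures measure, here is a minimal base-R sketch. `ks_stat()` and `psi_stat()` are hypothetical helpers written for this example, not functions from creditmodel, and the package's own calculation may differ in details such as binning and tie handling. The scores and labels are made-up toy values.

```r
# Hypothetical helper (not part of creditmodel).
# KS: largest gap between the cumulative score distributions of the
# positive class (label 1) and the negative class (label 0).
ks_stat <- function(score, label) {
  ord   <- order(score)
  score <- score[ord]
  label <- label[ord]
  cum_pos <- cumsum(label == 1) / sum(label == 1)
  cum_neg <- cumsum(label == 0) / sum(label == 0)
  # Evaluate the gap only at the last row of each distinct score value,
  # so tied scores are treated as a single cut-off.
  last_of_tie <- !duplicated(score, fromLast = TRUE)
  max(abs(cum_pos - cum_neg)[last_of_tie])
}

# Hypothetical helper (not part of creditmodel).
# PSI: compares the binned distribution of a variable (or score) in an
# "expected" sample, e.g. train, with an "actual" sample, e.g. test.
psi_stat <- function(expected, actual, breaks, eps = 1e-6) {
  e <- as.numeric(table(cut(expected, breaks))) / length(expected)
  a <- as.numeric(table(cut(actual, breaks))) / length(actual)
  # Floor empty bins at a small value to avoid log(0) and division by zero.
  e <- pmax(e, eps)
  a <- pmax(a, eps)
  sum((a - e) * log(a / e))
}

# Toy data, invented for illustration only.
train_score <- c(0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90)
train_label <- c(0, 0, 0, 1, 0, 1, 1, 1)
test_score  <- c(0.15, 0.30, 0.45, 0.55, 0.65, 0.85, 0.95, 0.99)

breaks <- c(-Inf, 0.25, 0.50, 0.75, Inf)

ks_train   <- ks_stat(train_score, train_label)          # 0.75
psi_same   <- psi_stat(train_score, train_score, breaks) # 0: identical samples
psi_shift  <- psi_stat(train_score, test_score, breaks)  # > 0: distributions differ

print(c(ks = ks_train, psi_same = psi_same, psi_shift = psi_shift))
```

A PSI of 0, as printed throughout the log, means the binned distributions of the two samples being compared are identical at the reported precision.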
Within a few minutes, the program completed data cleaning and preprocessing, variable screening, and the development and evaluation of four models: scorecard, XGBoost, GBDT, and random forest.
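The log also reports an IV (information value) for each binned variable and then transforms the variables to WOE (weight of evidence). The sketch below shows one common way to compute both from a set of bin breaks, using base R only. `woe_iv()` is a hypothetical helper, not creditmodel's implementation; it assumes right-closed bins and the convention WOE = ln(share of bads / share of goods), whereas some tools use the opposite sign. The breaks mirror the shape of the `PAY_3: -1,1,Inf` entry in the log, but the data are invented.

```r
# Hypothetical helper (not part of creditmodel): WOE and IV per bin.
# x: numeric variable, y: 0/1 target (1 = bad), breaks: bin boundaries.
woe_iv <- function(x, y, breaks, eps = 1e-6) {
  bins <- cut(x, breaks = breaks)                    # right-closed bins (a, b]
  tab  <- table(bins, factor(y, levels = c(0, 1)))   # counts per bin and class
  # Share of all goods / all bads that fall into each bin; empty cells are
  # floored at eps so the logarithm stays finite.
  dist_good <- pmax(tab[, "0"] / sum(tab[, "0"]), eps)
  dist_bad  <- pmax(tab[, "1"] / sum(tab[, "1"]), eps)
  woe <- log(dist_bad / dist_good)
  data.frame(
    bin  = rownames(tab),
    good = as.vector(tab[, "0"]),
    bad  = as.vector(tab[, "1"]),
    woe  = as.vector(woe),
    iv   = as.vector((dist_bad - dist_good) * woe)   # per-bin IV contribution
  )
}

# Toy data, invented for illustration only.
x <- c(-2, -1, -1, 0, 0, 1, 2, 2, 3, 2)
y <- c( 1,  0,  0, 0, 1, 0, 1, 1, 1, 0)

res <- woe_iv(x, y, breaks = c(-Inf, -1, 1, Inf))
print(res)

total_iv <- sum(res$iv)   # the variable's IV is the sum over its bins
print(total_iv)
```

The total IV does not depend on the sign convention chosen for WOE, because flipping the sign of the logarithm also flips the sign of the difference it is multiplied by.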