funModeling quick-start {#quick_start}

This package contains a set of functions related to exploratory data analysis, data preparation, and model performance. It is used by people coming from business, research, and teaching (professors and students).

funModeling is closely tied to the Data Science Live Book (open source, 2017): most of its functionality is used to explain topics covered in the book.

Opening the black-box

Some functions have inline comments so that users can open the black box, learn how each function was developed, and tune or improve it.

All functions are documented, with every parameter explained through short examples. The R documentation can be accessed with: help("name_of_the_function").

About this quick-start

This quick-start is focused only on the functions. All explanations around them, and the how and when to use them, can be accessed by following the “Read more here.” links below each section, which redirect you to the book.

Most of the funModeling functions are listed below, divided by category.

Exploratory data analysis

`status`: Dataset health status (2nd version)

Similar to df_status, but it returns all percentages in the 0 to 1 range (not 0 to 100).

library(funModeling)

status(heart_disease)

##                  variable q_zeros   p_zeros q_na       p_na q_inf p_inf    type unique
## 1                     age       0 0.0000000    0 0.00000000     0     0 integer     41
## 2                  gender       0 0.0000000    0 0.00000000     0     0  factor      2
## 3              chest_pain       0 0.0000000    0 0.00000000     0     0  factor      4
## 4  resting_blood_pressure       0 0.0000000    0 0.00000000     0     0 integer     50
## 5       serum_cholestoral       0 0.0000000    0 0.00000000     0     0 integer    152
## 6     fasting_blood_sugar     258 0.8514851    0 0.00000000     0     0  factor      2
## 7         resting_electro     151 0.4983498    0 0.00000000     0     0  factor      3
## 8          max_heart_rate       0 0.0000000    0 0.00000000     0     0 integer     91
## 9             exer_angina     204 0.6732673    0 0.00000000     0     0 integer      2
## 10                oldpeak      99 0.3267327    0 0.00000000     0     0 numeric     40
## 11                  slope       0 0.0000000    0 0.00000000     0     0 integer      3
## 12      num_vessels_flour     176 0.5808581    4 0.01320132     0     0 integer      4
## 13                   thal       0 0.0000000    2 0.00660066     0     0  factor      3
## 14 heart_disease_severity     164 0.5412541    0 0.00000000     0     0 integer      5
## 15           exter_angina     204 0.6732673    0 0.00000000     0     0  factor      2
## 16      has_heart_disease       0 0.0000000    0 0.00000000     0     0  factor      2

Note: df_status will be deprecated; please use status instead.
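To make the q_ and p_ columns concrete, here is a minimal base R sketch of how such counts and 0-to-1 proportions could be computed on a toy data frame. The helper `status_sketch` and the data frame `toy` are hypothetical names introduced only for illustration; this is not funModeling's implementation, and it leaves out the q_inf and p_inf columns.

```r
# Hypothetical stand-in for illustration only; NOT funModeling's implementation.
status_sketch <- function(df) {
  n <- nrow(df)
  # Count zeros and NAs per column (factors are compared by their level labels)
  q_zeros <- vapply(df, function(x) sum(x == 0, na.rm = TRUE), numeric(1),
                    USE.NAMES = FALSE)
  q_na <- vapply(df, function(x) sum(is.na(x)), numeric(1), USE.NAMES = FALSE)
  data.frame(
    variable = names(df),
    q_zeros = q_zeros,
    p_zeros = q_zeros / n,  # proportion in the 0 to 1 range
    q_na = q_na,
    p_na = q_na / n,        # proportion in the 0 to 1 range
    type = vapply(df, function(x) class(x)[1], character(1), USE.NAMES = FALSE),
    unique = vapply(df, function(x) length(unique(x[!is.na(x)])), numeric(1),
                    USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Toy data (assumed for this example)
toy <- data.frame(
  a = c(0, 0, 1, NA),
  b = c("x", "y", "y", "x"),
  stringsAsFactors = TRUE
)

st <- status_sketch(toy)

# Typical follow-up: keep the names of variables that contain any NA
vars_with_na <- st$variable[st$p_na > 0]
```

The same filtering idiom should apply to the data frame returned by status, since it exposes the variable and p_na columns shown in the output above.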

`data_integrity`: Dataset integrity summary by variable type

A handy function that returns vectors of variable names, so you can quickly filter variables with NA values, categorical variables (factor / character), numerical variables, and other types (boolean, date, posix).

It also returns a vector of the variables that have high cardinality.

It returns an 'integrity' object with two parts: 'status_now' (which comes from the status function) and a 'results' list, which contains elements such as vars_cat, vars_num, and vars_num_with_NA. Explore the object for more.

library(funModeling)

di <- data_integrity(heart_disease)

# returns a summary
summary(di)

## 
## ◌ {Numerical with NA} num_vessels_flour
## ◌ {Categorical with NA} thal

# print all the metadata information
print(di)

## $vars_num_with_NA
##            variable q_na       p_na
## 1 num_vessels_flour    4 0.01320132
## 
## $vars_cat_with_NA
##   variable q_na       p_na
## 1     thal    2 0.00660066
## 
## $vars_cat_high_card
## [1] variable unique  
## <0 rows> (or 0-length row.names)
## 
## $MAX_UNIQUE
## [1] 35
## 
## $vars_one_value
## character(0)
## 
## $vars_cat
## [1] "gender"              "chest_pain"          "fasting_blood_sugar"
## [4] "resting_electro"     "thal"                "exter_angina"       
## [7] "has_heart_disease"  
## 
## $vars_num
## [1] "age"                    "resting_blood_pressure" "serum_cholestoral"     
## [4] "max_heart_rate"         "exer_angina"            "oldpeak"               
## [7] "slope"                  "num_vessels_flour"      "heart_disease_severity"
## 
## $vars_char
## character(0)
## 
## $vars_factor
## [1] "gender"              "chest_pain"          "fasting_blood_sugar"
## [4] "resting_electro"     "thal"                "exter_angina"       
## [7] "has_heart_disease"  
## 
## $vars_other
## character(0)
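The grouping shown in the output above can be approximated in a few lines of base R. The sketch below is a simplified illustration, not funModeling's implementation: the helper `integrity_sketch`, its `max_unique` argument, and the data frame `toy` are hypothetical. It also assumes that the MAX_UNIQUE value is the high-cardinality threshold, and it returns plain name vectors where data_integrity returns small data frames for the NA and high-cardinality elements.

```r
# Hypothetical helper for illustration only; NOT funModeling's implementation.
integrity_sketch <- function(df, max_unique = 35) {
  is_num <- vapply(df, is.numeric, logical(1))
  is_cat <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  has_na <- vapply(df, anyNA, logical(1))
  n_unique <- vapply(df, function(x) length(unique(x[!is.na(x)])), numeric(1))
  list(
    vars_num = names(df)[is_num],
    vars_cat = names(df)[is_cat],
    vars_num_with_NA = names(df)[is_num & has_na],
    vars_cat_with_NA = names(df)[is_cat & has_na],
    # Assumption: a categorical variable is "high cardinality" above max_unique
    vars_cat_high_card = names(df)[is_cat & n_unique > max_unique],
    vars_one_value = names(df)[n_unique == 1]
  )
}

# Toy data (assumed for this example)
toy <- data.frame(
  age = c(63, 67, NA, 41),
  thal = factor(c("3", NA, "7", "3")),
  id = c("p1", "p2", "p3", "p4"),
  stringsAsFactors = FALSE
)

res <- integrity_sketch(toy, max_unique = 3)

# Use the name vectors to subset the data by variable type
numeric_only <- toy[, res$vars_num, drop = FALSE]
```

Name vectors like these are convenient for selecting columns before a type-specific step, for example imputing only the numerical variables that contain NA.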

`plot_num`: Plotting distributions for numerical variables

Plots only numeric variables.

plot_num(heart_disease)

Notes:

bins: sets the number of bins (10 by default).
path_out: the output directory; if it has a value, the plot is exported as a JPEG file. To save in the current directory, set path_out to a dot: "."
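As a short usage sketch of the two parameters described in the notes, the calls below change the number of bins and export the plot. The choice of 20 bins and the use of tempdir() as the output directory are assumptions made for this example; the code runs only if funModeling is installed.

```r
# Output directory for the exported plot; tempdir() is an arbitrary choice here
out_dir <- tempdir()

# Guarded so the example is skipped when funModeling is not installed
if (requireNamespace("funModeling", quietly = TRUE)) {
  library(funModeling)

  # Finer histograms: 20 bins instead of the default 10
  plot_num(heart_disease, bins = 20)

  # Export the plot as a JPEG into out_dir
  plot_num(heart_disease, path_out = out_dir)
}
```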
