First, a word of caution: the examples in this section are meant only to illustrate what the functions do, not to present the best possible models. For a real use case, please perform the necessary model checks and post-hoc analyses, and choose predictor variables and model types based on domain knowledge.
With this in mind, let us look at how we can perform modeling tasks using manymodelr.
multi_model_1
This is one of the core functions of the package. multi_model_1 aims to allow model fitting, prediction, and reporting with a single function call. The multi part of the name reflects the fact that several model types can be fitted at once. An example follows.
For the purposes of this report, we first create a simple dataset.
library(manymodelr)
set.seed(520)
# Create a simple dataset with a binary target.
# "normal" is a fictional target that we assume indicates whether some criterion is met.
# Note that data.frame() recycles the shorter numeric vectors (length 100)
# to match the 1000 values in "normal".
sample_data <- data.frame(normal = as.factor(rep(c("Yes", "No"), 500)),
                          height = rnorm(100, mean = 0.5, sd = 0.2),
                          weight = runif(100, 0, 0.6),
                          yield = rnorm(100, mean = 520, sd = 10))
head(sample_data)
#> normal height weight yield
#> 1 Yes 0.2849090 0.13442312 520.2837
#> 2 No 0.2427826 0.37484971 504.4754
#> 3 Yes 0.2579432 0.47134828 515.6463
#> 4 No 0.5175604 0.50143592 522.2247
#> 5 Yes 0.4026023 0.47171755 502.6406
#> 6 No 0.9789886 0.04191937 509.4663
set.seed(520)
# createDataPartition() and trainControl() come from the caret package;
# attach it with library(caret) if they are not available after loading manymodelr
train_set <- createDataPartition(sample_data$normal, p = 0.6, list = FALSE)
valid_set <- sample_data[-train_set, ]
train_set <- sample_data[train_set, ]
ctrl <- trainControl(method = "cv", number = 5)
# Fit k-nearest neighbours and a decision tree on the training set,
# score on accuracy, and predict on the held-out validation set
m <- multi_model_1(train_set, "normal", ".", c("knn", "rpart"),
                   "Accuracy", ctrl, new_data = valid_set)
The above returns a list containing metrics, predictions, and a model summary. These can be extracted as shown below.
m$metric
#> # A tibble: 1 x 2
#> knn_accuracy rpart_accuracy
#> <dbl> <dbl>
#> 1 0.872 0.68
head(m$predictions)
#> # A tibble: 6 x 2
#> knn rpart
#> <chr> <chr>
#> 1 Yes Yes
#> 2 No Yes
#> 3 No No
#> 4 No Yes
#> 5 No No
#> 6 Yes Yes
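Since new_data was set to valid_set, we can also compute the share of correct predictions ourselves. This is a minimal sanity check, assuming the rows of m$predictions are in the same order as valid_set:
# Proportion of validation rows classified correctly by each model
mean(as.character(m$predictions$knn) == as.character(valid_set$normal))
mean(as.character(m$predictions$rpart) == as.character(valid_set$normal))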
multi_model_2
This is similar to multi_model_1, with one difference: it does not report metrics such as RMSE or accuracy. It is useful if one would like to fit and predict with "simpler" models such as linear or generalized linear models. Let's take a look:
# Fit a linear model of mpg on wt and get predictions
lin_model <- multi_model_2(mtcars[1:16, ], mtcars[17:32, ], "mpg", "wt", "lm")
lin_model[, c("predicted", "mpg")]
#> predicted mpg
#> Mazda RX4 10.17314 21.0
#> Mazda RX4 Wag 24.32264 21.0
#> Datsun 710 26.95458 22.8
#> Hornet 4 Drive 25.96479 21.4
#> Hornet Sportabout 23.13039 18.7
#> Valiant 18.38390 18.1
#> Duster 360 18.76632 14.3
#> Merc 240D 16.94420 24.4
#> Merc 230 16.92171 22.8
#> Merc 280 25.51488 19.2
#> Merc 280C 24.59258 17.8
#> Merc 450SE 27.41348 16.4
#> Merc 450SL 19.95856 17.3
#> Merc 450SLC 21.75818 15.2
#> Cadillac Fleetwood 18.15895 10.4
#> Lincoln Continental 21.71319 10.4
From the above, we see that wt alone may not be a great predictor of mpg. We can instead fit a multiple linear regression model with additional predictors. Suppose disp and drat are also important; we then add those to the model.
multi_lin <- multi_model_2(mtcars[1:16, ], mtcars[17:32, ], "mpg",
                           "wt + disp + drat", "lm")
multi_lin[, c("predicted", "mpg")]
#> predicted mpg
#> Mazda RX4 10.43041 21.0
#> Mazda RX4 Wag 24.39765 21.0
#> Datsun 710 25.56629 22.8
#> Hornet 4 Drive 25.38957 21.4
#> Hornet Sportabout 23.15234 18.7
#> Valiant 17.36908 18.1
#> Duster 360 17.67102 14.3
#> Merc 240D 15.59802 24.4
#> Merc 230 14.96161 22.8
#> Merc 280 25.05592 19.2
#> Merc 280C 23.66222 17.8
#> Merc 450SE 25.95326 16.4
#> Merc 450SL 17.05637 17.3
#> Merc 450SLC 21.97756 15.2
#> Cadillac Fleetwood 17.22593 10.4
#> Lincoln Continental 22.17872 10.4
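To put a number on the comparison, we can compute the root mean squared error of each set of predictions by hand, assuming (as in the tables above) that the predicted and mpg columns are aligned row by row:
# RMSE of the single-predictor and multiple-predictor fits
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(lin_model$mpg, lin_model$predicted)
rmse(multi_lin$mpg, multi_lin$predicted)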
fit_model
This function allows us to fit a model of any supported type without necessarily returning predictions.
lm_model <- fit_model(mtcars, "mpg", "wt", "lm")
lm_model
#>
#> Call:
#> lm(formula = mpg ~ wt, data = use_df)
#>
#> Coefficients:
#> (Intercept) wt
#> 37.285 -5.344
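The printed output suggests that the return value is an ordinary lm fit; assuming that is the case, the usual base R helpers can be applied to it directly:
# Coefficients, adjusted R squared, and predictions for two hypothetical weights
coef(lm_model)
summary(lm_model)$adj.r.squared
predict(lm_model, newdata = data.frame(wt = c(2.5, 3.5)))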
fit_models
This is similar to fit_model
with the ability to fit many models with many predictors at once. A simple linear model for instance:
models <- fit_models(df = sample_data, yname = c("height", "weight"),
                     xname = "yield", modeltype = "glm")
One can then use these models as needed, for example to append residuals or predictions from each model to the data:
# Append residuals from each fitted model to sample_data
res_residuals <- lapply(models[[1]], add_model_residuals, sample_data)
# Append predictions from each fitted model to sample_data
res_predictions <- lapply(models[[1]], add_model_predictions, sample_data, sample_data)
# Get height predictions for the model height ~ yield
head(res_predictions[[1]])
#> normal height weight yield predicted
#> 1 Yes 0.2849090 0.13442312 520.2837 0.5028866
#> 2 No 0.2427826 0.37484971 504.4754 0.4943626
#> 3 Yes 0.2579432 0.47134828 515.6463 0.5003860
#> 4 No 0.5175604 0.50143592 522.2247 0.5039331
#> 5 Yes 0.4026023 0.47171755 502.6406 0.4933732
#> 6 No 0.9789886 0.04191937 509.4663 0.4970537
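Assuming the list elements follow the order of yname, the second element holds the corresponding results for the weight ~ yield model, and res_residuals holds the same data with residuals appended instead:
# Predictions from the weight ~ yield model
head(res_predictions[[2]])
# Residuals from the height ~ yield model
head(res_residuals[[1]])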
If one would like to drop non-numeric columns from the analysis, one can set drop_non_numeric to TRUE as follows. The same can be done for fit_model above:
fit_models(df = sample_data, yname = c("height", "weight"),
           xname = ".", modeltype = c("lm", "glm"), drop_non_numeric = TRUE)
#> [[1]]
#> [[1]][[1]]
#>
#> Call:
#> lm(formula = height ~ ., data = use_df)
#>
#> Coefficients:
#> (Intercept) weight yield
#> 0.2176942 -0.2185572 0.0006712
#>
#>
#> [[1]][[2]]
#>
#> Call:
#> lm(formula = weight ~ ., data = use_df)
#>
#> Coefficients:
#> (Intercept) height yield
#> 0.0112753 -0.1463926 0.0006827
#>
#>
#>
#> [[2]]
#> [[2]][[1]]
#>
#> Call: glm(formula = height ~ ., data = use_df)
#>
#> Coefficients:
#> (Intercept) weight yield
#> 0.2176942 -0.2185572 0.0006712
#>
#> Degrees of Freedom: 999 Total (i.e. Null); 997 Residual
#> Null Deviance: 45.82
#> Residual Deviance: 44.32 AIC: -270.3
#>
#> [[2]][[2]]
#>
#> Call: glm(formula = weight ~ ., data = use_df)
#>
#> Coefficients:
#> (Intercept) height yield
#> 0.0112753 -0.1463926 0.0006827
#>
#> Degrees of Freedom: 999 Total (i.e. Null); 997 Residual
#> Null Deviance: 30.7
#> Residual Deviance: 29.69 AIC: -671.1
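Based on the printed structure, the result is a nested list indexed first by model type and then by response variable, so individual fits can be pulled out once the result is stored. A small sketch:
# Store the result and extract a single fit: [[model type]][[response]]
all_fits <- fit_models(df = sample_data, yname = c("height", "weight"),
                       xname = ".", modeltype = c("lm", "glm"),
                       drop_non_numeric = TRUE)
all_fits[[2]][[1]]  # the glm fit of height on the remaining numeric columns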
extract_model_info
To extract information about a given model, we can use extract_model_info as follows.
extract_model_info(lm_model, "r2")
#> [1] 0.7528328
To extract the adjusted R squared:
extract_model_info(lm_model, "adj_r2")
#> [1] 0.7445939
For the p-values of the coefficients:
extract_model_info(lm_model, "p_value")
#> (Intercept) wt
#> 8.241799e-19 1.293959e-10
To extract multiple attributes:
extract_model_info(lm_model, c("p_value", "response", "call", "predictors"))
#> $p_value
#> (Intercept) wt
#> 8.241799e-19 1.293959e-10
#>
#> $response
#> [1] "mpg"
#>
#> $call
#> lm(formula = mpg ~ wt, data = use_df)
#>
#> $predictors
#> [1] "wt"
This is not restricted to linear models; it works for most model types. See help(extract_model_info) for the currently supported model types.
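As a closing example, and assuming fit_model accepts the same multi-predictor xname syntax that multi_model_2 used above, we can fit the larger mpg model and pull out its fit statistics in one go:
# Fit mpg on wt, disp and drat, then extract R squared and adjusted R squared
multi_fit <- fit_model(mtcars, "mpg", "wt + disp + drat", "lm")
extract_model_info(multi_fit, c("r2", "adj_r2"))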