The aim of this guide is to give a brief introduction and explanation to the functions contained inside the vivid
package. vivid
(variable importance and variable interaction displays) is used for investigating relationships within a machine learning model fit. All of the visualisations in this vignette are highly customizable (see the long-form vivid
vignette to see examples) and return ggplot
objects which can be customized using the normal ggplot
options.
Note: For the purposes of speed, the grid size (i.e., gridSize
- the size of the gid on which the evaluations are made) and the number of rows subsetted (nmax
) are small. This achieve more accurate results, incerease both the grid size and the number of rows used.
Some of the plots used by vivid
are built upon the zenplots
package which requires the graph
package from BioConductor. To install the graph
and zenplots
packages use:
if (!requireNamespace("graph", quietly = TRUE)){
install.packages("BiocManager")
BiocManager::install("graph")
}
install.packages("zenplots")
Now we can install and load vivid
by using:
install.packages("vivid")
We then load the other required packages.
library(vivid) # for visualisations
library(randomForest) # for model fit
library(ranger) # for model fit
library(ggplot2)
The data used in the following examples is simulated from the Friedman benchmark problem 11. This benchmark problem is commonly used for testing purposes. The output is created according to the equation:
Create the data:
set.seed(101)
<- function(noFeatures = 10,
genFriedman noSamples = 100,
sigma = 1) {
# Set Values
<- noSamples # no of rows
n <- noFeatures # no of variables
p <- rnorm(n, sd = sigma)
e
# Create matrix of values
<- matrix(runif(n * p, 0, 1), nrow = n) # Create matrix
xValues colnames(xValues) <- paste0("x", 1:p) # Name columns
<- data.frame(xValues) # Create dataframe
df
# Equation:
# y = 10sin(πx1x2) + 20(x3−0.5)^2 + 10x4 + 5x5 + ε
<- (10 * sin(pi * df$x1 * df$x2) + 20 * (df$x3 - 0.5)^2 + 10 * df$x4 + 5 * df$x5 + e)
y # Adding y to df
$y <- y
df
df
}
<- genFriedman(noFeatures = 9, noSamples = 350, sigma = 1) myData
In the following examples, we use a ranger
random forest model fit on the data, with the importance set to permutation
.
set.seed(101)
<- randomForest(y ~ ., data = myData) fit
Next, we create the ‘vivi-matrix’, which will contain variable importance on the diagonal and variable interactions in the upper and lower triangle. This matrix can then be supplied to the vivid
plotting functions.
set.seed(101)
<- vivi(
viFit fit = fit,
data = myData,
response = "y",
gridSize = 10,
importanceType = NULL,
nmax = 100,
reorder = TRUE,
class = 1,
predictFun = NULL
)
#Section 2: Visualizing the results
To create a heatmap of the vivi-matrix, we use:
viviHeatmap(mat = viFit) + ggtitle("random forest fit heatmap")
To create a network graph of the vivi-matrix, we use:
viviNetwork(mat = viFit)
In this plot we use a generalized pairs plot matrix style layout (which we call GPDP) to display partial dependence plots (PDPs) in the upper triangle, individual conditional exception curves (along with the aggregated 1-way partial dependence) on the diagonal and a scatterplot in the lower triangle.
To create the plot we supply the model fit to the plotting function.
set.seed(1701)
pdpPairs(data = myData, fit = fit, response = "y", nmax = 50, gridSize = 10)
#> Generating ice/pdp fits... waiting...
#> Finished ice/pdp
For this plot, we calculate the bivariate partial dependence and display them in a zenplots layout, which we call (ZPDP). The ZPDP is based on graph Eulerians and focuses on key subsets. ‘Zenplots’ create a zigzag expanded navigation plot (‘zenplot’) of the partial dependence values. This results in an alternating sequence of two-dimensional plots laid out in a zigzag structure, as shown in Fig 4.0 below and can be used as a useful space-saving plot that displays the most influential variables.
set.seed(1701)
pdpZen(data = myData, fit = fit, response = "y", nmax = 50, gridSize = 10)
#> Generating ice/pdp fits... waiting...
#> Finished ice/pdp
In Fig 4.0, we can see PDPs laid out in a zigzag structure, with the most influential variable pairs displayed at the top. As we move down the plot, we also move down in influence of the variable pairs.
Using the zpath
argument, we can filter out any interactions below a set value. zpath
takes the vivi matrix as a function argument and then, using cutoff
, we can filter out any interactions below the chosen value. For example:
set.seed(1701)
<- zPath(viv = viFit, cutoff = 0.1)
zpath pdpZen(data = myData, fit = fit, response = "y", nmax = 50, gridSize = 10, zpath = zpath)
#> Generating ice/pdp fits... waiting...
#> Finished ice/pdp
In this section, we briefly describe how to apply the above visualisations to a classification example using the iris
data set.
To begin we fit a ranger
random forest model with “Species” as the response and create the vivi matrix setting the category for classification to be “setosa” using class
.
set.seed(1701)
<- ranger(Species ~ .,
rfClassif data = iris, probability = T,
importance = "impurity"
)
set.seed(101)
<- vivi(
viviClassif fit = rfClassif,
data = iris,
response = "Species",
gridSize = 10,
importanceType = NULL,
nmax = 50,
reorder = TRUE,
class = "setosa",
predictFun = NULL
)#> Embedded impurity variable importance method used.
#> Calculating interactions...
Next we plot the heatmap and network plot of the iris data.
set.seed(1701)
viviHeatmap(mat = viviClassif)
set.seed(1701)
viviNetwork(mat = viviClassif)
As PDPs are evaluated on a grid, they can extrapolate where there is no data. To solve this issue we calculate a convex hull around the data and remove any points that fall outside the convex hull. This can be seen in the GPDP in Fig 3.2 below.
set.seed(1701)
pdpPairs(data = iris, fit = rfClassif, response = "Species", class = "setosa", convexHull = T, gridSize = 10, nmax = 50)
#> Generating ice/pdp fits... waiting...
#> Finished ice/pdp
Finally, a ZPDP for the random forest fit on the iris data with extrapolated data removed:
set.seed(1701)
pdpZen(data = iris, fit = rfClassif, response = "Species", class = "setosa", convexHull = T, gridSize = 10, nmax = 50)
#> Generating ice/pdp fits... waiting...
#> Finished ice/pdp
Friedman, Jerome H. (1991) Multivariate adaptive regression splines. The Annals of Statistics 19 (1), pages 1-67.↩︎