The glmnetUtils package provides a collection of tools to streamline the process of fitting elastic net models with glmnet. I wrote the package after a couple of projects where I found myself writing the same boilerplate code to convert a data frame into a predictor matrix and a response vector. In addition to providing a formula interface, it also features a function cva.glmnet
to do crossvalidation for both \(\alpha\) and \(\lambda\), as well as some utility functions.
The interface that glmnetUtils provides is very much the same as for most modelling functions in R. To fit a model, you provide a formula and data frame. You can also provide any arguments that glmnet will accept. Here are some simple examples for different types of data:
## Call:
## glmnetUtils:::glmnet.formula(formula = mpg ~ cyl + disp + hp,
## data = mtcars)
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha: 1
## Lambda summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03326 0.11690 0.41003 1.02839 1.44125 5.05505
# multinomial logistic regression with specified elastic net alpha parameter
(irisMod <- glmnet(Species ~ ., data=iris, family="multinomial", alpha=0.5))
## Call:
## glmnetUtils:::glmnet.formula(formula = Species ~ ., data = iris,
## family = "multinomial", alpha = 0.5)
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha: 0.5
## Lambda summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000870 0.0008707 0.0087093 0.0979220 0.0870709 0.8699915
# Poisson regression with an offset
(InsMod <- glmnet(Claims ~ District + Group + Age, data=MASS::Insurance,
family="poisson", offset=log(Holders)))
## Call:
## glmnetUtils:::glmnet.formula(formula = Claims ~ District + Group +
## Age, data = MASS::Insurance, family = "poisson", offset = log(Holders))
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha: 1
## Lambda summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02877 0.11613 0.46883 1.40515 1.89269 7.64083
Under the hood, glmnetUtils creates a model matrix and response vector, and passes them to the glmnet package to do the actual model fitting. A simple print
method is also provided, to show the main model details at a glance. I’ll describe shortly what the “sparse model matrix” and “use model.frame” options do.
Predicting from a model works as you’d expect: just pass a data frame containing the new observations to the predict
method. You can also specify any arguments that predict.glmnet
accepts.
# least squares regression: get predictions for lambda=1
predict(mtcarsMod, newdata=mtcars, s=1)
# multinomial logistic regression: get predicted class
predict(irisMod, newdata=iris, type="class")
# Poisson regression: need to specify offset
predict(InsMod, newdata=MASS::Insurance, offset=log(Holders))
If you want, you can still use the original model matrix-plus-response syntax:
mtcarsX <- as.matrix(mtcars[c("cyl", "disp", "hp")])
mtcarsY <- mtcars$mpg
mtcarsMod2 <- glmnet(mtcarsX, mtcarsY)
summary(as.numeric(predict(mtcarsMod, mtcars) -
predict(mtcarsMod2, mtcarsX)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
This shows that the resulting models are identical, in terms of the predictions they make and the regularisation parameters used.
There are two ways in which glmnetUtils can generate a model matrix out of a formula and data frame. The first is to use the standard R machinery comprising model.frame
and model.matrix
; and the second is to build the matrix one variable at a time.
model.frame
This is the simpler option, and the one that is most compatible with other R modelling functions. The model.frame
function takes a formula and data frame and returns a model frame: a data frame with special information attached that lets R make sense of the terms in the formula. For example, if a formula includes an interaction term, the model frame will specify which columns in the data relate to the interaction, and how they should be treated. Similarly, if the formula includes expressions like exp(x)
or I(x^2)
on the RHS, model.frame
will evaluate these expressions and include them in the output.
The major disadvantage of using model.frame
is that it generates a terms
object, which encodes how variables and interactions are organised. One of the attributes of this object is a matrix with one row per variable, and one column per main effect and interaction. At minimum, this is (approximately) a \(p \times p\) square matrix where \(p\) is the number of main effects in the model. For wide datasets with \(p > 10000\), this matrix can approach or exceed a gigabyte in size. Even if there is enough memory to store such an object, generating the model matrix can be very slow.
Another issue with the standard R approach is the treatment of factors. Normally, model.matrix will turn an \(N\)-level factor into an indicator matrix with \(N-1\) columns, with one column being dropped. This is necessary for unregularised models as fit with lm
and glm
, since the full set of \(N\) columns is linearly dependent. With the usual treatment contrasts, the interpretation is that the dropped column represents a baseline level, while the coefficients for the other columns represent the difference in the response relative to the baseline.
This may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.
To deal with the problems above, glmnetUtils by default will avoid using model.frame
, instead building up the model matrix term-by-term. This avoids the memory cost of creating a terms
object, and can be noticeably faster than the standard approach. It will also include one column in the model matrix for all levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero is meaningful (usually).
This works in an additive fashion, ie the formula ~ a + b:c + d*e
is treated as consisting of three terms, a
, b:c
and d*e
each of which is processed independently of the others. A dot in the formula includes all main effect terms, ie ~ . + a:b + f(x)
expands to ~ a + b + x + a:b + f(x)
(assuming a, b and x are the only columns in the data). Note that a formula like ~ (a + b) + (c + d)
will be treated as two terms, a + b
and c + d
.
To examine the speed impact of using model.frame
, let’s do some simple comparisons of run times. We’ll generate sample datasets with 100, 1,000 and 10,000 predictors, and then run glmnet
with both options for generating the model matrix.
# generate sample (uncorrelated) data of a given size
makeSampleData <- function(N, P)
{
X <- matrix(rnorm(N*P), nrow=N)
data.frame(y=rnorm(N), X)
}
# test for three sizes: 100/1000/10000 predictors
df1 <- makeSampleData(N=1000, P=100)
df2 <- makeSampleData(N=1000, P=1000)
df3 <- makeSampleData(N=1000, P=10000)
library(microbenchmark)
res <- microbenchmark(
glmnet(y ~ ., df1, use.model.frame=TRUE),
glmnet(y ~ ., df1, use.model.frame=FALSE),
glmnet(y ~ ., df2, use.model.frame=TRUE),
glmnet(y ~ ., df2, use.model.frame=FALSE),
glmnet(y ~ ., df3, use.model.frame=TRUE),
glmnet(y ~ ., df3, use.model.frame=FALSE),
times=10
)
print(res, unit="s", digits=2)
## Unit: seconds
## expr min lq mean median uq max neval
## glmnet(y ~ ., df1, use.model.frame = TRUE) 0.024 0.024 0.027 0.025 0.029 0.032 10
## glmnet(y ~ ., df1, use.model.frame = FALSE) 0.023 0.026 0.029 0.026 0.028 0.051 10
## glmnet(y ~ ., df2, use.model.frame = TRUE) 3.703 3.916 4.258 4.272 4.428 5.153 10
## glmnet(y ~ ., df2, use.model.frame = FALSE) 3.756 3.874 4.291 4.352 4.561 5.073 10
## glmnet(y ~ ., df3, use.model.frame = TRUE) 11.973 12.353 13.262 13.350 13.864 14.622 10
## glmnet(y ~ ., df3, use.model.frame = FALSE) 4.295 4.639 4.992 4.822 5.060 6.111 10
From this, we can see that for datasets with up to 1,000 predictors, both methods are about as fast as each other. However, for 10,000 predictors (not uncommon these days), the model.frame
method takes three times as long as building the model matrix term by term.
What happens if we take it up to 100,000 predictors? As seen below, the standard approach of using model.frame
fails when R runs out of memory: the data frame itself is about 800MB in size, but trying to allocate the terms
object requires more then 67GB. However, building the model matrix by term still works, and (on this machine) finishes in about two minutes.
## Warning in terms.formula(formula, data = data): Reached total allocation of
## 32666Mb: see help(memory.size)
## Error: cannot allocate vector of size 37.3 Gb
## Call:
## glmnet.formula(formula = y ~ ., data = df4, use.model.frame = FALSE)
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha: 1
## Lambda summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002393 0.006506 0.017680 0.032470 0.048080 0.130700
As an option, glmnetUtils can also generate a sparse model matrix, using the sparse.model.matrix
function provided in the Matrix package. This works exactly the same as a regular model matrix, but takes up significantly less memory if many of its entries are zero. A scenario where this is the case would be where many of the predictors are factors, each with a large number of levels. This can be combined with both of the previously mentioned options for generating model matrices.
One piece missing from the standard glmnet package is a way of choosing \(\alpha\), the elastic net mixing parameter, similar to how cv.glmnet
chooses \(\lambda\), the shrinkage parameter. To fix this, glmnetUtils provides the cva.glmnet
function, which uses crossvalidation to examine the impact on the model of changing \(\alpha\) and \(\lambda\). The interface is the same as for the other functions:
# Leukemia dataset from Trevor Hastie's website:
# https://web.stanford.edu/~hastie/glmnet/glmnetData/Leukemia.RData
leuk <- do.call(data.frame, Leukemia)
leukMod <- cva.glmnet(y ~ ., data=leuk, family="binomial")
leukMod
## Call:
## cva.glmnet.formula(formula = y ~ ., data = leuk, family = "binomial")
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha values: 0 0.001 0.008 0.027 0.064 0.125 0.216 0.343 0.512 0.729 1
## Number of crossvalidation folds for lambda: 10
cva.glmnet
uses the algorithm described in the help for cv.glmnet
, which is to fix the distribution of observations across folds and then call cv.glmnet
in a loop with different values of \(\alpha\). Optionally, you can parallelise this outer loop, by setting the outerParallel
argument to a non-NULL value. Currently, glmnetUtils supports the following methods of parallelisation:
parLapply
in the parallel package. To use this, set outerParallel
to a valid cluster object created by makeCluster
.rxExec
as supplied by Microsoft R Server’s RevoScaleR package. To use this, set outerParallel
to a valid compute context created by RxComputeContext
, or a character string specifying such a context.If the outer loop is run in parallel, cva.glmnet
can check if the inner loop (over \(\lambda\)) is also set to run in parallel, and disable this if it would lead to contention for cores.
Because crossvalidation is often a statistically noisy procedure, it doesn’t try to automatically choose \(\alpha\) and \(\lambda\) for you. Instead you can plot the output, to see how the results depend on the values of these parameters. Using this information, you can choose appropriate values for your data.
In this case, we see that values of \(\alpha\) close to \(1\) tend to lead to better accuracy. The curves don’t have a well-defined minimum, but they do flatten out for lower values of \(\lambda\). As the cv.glmnet
documentation recommends though, it’s a good idea to run cva.glmnet
multiple times to reduce the impact of noise.
A cva.glmnet
object contains a list of individual cv.glmnet
objects, corresponding to the different \(\alpha\) values tried. This lets you plot the crossvalidation results easily for a given \(\alpha\):
The glmnetUtils package is a way to improve quality of life for users of glmnet. As with many R packages, it’s always under development; you can get the latest version from my GitHub repo. If you find a bug, or if you want to suggest improvements to the package, please feel free to contact me at hongooi73@gmail.com.