Estimating predictive models

David Kepplinger

2021-07-07

In this guide we will estimate predictive models by means of robust adaptive PENSE estimates for high-dimensional linear regression. These estimates can tolerate up to 50% of contamination, i.e., the adaptive PENSE estimates are reliable even if up to half the observations in the data set contain anomalous values. Compute adaptive PENSE estimates is implemented in the function adapense_cv() in the pense package.

While the following guide computes adaptive PENSE estimates, everything also applies to other estimates implemented in the pense package: non-adaptive PENSE estimates and regularized M-estimates. Non-adaptive PENSE estimates (computed by pense_cv()) are typically better at identifying all relevant predictors than adaptive PENSE. However, this comes at the price of often including a large number of irrelevant predictors as well. Regularized M-estimates (computed by pensem_cv()) can be more accurate than either PENSE or adaptive PENSE estimates, but may be unreliable in presence of highly detrimental contamination.

Computing adaptive PENSE estimates

First we need to load the pense package:

library(pense)
#> Loading required package: Matrix

Computing robust, regularized estimates for high-dimensional linear regression models can take a long time. To save time, many steps can be done in parallel. If your computer has more than 1 CPU core, you can harness more computing power by creating a cluster of R processes:

library(parallel)
# If you don't know how many CPU cores are available, first run `detectCores(logical = FALSE)`
cluster <- makeCluster(3)

This guide uses the following simulated data with 50 observations and 40 available predictors. The error distribution is a heavy-tailed t-distribution and only the first 3 predictors are truly relevant for predicting the response y:

set.seed(1234)
x <- matrix(rweibull(50 * 40, 2, 3), ncol = 40)
y <- 1 * x[, 1] - 1.1 * x[, 2] + 1.2 * x[, 3] + rt(nrow(x), df = 2)

To make the scenario more realistic, let’s add some contamination to the response value of the first 3 observations and to some predictors:

y[1:3] <- 5 * apply(x[1:3, ], 1, max)
x[3:6, 4:6] <- 1.5 * max(x) + abs(rcauchy(4 * 3))

Step 1: Computing the estimates

The first step is to compute adaptive PENSE estimates for a fixed value for hyper-parameter alpha and many different penalization levels (hyper-parameter lambda). The adapense_cv() function automatically determines a grid of penalization levels, with parameter nlambda= controlling the number of different penalization levels to be used (default 50). We are going to choose the penalization level which leads to a good balance between prediction accuracy and model size. The pense package can automatically evaluate prediction accuracy of the adaptive PENSE estimates via cross-validation.

In its simplest form, computing adaptive PENSE estimates and estimating their prediction accuracy is done with the code below. Here, prediction accuracy is estimated via 5-fold cross-validation, replicated 3 times:

set.seed(1234)
fit_075 <- adapense_cv(x, y, alpha = 0.75, cv_k = 5, cv_repl = 3, cl = cluster)

You should always set a seed prior to computing adaptive PENSE or PENSE estimates to ensure reproducibility of the internal cross-validation.

By default, adaptive PENSE estimates are computed with a breakdown point of ~25%. This means, the estimates are reliable if up to 25% of observations contain arbitrary contamination. If you suspect that a larger proportion of observations may be affected by contamination, you can increase the breakdown point of the estimates with argument bdp= to up to 50%. Note, however, that a higher breakdown point also leads to less accurate estimates.

Step 2: Assessing prediction performance

The plot() function for the object fit_075 shows the estimated prediction accuracy for all fitted models.

plot(fit_075)

Estimated prediction accuracy using 3 replications of 5-fold CV.

The plot shows the estimated scale of the prediction error for all 50 models. The penalization level leading to the best prediction performance is highlighted by a dark blue dot. If more than one CV replication was performed, the plot also shows a light blue dot, marking the most parsimonious model with prediction performance “indistinguishable” from the best model. The plot uses the “one-standard-error” rule using the minimum average scale of the prediction error plus 1 standard error of this estimated scale. You can adjust this rule to the “m-standard-error” with plot(fit_075, lambda = "m-se"), where m is any positive number (e.g., lambda = "1.5-se").

In our case, the estimated prediction performance is a bit unstable and has large variability. With such unstable estimates of prediction performance it is difficult to reliably select the penalization level. This is not unusual for robust estimators and can be improved by increasing the number of CV replications.

The more CV replications, the more accurate the estimates of prediction accuracy, but the longer the computing time. If we repeat step #1, but with 10 CV replications instead of 3, we get a more stable evaluation of prediction performance:

set.seed(1234)
fit_075 <- adapense_cv(x, y, alpha = 0.75, cv_k = 5, cv_repl = 10, cl = cluster)
plot(fit_075)

Estimated prediction accuracy using 10 replications of 5-fold CV.

With 10 replications, the error bars are still large, a testament to the contamination in the sample, but the general trend of the prediction performance is much clearer. Of course, you could increase the number of CV replications further to get an even smoother plot.

Step 3: Extracting coefficients

Once we are happy with the stability of the estimated prediction performance, we can extract summary information from the predictive model with

summary(fit_075)
#> Adaptive PENSE fit with prediction performance estimated by 10 replications of 
#> 5-fold cross-validation.
#> 
#> 15 out of 40 predictors have non-zero coefficients:
#> 
#>                Estimate
#> (Intercept)  1.72628025
#> X1           0.72014823
#> X2          -0.85532734
#> X3           0.76292814
#> X5           0.06732408
#> X12          0.11434099
#> X19         -0.23328622
#> X21          0.32518029
#> X22          0.06293208
#> X26         -0.42844671
#> X29          0.22662846
#> X30          0.01786396
#> X32         -0.02880954
#> X36         -0.30372631
#> X38         -0.04016458
#> X39         -0.14515861
#> ---
#> 
#> Hyper-parameters: lambda=0.03528578, alpha=0.75, exponent=1

This model corresponds to the model with smallest scale of the prediction error (the blue dot in the plot above). There are a total of 15 predictors in the model. If you think a sparser model may be more appropriate for your application, you can also apply the q-standard-error rule as in the plots. The default, the one-standard-error rule leads to the following predictive model:

summary(fit_075, lambda = "2-se")
#> Adaptive PENSE fit with prediction performance estimated by 10 replications of 
#> 5-fold cross-validation.
#> 
#> 7 out of 40 predictors have non-zero coefficients:
#> 
#>                Estimate
#> (Intercept)  0.31983318
#> X1           0.72274606
#> X2          -0.77400097
#> X3           0.79400018
#> X5           0.03650988
#> X22          0.02970441
#> X29          0.15663196
#> X39         -0.13004937
#> ---
#> 
#> Hyper-parameters: lambda=0.1142336, alpha=0.75, exponent=1

In this fit, only 7 out of the 40 predictors are relevant, including the 3 truly relevant predictors. But maybe different values for hyper-parameters alpha and exponent lead to even better prediction?

Step 4: Exploring different hyper-parameters

The choice for hyper-parameters alpha and exponent (which was kept at its default value of 1) are rather arbitrary. The effects of these two hyper-parameters on the estimates are in general less pronounced than of the penalization level. But you may still want to explore different values for alpha and exponent.

While alpha=0.75 is a good value for many applications, alpha=1 may also be of interest, particularly in applications where correlation between predictors is not an issue. In applications with high correlation between predictors, lower values of alpha (e.g., alpha=0.5) may lead to more stability in variable selection.

The hyper-parameter exponent generally has an effect on the sparsity of the models. With higher values for exponent, typically only predictors with the largest (in absolute magnitude) standardized coefficients will be non-zero. While this helps to screen out many or most of the truly irrelevant, it also risks missing some of the truly relevant predictors.

Let us compute adaptive PENSE estimates for different values of hyper-parameters alpha and exponent. In the code below, this is done for two values: alpha=0.75 and alpha=1 as well as exponent=1, exponent=2, and exponent=3.

set.seed(1234)
fit_exp_1 <- adapense_cv(x, y, alpha = c(0.75, 1), exponent = 1, cv_k = 5, cv_repl = 10, cl = cluster)

set.seed(1234)
fit_exp_2 <- adapense_cv(x, y, alpha = c(0.75, 1), exponent = 2, cv_k = 5, cv_repl = 10, cl = cluster)

set.seed(1234)
fit_exp_3 <- adapense_cv(x, y, alpha = c(0.75, 1), exponent = 3, cv_k = 5, cv_repl = 10, cl = cluster)

Note that we set the seed before each call to adapense_cv(). This is again to ensure reproducibility of the CV, but also to make the estimated prediction performance more comparable across different values of the exponent hyper-parameter.

After checking that the cross-validated prediction performance of the fitted models is smooth enough to reliably select the penalization level, we can compare all the estimates. For this, the package includes the function prediction_performance(), which extracts and prints the prediction performance of all given objects:

prediction_performance(fit_exp_1, fit_exp_2, fit_exp_3)
#> Prediction performance estimated by cross-validation:
#> 
#>       Model Estimate Std. Error Predictors alpha exp.
#> 1 fit_exp_3 1.640570 0.09775037         13  1.00    3
#> 2 fit_exp_3 1.659848 0.09212068         13  0.75    3
#> 3 fit_exp_2 1.694853 0.09388246         14  0.75    2
#> 4 fit_exp_2 1.705181 0.10346154         13  1.00    2
#> 5 fit_exp_1 1.884292 0.11975320         12  1.00    1
#> 6 fit_exp_1 1.886142 0.09226434         15  0.75    1

Here we see the combination of hyper-parameters alpha=0.75 and exponent=3 (in object fit_exp_3) leads to the best prediction accuracy, but all models are based on 12 or more predictors. Instead, we can also compare sparser models with almost the same prediction performance:

prediction_performance(fit_exp_1, fit_exp_2, fit_exp_3, lambda = '1-se')
#> Prediction performance estimated by cross-validation:
#> 
#>       Model Estimate Std. Error Predictors alpha exp.
#> 1 fit_exp_3 1.719353 0.06727231         13  0.75    3
#> 2 fit_exp_3 1.726839 0.08486051         13  1.00    3
#> 3 fit_exp_2 1.752905 0.06393139         12  0.75    2
#> 4 fit_exp_2 1.803677 0.09448569         12  1.00    2
#> 5 fit_exp_1 1.977632 0.11355306         12  0.75    1
#> 6 fit_exp_1 1.996706 0.14173465         10  1.00    1

These models are still quite large. If we are interested in even sparser models, we can increase the tolerance level to, for instance, 3 standard errors:

prediction_performance(fit_exp_1, fit_exp_2, fit_exp_3, lambda = '3-se')
#> Prediction performance estimated by cross-validation:
#> 
#>       Model Estimate Std. Error Predictors alpha exp.
#> 1 fit_exp_3 1.915506 0.05804233          7  0.75    3
#> 2 fit_exp_3 1.916650 0.08945694          3  1.00    3
#> 3 fit_exp_2 1.963516 0.08579261          8  0.75    2
#> 4 fit_exp_2 2.010430 0.08138076          3  1.00    2
#> 5 fit_exp_1 2.130027 0.07301721          6  0.75    1
#> 6 fit_exp_1 2.153167 0.05492182          3  1.00    1

This relaxation still leads to the combination of hyper-parameters alpha=0.75 and exponent=3, but now with only 7 relevant predictors; including all truly relevant predictors. We can see that the estimated coefficients and estimated relevant predictors are close to the truth:

summary(fit_exp_3, alpha = 0.75, lambda = '3-se')
#> Adaptive PENSE fit with prediction performance estimated by 10 replications of 
#> 5-fold cross-validation.
#> 
#> 7 out of 40 predictors have non-zero coefficients:
#> 
#>                 Estimate
#> (Intercept)  0.220783581
#> X1           0.882998504
#> X2          -0.927853798
#> X3           0.859502737
#> X21          0.036693330
#> X22          0.009785297
#> X29          0.082190552
#> X39         -0.052051650
#> ---
#> 
#> Hyper-parameters: lambda=0.002003298, alpha=0.75, exponent=3

Using different measures of prediction performance

By default, adapense_cv() uses the τ-scale of the prediction errors to assess prediction accuracy. This can be changed by specifying a different metric via adapense_cv(cv_metric=). The package supports also the median absolute prediction error (cv_metric = "mape") or the classical root mean squared prediction error (cv_metric = "rmspe"). You should, however, not use the RMSPE to evaluate prediction performance in the potential presence of contamination. Robust methods are not designed to predict contaminated observations well and the RMSPE may be artificially inflated by poor prediction of a few contaminated response values. You can also specify your own function which takes as input the vector of prediction errors and returns a single number, measuring the prediction performance. For example, to use the mean absolute prediction error, you would write

mae <- function (prediction_errors) {
  mean(abs(prediction_errors))
}

set.seed(1234)
fit_075_mae <- adapense_cv(x, y, alpha = 0.75, cv_k = 5, cv_repl = 5, cl = cluster, cv_metric = mae)

A matrix with estimates of the prediction performance are accessible as slot $cv_replications in the object returned by adapense_cv(). The rows correspond to the different penalization levels, and columns correspond to the individual CV replications.