Whenever we numerically optimize a function, the optimization algorithm and chosen starting values can have a substantial effect on the computational cost or even on the results. The purpose of the {ino} R package is to provide a comprehensive toolbox for comparing different initialization strategies when performing numerical optimization.
In the following, we demonstrate the main functionality of {ino}. In particular, we show how to numerically optimize a likelihood function, applying different strategies for choosing the starting values and the optimizer, and eventually comparing the optimization outcomes.
We consider the popular faithful data set that is provided via base R.1 The following histogram of eruption times indicates two clusters with short and long eruption times, respectively:
data("faithful")
ggplot(faithful, aes(x = eruptions, y = ..density..)) +
geom_histogram(bins = 30) + xlab("Eruption time (min)") + ylab("density")
For both clusters, we assume a normal distribution here, such that we consider a mixture of two Gaussian densities for modeling the overall eruption times. The log-likelihood is given by
\[ \ell(\boldsymbol{\theta}) = \sum_{i=1}^n \log\Big( \pi f_{\mu_1, \sigma_1^2}(x_i) + (1-\pi)f_{\mu_2,\sigma_2^2} (x_i) \Big), \] where \(f_{\mu_1, \sigma_1^2}\) and \(f_{\mu_2, \sigma_2^2}\) denote the normal density for the first and second cluster, respectively, and \(\pi\) is the mixing proportion. The vector of parameters to be estimated is thus \(\boldsymbol{\theta} = (\mu_1, \mu_2, \sigma_1, \sigma_2, \pi)\). As there exists no closed-form solution for the maximum likelihood estimator, we need numerical optimization for finding the function optimum, where the {ino} package will help us in applying different initialization strategies.
The following function implements the log-likelihood of the normal mixture. Note that we restrict the standard deviations sigma to be positive via an exp() transformation and the mixing proportion pi to lie between 0 and 1 via plogis(), and that the function returns the negative log-likelihood value.
normal_mixture_llk <- function(theta, data, column){
  mu <- theta[1:2]
  sigma <- exp(theta[3:4])   # ensure positive standard deviations
  pi <- plogis(theta[5])     # ensure mixing proportion between 0 and 1
  logl <- sum(log(pi * dnorm(data[[column]], mu[1], sigma[1]) +
                    (1 - pi) * dnorm(data[[column]], mu[2], sigma[2])))
  return(-logl)              # negative log-likelihood for minimization
}
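To get a sense of what {ino} automates, a single optimization run could also be carried out directly with base R's nlm() (we assume here that this is the optimizer wrapped by set_optimizer_nlm() below; the starting values are purely illustrative):
# direct call to stats::nlm() for comparison; starting values are illustrative
nlm(normal_mixture_llk, p = c(2, 4, log(1), log(1), qlogis(0.5)),
    data = faithful, column = "eruptions")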
We set up an ino object using the function setup_ino. Here, f is the function to be optimized (i.e. normal_mixture_llk), npar gives the number of parameters, data and column specify the data frame and the column of observations required by our likelihood function, and opt sets the optimizer. Behind the scenes, setup_ino runs several checks on the inputs, such as checking whether a function and an optimizer have been provided, whether the function f can be called, etc.
geyser_ino <- setup_ino(
  f = normal_mixture_llk,
  npar = 5,
  data = faithful,
  column = "eruptions",
  opt = set_optimizer_nlm()
)
With the ino object geyser_ino, we can now optimize the likelihood function. We will first consider fixed starting values, i.e. we make “educated guesses” about starting values that are probably close to the optimum. Based on the histogram above, the means of the two normal distributions may be somewhere around 2 and 4. For the variances, we set the starting values to 1 (note that we pass log-transformed values here, since the log-likelihood function uses exp() to restrict the standard deviations to be positive). We will use two sets of starting values whose means lie slightly below and slightly above 2 and 4, respectively. For both sets, the starting value for the mixing proportion is 0.5. To compare the results of different optimization runs, we also select a third set of starting values that is rather implausible.
starting_values <- list(
  c(1.7, 4.3, log(1), log(1), qlogis(0.5)),
  c(2.3, 3.7, log(1), log(1), qlogis(0.5)),
  c(10, 8, log(0.1), log(0.2), qlogis(0.5))
)
The function provided by {ino} to run the optimization with chosen
starting values is fixed_initialization()
. We then loop
over the set of starting values by passing them to the argument
at
.
for(val in starting_values)
  geyser_ino <- fixed_initialization(geyser_ino, at = val)
Let’s look at the results:
summary(geyser_ino)
#> .strategy .time .optimum code iterations
#> 1 fixed 0.010668039 secs 276.360 1 17
#> 2 fixed 0.009079933 secs 276.360 1 16
#> 3 fixed 0.014434099 secs 421.417 1 32
The last run (with the implausible starting values) converged to a local optimum. We discard this run from further comparisons:
geyser_ino <- clear_ino(geyser_ino, which = 3)
When using randomly chosen starting values, we apply the function random_initialization(). The simplest (yet not necessarily effective) function call would be as follows:
geyser_ino <- random_initialization(geyser_ino)
Here, npar
starting values are randomly drawn from a
standard normal distribution. Depending on the application and the
magnitude of the parameters to be estimated, this may not be a good
guess. We can, however, easily modify the distribution that is used to
draw the random numbers. For example, the next code snippet uses starting values drawn from a \(\mathcal{N}(2, 0.5^2)\) distribution, i.e. with mean 2 and standard deviation 0.5:
geyser_ino <- random_initialization(geyser_ino, sampler = stats::rnorm, mean = 2, sd = 0.5)
The argument sampler allows us to use any random number generator, and further arguments for the sampler can easily be passed on. As was done above for fixed starting values, we could of course also use a loop here to run multiple optimizations; a sketch is shown below.
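For illustration, such a loop could look as follows (a sketch only, not executed here, so that the run counts in the summaries below remain unchanged):
# sketch (not run): five additional runs with the custom sampler from above
for(i in 1:5)
  geyser_ino <- random_initialization(
    geyser_ino, sampler = stats::rnorm, mean = 2, sd = 0.5)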
The geyser_ino
object now contains the results and
further information on four optimization runs (two runs with fixed and
two with randomly selected starting values). We can use
summary()
to obtain summary statistics for the optimization
runs conducted. Moreover, we can also add grouping variables, for example to compare the average computation time between the two approaches:
summary(geyser_ino, group = ".strategy", "average_time" = mean(.time))
#> .strategy runs average_time
#> 1 fixed 2 0.009873986 secs
#> 2 random 2 0.017392993 secs
In addition, summary()
by default prints the number of
optimization runs for each group.
Due to the relatively simple statistical model considered in this example, the average computation time in our examples is fairly low. However, when considering more complex models, the computation time can substantially differ across different optimization strategies.
If we do not set any group, summary()
prints the entire
table displaying all optimization runs.
summary(geyser_ino)
#> .strategy .time .optimum code iterations
#> 1 fixed 0.010668039 secs 276.36 1 17
#> 2 fixed 0.009079933 secs 276.36 1 16
#> 3 random 0.018209934 secs 276.36 2 39
#> 4 random 0.016576052 secs 276.36 1 31
Here, the runs with fixed starting values require far fewer iterations until convergence than the runs with randomly chosen starting values. The final (negative) log-likelihood value (column .optimum) further indicates that the randomly initialized runs did not get stuck in a local optimum but reached the same optimum as the fixed-initialization runs.
For standardized initialization, we standardize all columns in our data before running the optimization. After standardizing the data, we can again use either fixed or randomly chosen starting values. In the example code snippet shown below, we consider five sets of randomly selected starting values:
for(i in 1:5)
  geyser_ino <- standardize_initialization(
    geyser_ino, initialization = random_initialization())
summary(geyser_ino, group = ".strategy", "average_time" = mean(.time))
#> .strategy runs average_time
#> 1 fixed 2 0.009873986 secs
#> 2 random 2 0.017392993 secs
#> 3 standardize>random 5 0.014784002 secs
If the data set considered shows some complex structures or is very large, numerical optimization may become computationally costly. In such cases, subset initialization can be a helpful tool.
The function subset_initialization() optimizes the likelihood on a subsample of the data, where the argument prop specifies the proportion of observations to use.
We can then fit our model to the subset using either fixed or random
starting values. For example, the following code snippet randomly
samples 50% of the observations and uses random initialization:
for(i in 1:5)
  geyser_ino <- subset_initialization(
    geyser_ino, prop = 0.5, initialization = random_initialization())
After having optimized the likelihood on the subsample,
subset_initialization()
automatically fits the model to the
entire data set using the resulting parameter estimates of the subset as
starting values. The resulting optimization run is labelled as
subset>random
in the summary()
table:
summary(geyser_ino, group = ".strategy", "average_time" = mean(.time))
#> .strategy runs average_time
#> 1 fixed 2 0.009873986 secs
#> 2 random 2 0.017392993 secs
#> 3 standardize>random 5 0.014784002 secs
#> 4 subset>random 5 0.015493393 secs
In addition to selecting subsamples at random, we can specify two
further options using the argument how
. When dealing with
time series data, we usually do not want to delete single observations
at random. Instead, we can select a proportion of the first rows by
specifying how = "first"
. We could also cluster our data
first using how = "kmeans"
. For all subset optimization
strategies, subset_initialization()
first fits the model to
the chosen subset and stores the resulting estimates. These estimates
are then used as initial values in the second step where the model is
fitted to the entire data.
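For example, a k-means-based subset could be requested as follows (a sketch, assuming how can be combined with prop and initialization as in the call above):
# sketch: choose the subset via k-means clustering instead of random sampling
geyser_ino <- subset_initialization(
  geyser_ino, prop = 0.5, how = "kmeans",
  initialization = random_initialization())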
In some of the examples above, we combined two optimization strategies. For example, we have chosen random starting values for a subset. The flexibility of {ino} also allows combining more than two strategies. We could, for example, select a subset, standardize the data, and then use random starting values.
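A hypothetical sketch of such a three-step combination is shown below; whether this exact nesting of initialization strategies is supported may depend on the package version:
# sketch: subset the data, standardize it, then use random starting values
geyser_ino <- subset_initialization(
  geyser_ino, prop = 0.5,
  initialization = standardize_initialization(
    initialization = random_initialization()))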
The data set contains information about the eruption times of the Old Faithful geyser in Yellowstone National Park, Wyoming, USA: the eruption time and the waiting time to the next eruption (both given in minutes).↩︎