**Table of Contents:**

- 1. The Problem with Training ML Models
- 2. A Bayesian Solution
- 3. How Bayesian Optimization Works
- 4. Code
- 5. Conclusion
- References

# 1. The Problem with Training ML Models

The problem of hyperparameter optimization (i.e. choosing a model’s configuration settings) in machine learning is very real and very painful. It gets worse the more complex your ML model is. It’s something I dealt with a lot on my quest to train an optimal ensemble classifier for government website transparency.

The hyperparameter search process is essentially an “exploration-exploitation” trade-off: for how long are we willing to try out different parameters for our model in order to find a configuration that’s ‘good enough’?

This particular optimization problem is difficult. Our objective function - the way we decide how good a set of hyperparameters is for our model - is often a long process involving training and validating the model to get its out-of-sample accuracy (or F1-score, etc.) and deciding if it’s ‘good enough’. Imagine doing that thousands of times! For this reason, grid search - exhaustively searching through all possible hyperparameter configurations - might give you a stellar model in the end, but you’ll be waiting … for a long time.

Random search makes the wait less painful. Previous work shows that approximately 60 rounds of random search can match the results of an exhaustive grid search for certain models. As it turns out, not all hyperparameters are equally important to tune. The graphic below illustrates this.

Although this approach is nicer, it still feels like we’re stabbing in the dark for hyperparameters. More than that, as our models get more and more computationally expensive and our training/validation routines get more complex, it doesn’t feel like we’re being smart about which hyperparameter set our model should try next … given the ones it’s already tried on for size.

How can we do better?

# 2. A Bayesian Solution

**Bayesian optimization** lends itself quite perfectly to our problem of machine learning configuration for two reasons:

(1) it allows us to model the response surface of our model’s performance using a cheaper *surrogate* function.

(2) it can perform the minimal number of search rounds until there is no further *expected improvement* in the true objective function.

There are many different ways to create this “surrogate” function using various estimators (random forests, etc.).

In the standard approach to Bayesian optimization, we construct this surrogate
function for the objective using a *prior* and with each evaluation of the true
response surface, we can perform *posterior predictive* inference on the
performance of hyperparameters we haven’t tried yet. In this manner, we can
beautifully arrive at an optimum with some amount of confidence.

# 3. How Bayesian Optimization Works

Our problem is defined as follows. Given some “tried” hyperparameter sets \(X=[X_1, ... ,X_a]\), and their observed response values (model performance) \(Y=[Y_1, ... ,Y_a]\):

Can we conditionally predict the response values \(Y_{*}=[Y_{a+1}, ... ,Y_{b}]\) of new hyperparameters \(X_{*}=[X_{a+1}, ... ,X_{b}]\) ?

## 3.1. Gaussian Process: keeping track of what works

The primary mechanism that allows us to find this posterior is the **Gaussian Process (GP)**. A GP is a collection of random variables \(Z\) such that any finite subset of points \(\{z_1, ... , z_m\}\), where \(z_i \in Z\), is jointly drawn from a multivariate Gaussian distribution.

The clever thing about the structure of Gaussian Processes is *consistency*. If
the GP assumes:

\[
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)
\]

it follows that:

\[
y_1 \sim \mathcal{N}(\mu_1, \Sigma_{11})
\]

where \(y_1\) is sampled from the corresponding mean subvector \(\mu_1\) and
covariance submatrix \(\Sigma_{11}\) of the *larger* Gaussian.

Indeed, this implies that if we assume a Gaussian Process prior for the generation of our observed \(Y\), then for some mean vector \(\mu\) and covariance matrix \(\Sigma\), we have:

\[
Y \sim \mathcal{N}(\mu, \Sigma)
\]

and consequently the joint distribution with the unobserved \(Y_{*}\):

\[
\begin{bmatrix} Y \\ Y_{*} \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu \\ \mu_{*} \end{bmatrix}, \begin{bmatrix} \Sigma & \Sigma_{*} \\ \Sigma_{*}^{T} & \Sigma_{**} \end{bmatrix} \right)
\]

OK, so how exactly do we model the covariance matrix, i.e. how our individual hyperparameter scores \(Y_i\) “relate” to each other, and how does that change in the posterior after scoring new hyperparameters?

## 3.2. Kernels: modelling distance in the hyperparameter space

In Bayesian optimization, we assume the covariance of two scores, \(Y_i\) and \(Y_j\), will depend on a covariance function applied to their corresponding hyperparameter sets, \(X_i\) and \(X_j\).

Recall the **kernel method** in machine learning: instead of performing an inner
product between two high-dimensional vectors \(X_i\) and \(X_j\) in a parameter
space, we may instead define a kernel \(k(X_i,X_j)\). As a simple covariance
function for our GP, we can assume a *squared exponential*:

\[
k(X_i, X_j) = \exp\!\left( -\tfrac{1}{2} \lVert X_i - X_j \rVert^2 \right)
\]

The *squared exponential* is a convenient way to model score covariance here:
hyperparameter sets that are very different will have scores that barely
co-vary while “nearby” hyperparameter sets will have highly co-varying scores
(ah, the beauty of \(e\)).

Building up from a covariance function, we can create a covariance matrix for our prior based on observed values \(X\):

\[
K = \begin{bmatrix} k(X_1, X_1) & \cdots & k(X_1, X_a) \\ \vdots & \ddots & \vdots \\ k(X_a, X_1) & \cdots & k(X_a, X_a) \end{bmatrix}
\]

And so our full joint distribution using kernels is:

\[
\begin{bmatrix} Y \\ Y_{*} \end{bmatrix} \sim \mathcal{N}\!\left( 0, \begin{bmatrix} K & K_{*} \\ K_{*}^{T} & K_{**} \end{bmatrix} \right)
\]

where \(K_{*} = k(X, X_{*})\), \(K_{**} = k(X_{*}, X_{*})\), and we assume a zero prior mean for simplicity.

See relevant machine learning literature (perhaps even a future post) about important properties and invariants of kernels.

## 3.3. Posterior predictive distribution

With the beauty of the Gaussian prior, we can now derive a closed form expression for the posterior predictive distribution of \(Y^{*} \mid X^{*}, Y, X\).

In Bayesian terminology, the MAP (maximum a posteriori) estimate for
\(Y^{*}\) would simply be the point value with the single highest probability in
the posterior distribution. While this is nice and simple, we’d prefer to
incorporate the *entire* posterior distribution to weigh over all possible
predictions for \(Y^{*}\). This is called *posterior predictive inference*.

Using some math (conditioning a multivariate Gaussian), we can write out the distribution used to draw \(Y^{*}\):

\[
Y^{*} \mid X^{*}, Y, X \sim \mathcal{N}(\mu^{*}, \Sigma^{*})
\]

where:

\[
\mu^{*} = K_{*}^{T} K^{-1} Y, \qquad \Sigma^{*} = K_{**} - K_{*}^{T} K^{-1} K_{*}
\]

To add one step, we assume that \(Y\) is drawn with some amount of Gaussian noise, i.e.:

\[
Y \sim \mathcal{N}\!\left(f(X), \sigma^2 I\right)
\]

where \(\sigma^2\) is the noise variance of some noise factor \(\epsilon \sim \mathcal{N}(0, \sigma^2)\) such that \(Y_i = f(X_i) + \epsilon\).

Then we can adjust our predictive equation to include this noise factor:

\[
\mu^{*} = K_{*}^{T} (K + \sigma^2 I)^{-1} Y, \qquad \Sigma^{*} = K_{**} - K_{*}^{T} (K + \sigma^2 I)^{-1} K_{*}
\]

## 3.4. Acquisition function: selecting what to try next

Ok, so we have a way of modelling our expensive hyperparameter scoring function. How do we then “intelligently” select the next hyperparameter set to try out given some past trials?

We introduce the idea of an **acquisition function**. Given some choice of
hyperparameter vector \(X_{try}\), the acquisition function computes a score
based on some criterion, having observed some past \(X\) and \(Y\). We then
select the \(X_{next}\) out of all possible \(X_{*}\) that maximizes (or
minimizes) this score.

An example of this stepwise ‘select-and-observe’ process is well illustrated in this graphic:

The most intuitive acquisition function is perhaps the *Probability of Improvement*
\((PI)\), which selects \(X_{next}\) as the point that has the highest
probability of an improved response value over the current observed best
\(X_{best}\). The acquisition function for potential points looks like:

\[
PI(X_{try}) = \Phi\!\left(\gamma(X_{try})\right), \qquad \gamma(X_{try}) = \frac{\mu(X_{try}) - f(X_{best})}{\sigma(X_{try})}
\]

where \(\Phi()\) is the cumulative distribution function of the standard normal, and \(\mu(X_{try})\) and \(\sigma(X_{try})\) are the posterior predictive mean and standard deviation at \(X_{try}\). Intuitively, this makes sense to use, since \(\gamma(X_{try})\) measures how many posterior standard deviations the predicted score at \(X_{try}\) sits above the current best.

Instead of finding the point that has some high probability of improving on
our best score, we could instead maximize the *Expected Improvement* \((EI)\).
I personally find this to be a powerful but simple method of picking the next
hyperparameter set:

\[
EI(X_{try}) = \sigma(X_{try}) \left( \gamma(X_{try}) \, \Phi(\gamma(X_{try})) + \phi(\gamma(X_{try})) \right)
\]

where \(\phi()\) is the probability density function of the standard normal.

Other interesting acquisition functions also exist that allow the user to tune exploration vs. exploitation of the hyperparameter space; see [2].

# 4. Code

Now onto some coding to see this stuff in action! We can quite easily
implement a basic version of Bayesian optimization, with accompanying
visualizations, in Python using `numpy`, `scipy`, and `matplotlib`. If you like
pretty plots, I would also recommend the `seaborn` library (an extension of
matplotlib).

## 4.1. Setup

The basic header of our script:
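The exact header depends on your setup; a minimal version (seaborn is optional) might look like:

```python
import numpy as np                  # linear algebra for the GP math
from scipy.stats import norm        # standard normal pdf/cdf for acquisition functions
import matplotlib.pyplot as plt     # plotting the posterior

# optional, for prettier default styles:
# import seaborn as sns
# sns.set()
```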

To ‘animate’ our Bayesian posteriors, we can turn on interactive mode in matplotlib:
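This is a one-line switch:

```python
import matplotlib.pyplot as plt

plt.ion()  # interactive mode: plt.draw() updates the open figure without blocking
```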

Some examples of basic kernel functions that we can use:
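For instance, a squared exponential kernel (with a length-scale knob I've added for illustration) and a rougher absolute-distance variant:

```python
import numpy as np

def squared_exponential(x1, x2, length_scale=1.0):
    """Covariance decays smoothly with the squared distance between points."""
    return np.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)

def absolute_exponential(x1, x2, length_scale=1.0):
    """A rougher alternative: covariance decays with absolute distance."""
    return np.exp(-np.abs(x1 - x2) / length_scale)
```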

And now for the three key functions we need to make this work…

The kernel matrix builder:
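A straightforward version (the function name is my own):

```python
import numpy as np

def kernel_matrix(X1, X2, kernel):
    """Build the covariance matrix K with K[i, j] = kernel(X1[i], X2[j])."""
    return np.array([[kernel(a, b) for b in X2] for a in X1])
```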

The posterior prediction function for a new hyperparameter set:
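One way to implement it, directly following the closed-form posterior above (helpers are redefined here so the snippet runs standalone):

```python
import numpy as np

def kernel(x1, x2):
    """Squared exponential kernel."""
    return np.exp(-0.5 * (x1 - x2) ** 2)

def kernel_matrix(X1, X2):
    return np.array([[kernel(a, b) for b in X2] for a in X1])

def predict(X_new, X, Y, noise_var=0.0):
    """Posterior mean and covariance of the scores at X_new given (X, Y)."""
    K = kernel_matrix(X, X) + noise_var * np.eye(len(X))
    K_s = kernel_matrix(X, X_new)          # covariance between observed and new points
    K_ss = kernel_matrix(X_new, X_new)     # covariance among the new points
    K_inv = np.linalg.inv(K)
    mu = K_s.T @ K_inv @ np.asarray(Y)
    cov = K_ss - K_s.T @ K_inv @ K_s
    return mu, cov
```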

And the posterior covariance-updater, given a new point:
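In this sketch the “update” is just bookkeeping: score the new point with the (expensive) objective and append it, so that the next call to the prediction function uses the enlarged covariance matrix:

```python
import numpy as np

def update_observations(X, Y, x_new, f):
    """Score x_new with the objective f and add it to the observed set."""
    X = np.append(X, x_new)
    Y = np.append(Y, f(x_new))
    return X, Y
```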

We can create a dummy ‘machine learning model’ that has a single hyperparameter
with a range of possible values from 0 to 10 and a simulated hyperparameter
scoring function `f`. We’ll also assume that we’ve already observed the
responses of 4 points:

Here `X_tst` will be the set of all hyperparameters whose scores we will
predict during the Bayesian optimization process. For now, we set our
noise variance to zero.
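The setup might look as follows; the scoring function `f` below is an arbitrary smooth stand-in I chose, not the one from the original experiment:

```python
import numpy as np

def f(x):
    """Simulated hyperparameter scoring function (an arbitrary smooth stand-in)."""
    return np.sin(x) + 0.5 * x

# 4 hyperparameter values we have already "tried", and their observed scores
X_obs = np.array([1.0, 3.0, 5.0, 9.0])
Y_obs = f(X_obs)

# the candidate hyperparameters whose scores we will predict
X_tst = np.linspace(0, 10, 200)

noise_var = 0.0  # assume noise-free observations for now
```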

## 4.2. Randomly select hyperparameters

The following code randomly samples points from `X_tst`, evaluates their
scores using `f`, and updates the Gaussian Process posterior one point at a
time, visualizing each update with 95% confidence intervals of the surrogate
response function at every candidate point.
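A condensed, standalone sketch of that loop (helpers redefined, plotting kept minimal, and an `Agg` backend forced so it also runs headless):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; drop this line when running interactively
import matplotlib.pyplot as plt

def kernel_matrix(X1, X2):
    """Vectorized squared-exponential kernel matrix."""
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2)

def predict(X_new, X, Y, noise_var=0.0):
    """Posterior mean and per-point standard deviation, as in the closed form above."""
    K = kernel_matrix(X, X) + noise_var * np.eye(len(X))
    K_s = kernel_matrix(X, X_new)
    K_inv = np.linalg.inv(K)
    mu = K_s.T @ K_inv @ np.asarray(Y)
    var = np.clip(np.diag(kernel_matrix(X_new, X_new) - K_s.T @ K_inv @ K_s), 0, None)
    return mu, np.sqrt(var)

def f(x):
    """Simulated hyperparameter scoring function (arbitrary stand-in)."""
    return np.sin(x) + 0.5 * x

X_obs = np.array([1.0, 3.0, 5.0, 9.0])
Y_obs = f(X_obs)
X_tst = np.linspace(0, 10, 200)

rng = np.random.default_rng(0)
for x_next in rng.choice(X_tst, size=5, replace=False):
    # score the randomly chosen hyperparameter and add it to the observed set
    X_obs = np.append(X_obs, x_next)
    Y_obs = np.append(Y_obs, f(x_next))

    mu, sd = predict(X_tst, X_obs, Y_obs)
    plt.clf()
    plt.plot(X_tst, mu, label="posterior mean")
    plt.fill_between(X_tst, mu - 1.96 * sd, mu + 1.96 * sd,
                     alpha=0.3, label="95% interval")
    plt.scatter(X_obs, Y_obs, color="k", zorder=3)
    plt.legend()
    plt.draw()  # with plt.ion() this updates the figure in place
```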

This looks like:

This shows exactly how the GP models the hyperparameter score as a function
of the hyperparameter choice. However, we’re again just randomly selecting
hyperparameters from the space of choices `X_tst`, without any acquisition
function to guide us…

## 4.3. Randomly select hyperparameters with noise

Now, let’s say that we think our machine learning model’s performance isn’t very stable. This is a reasonable assumption, since model performance usually differs depending on the data it’s trained on. For a “high-quality” training set, it might perform with 90% accuracy, while for a “lower-quality” sample, it might only yield 70% accuracy. Let’s set our noise variance to an arbitrary value and see how the confidence intervals for the objective function change:
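A quick standalone comparison of the noise-free and noisy posteriors at the observed points (the value 0.05 is an arbitrary choice):

```python
import numpy as np

def kernel_matrix(X1, X2):
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2)

def predict(X_new, X, Y, noise_var=0.0):
    # adding noise_var to the diagonal widens the posterior everywhere,
    # including at points we have already observed
    K = kernel_matrix(X, X) + noise_var * np.eye(len(X))
    K_s = kernel_matrix(X, X_new)
    K_inv = np.linalg.inv(K)
    mu = K_s.T @ K_inv @ np.asarray(Y)
    cov = kernel_matrix(X_new, X_new) - K_s.T @ K_inv @ K_s
    return mu, cov

X_obs = np.array([1.0, 3.0, 5.0, 9.0])
Y_obs = np.sin(X_obs) + 0.5 * X_obs

# noise-free vs noisy posterior variance at an observed point
_, cov_clean = predict(X_obs, X_obs, Y_obs, noise_var=0.0)
_, cov_noisy = predict(X_obs, X_obs, Y_obs, noise_var=0.05)
```

With zero noise the posterior variance collapses to zero at observed points; with noise it stays strictly positive there.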

We now predict hyperparameter scores with a little bit more uncertainty. Pretty cool! But we’re still just stabbing in the dark for hyperparameters. Let’s see what this looks like when we use our acquisition function to tell us which hyperparameter our model should try on next.

## 4.4. Fixed rounds of hyperparameter selection using EI

An implementation of the \(EI\) acquisition function:
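A sketch for the maximization case, using the closed form above (`scipy.stats.norm` supplies \(\Phi\) and \(\phi\)):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sd, y_best):
    """EI for maximization: how much do we expect to beat y_best at each point?"""
    sd = np.maximum(sd, 1e-9)            # avoid dividing by zero at observed points
    gamma = (mu - y_best) / sd
    return sd * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
```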

Now, instead of randomly selecting points from `X_tst`, we’ll pick the next point to evaluate using the acquisition function like so:
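For example (the posterior means and standard deviations below are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

# candidate hyperparameters and (made-up) GP posterior mean/std at each one
X_tst = np.array([0.0, 2.5, 5.0, 7.5, 10.0])
mu = np.array([0.0, 1.0, 3.0, 2.0, 1.0])
sd = np.array([0.5, 0.5, 0.5, 0.5, 0.5])
y_best = 2.0  # best observed score so far

sd_safe = np.maximum(sd, 1e-9)
gamma = (mu - y_best) / sd_safe
ei = sd_safe * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
x_next = X_tst[np.argmax(ei)]  # hyperparameter with the highest expected improvement
```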

The full code visualizing each update *as well as* the acquisition score for every hyperparameter choice (plotted on its own y-axis):
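A condensed standalone sketch (plotting kept minimal; I also use the empirical mean of \(Y\) as the GP prior mean, a practical tweak so the surrogate isn't pulled toward zero):

```python
import numpy as np
from scipy.stats import norm
import matplotlib
matplotlib.use("Agg")  # headless-safe; drop for interactive use
import matplotlib.pyplot as plt

def kernel_matrix(X1, X2):
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2)

def predict(X_new, X, Y, noise_var=0.05):
    """Noisy GP posterior; the empirical mean of Y serves as the prior mean."""
    Y = np.asarray(Y, float)
    m = Y.mean()
    K = kernel_matrix(X, X) + noise_var * np.eye(len(X))
    K_s = kernel_matrix(X, X_new)
    K_inv = np.linalg.inv(K)
    mu = m + K_s.T @ K_inv @ (Y - m)
    var = np.clip(np.diag(kernel_matrix(X_new, X_new) - K_s.T @ K_inv @ K_s), 0, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sd, y_best):
    sd = np.maximum(sd, 1e-9)
    gamma = (mu - y_best) / sd
    return sd * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

def f(x):
    return np.sin(x) + 0.5 * x  # arbitrary stand-in scoring function

X_obs = np.array([1.0, 3.0, 5.0, 9.0])
Y_obs = f(X_obs)
X_tst = np.linspace(0, 10, 200)

for _ in range(5):  # a fixed number of EI-guided rounds
    mu, sd = predict(X_tst, X_obs, Y_obs)
    ei = expected_improvement(mu, sd, Y_obs.max())
    x_next = X_tst[np.argmax(ei)]

    # surrogate on top, acquisition scores on their own axis below
    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
    ax1.plot(X_tst, mu)
    ax1.fill_between(X_tst, mu - 1.96 * sd, mu + 1.96 * sd, alpha=0.3)
    ax1.scatter(X_obs, Y_obs, color="k", zorder=3)
    ax2.plot(X_tst, ei)
    ax2.axvline(x_next, linestyle="--")
    plt.close(fig)  # with plt.ion() you would plt.draw() instead

    X_obs = np.append(X_obs, x_next)
    Y_obs = np.append(Y_obs, f(x_next))
```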

Here’s what that looks like:

Note that each new point (blue dot) is chosen so as to *maximize* the
acquisition function. Our acquisition function is now truly guiding us
in finding better hyperparameters!

## 4.5. Hyperparameter selection using EI threshold

Lastly, we want to remove any remaining hyperparameters *of our* hyperparameter
optimizer itself. So instead of running this search for some arbitrary number of
rounds, let’s perform the search until the maximum EI falls below some
reasonable threshold (e.g. 1.0):
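A standalone sketch; note the threshold scale is problem-dependent (on this toy objective EI values are small, so I use 0.01), and a safety cap guarantees termination:

```python
import numpy as np
from scipy.stats import norm

def kernel_matrix(X1, X2):
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2)

def predict(X_new, X, Y, noise_var=0.05):
    Y = np.asarray(Y, float)
    m = Y.mean()  # empirical mean as the GP prior mean (practical tweak)
    K = kernel_matrix(X, X) + noise_var * np.eye(len(X))
    K_s = kernel_matrix(X, X_new)
    K_inv = np.linalg.inv(K)
    mu = m + K_s.T @ K_inv @ (Y - m)
    var = np.clip(np.diag(kernel_matrix(X_new, X_new) - K_s.T @ K_inv @ K_s), 0, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sd, y_best):
    sd = np.maximum(sd, 1e-9)
    gamma = (mu - y_best) / sd
    return sd * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

def f(x):
    return np.sin(x) + 0.5 * x  # arbitrary stand-in scoring function

X_obs = np.array([1.0, 3.0, 5.0, 9.0])
Y_obs = f(X_obs)
X_tst = np.linspace(0, 10, 200)

threshold = 0.01  # problem-dependent: scale it to the units of your score
max_rounds = 50   # safety cap so the loop always terminates

for _ in range(max_rounds):
    mu, sd = predict(X_tst, X_obs, Y_obs)
    ei = expected_improvement(mu, sd, Y_obs.max())
    if ei.max() < threshold:
        break  # no candidate is expected to improve enough on the current best
    x_next = X_tst[np.argmax(ei)]
    X_obs = np.append(X_obs, x_next)
    Y_obs = np.append(Y_obs, f(x_next))

best = X_obs[np.argmax(Y_obs)]
```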

Our hyperparameter selector picks the optimal hyperparameter set in just two rounds!

# 5. Conclusion

In summary, this post serves as a light, but hearty introduction to Bayesian optimization, a better way to fine-tune your machine learning models than just running grid search and going off and taking a nap!

Putting together all the above code, here’s the full Python implementation of Bayesian optimization. We used just a single hyperparameter dimension, but this scales up just as well to a grid of hyperparameters. Try it out on a dummy model, or on a machine learning model that you’re actually trying to tune!

Some things I didn’t talk much about were kernel choices, the assumptions behind the Bayesian optimization response surface, and other methods of modelling the hyperparameter performance function. Check out the referenced papers for more details on those nuances. Cheers!

# References

[1] Rasmussen, C. E., and Williams, C. K. I. *Gaussian Processes for Machine Learning*. The MIT Press, 2006. Easily the most thorough introduction to GP applications.

[2] Snoek, J., Larochelle, H., and Adams, R. P. *Practical Bayesian Optimization of Machine Learning Algorithms*. Connects the dots between theory on GP and ML optimization.