# Overview of Bayesian Optimization

Published 14 Sep 2016

## Problem

The problem of hyperparameter optimization (i.e. configuring model parameters) in machine learning is very real and very painful.

The hyperparameter search process is essentially a “exploitation-exploration” trade-off: for how long are we willing to try out different parameters for our model in order to find an optimal configuration that’s ‘good enough’?

This particular optimization problem is difficult. Our objective function - the way we decide how good a set of hyperparameters is for our model - is often a long process involving training and validating the model to get its out-of-sample accuracy (or F1-score, etc.) and deciding if it’s ‘good enough’. Imagine doing that thousands of times! For this reason, grid search - exhaustively searching through all possible hyperparameter configurations - might give you a stellar model in the end, but you’ll be waiting… for a long time.

Random search makes the wait less painful. Previous work shows that approximately 60 rounds of random search can produce as optimal results as an entire grid search for certain models. As it turns out, not all hyperparameters are important to tune. The graphic below illustrates this.

Although this approach is nicer, it still feels like we’re stabbing in the dark for hyperparameters. More than that, as our models get more and more computationally expensive and our training/validation routines get more complex, it doesn’t feel like we’re being smart about which hyperparameter set our model should try next … given the ones it’s already tried on for size.

How can we do better?

## The Bayesian way

Bayesian optimization lends itself quite perfectly for our problem of machine learning configuration for two reasons:

(1) it allows us to model the response surface of our model’s performance using a cheaper surrogate function.

(2) it can perform the minimal number of search rounds until there is no further expected improvement in the true objective function.

There are many different ways to create this “surrogate” function using various estimators (random forests, etc.).

In the standard approach to Bayesian optimization, we construct this surrogate function for the objective using a prior and with each evaluation of the true response surface, we can perform posterior predictive inference on the performance of hyperparameters we haven’t tried yet. In this manner, we can beautifully arrive at an optimum with some amount of confidence.

## How it works

Our problem is defined as follows. Given some “tried” hyperparameter sets $X$, and their observed response values (model performance) $Y$:

Can we conditionally predict the response values $Y_{*}=[Y_{a+1}, ... ,Y_{b}]$ of new hyperparameters $X_{*}=[X_{a+1}, ... ,X_{b}]$ ?

### Gaussian Process

The primary mechanism that allows us to find this posterior is the Gaussian Process (GP) . A GP is defined by any finite set of points $Z$ such that any subset of points $\{z_1, ... , z_m\}$ where $z_i \in Z$ are random variables drawn from a multivariate Gaussian distribution.

The clever thing about the structure of Gaussian Processes is consistency. If the GP assumes:

it follows that:

where $y_1$ is sampled from the corresponding mean subvector $\mu_1$ and covariance submatrix $\Sigma_{11}$ of the larger Gaussian.

Indeed, this means implies that if we assume a Gaussian Process prior for the generation of our observed $Y$, then for some covariance matrix, we have:

and consequently the joint distribution:

Ok. So how do we exactly model the covariance matrix, i.e. how our individual hyperparameter scores $Y_i$ “relate” to each other and how that changes in the posterior after scoring new hyperparameters?

### Kernel method

In Bayesian optimization, we assume the covariance of two scores, $Y_i$ and $Y_j$, will depend on a covariance function applied to their corresponding hyperparameter sets, $X_i$ and a $X_j$.

Recall the kernel method in machine learning: instead of performing an inner product between two high dimensional vectors $X_i$ and $X_j$ in a parameter space, we may instead define a kernel $k(X_i,X_j)$. For a simple covariance function to our GP gaussian, we can assume a squared exponential:

The squared exponential is a convenient way to model score covariance here: hyperparameter sets that are very different will have scores that barely co-vary while “nearby” hyperparameter sets will have highly co-varying scores (ah, the beauty of $e$).

Building up from a covariance function, we can create a covariance matrix for our prior based on observed values $X$:

And so our full joint distribution using kernels is:

See relevant machine learning literature (perhaps even a future post) about important properties and invariants of kernels.

### Posterior predictive distribution

With the beauty of the Gaussian prior, we can now derive the following closed form expression for the posterior predictive distribution for $Y^{*} | X^{*}, Y, X$.

In Bayesian terminology, the MAP (maximum a posteriori) estimate for $Y^{*}$ would simply be the point value with the single highest probability in the posterior distribution. While this is nice and simple, we’d prefer to incorporate the entire posterior distribution to weight over all all possible predictions for $Y^{*}$. This is called posterior predictive inference.

Using some math, we can write out the equation used to draw $Y^{*}$:

where:

To add one step, we assume that $Y$ is drawn with some amount of Gaussian noise. I.e.

where $\sigma^2$ is the noise variance of some noise factor $\epsilon$ such that $Y_i = f(X_i) + \epsilon$.

Then we can adjust our predictive equation to include this noise factor:

### Acquisition function

Ok, so we have a way of modelling our expensive hyperparameter scoring function. How do we then “intelligently” select the next hyperparameter set to try out given some past trials?

We introduce the idea of an acquisition function. Given some choice of hyperparameter vector $X_{try}$, the acquisition function computes some loss based on some criteria, having observed some past $X$ and $Y$. We select the $X_{next}$ out of all possible $X_{*}$ that maximizes (or minimizes) this loss.

An example of this stepwise ‘select-and-observe’ process is well illustrated in this graphic:

The most intuitive acquisition function is perhaps Probability of Improvement $(PI)$, which selects $X_{next}$ as the point that has the highest probability of an improved response value over the current observed best $X_{best}$. The acquisition function for potential points looks like:

where $\Phi()$ is the cumulative distribution for the standard normal. Intuitively, this makes sense to use, since $\gamma(X_{try})$ is the Z score of $f(X_{try})$ in the the Gaussian Process’s gaussian distribution.

Instead of finding the point that has some high probability estimate of improving our best score, we could instead maximize the Expected Improvement $(EI)$. I personally find this to be a powerful, but simple method of picking the next hyperparameter set:

Other interesting acquisition functions also exist that allow the user to tune exploration vs. exploitation of the hyperparameter space; see [2].

## Code

Now onto some coding to see this stuff in action! We can actually quite easily implement a basic version of Bayesian optimization with accompanying visualizations in Python using numpy, scipy, matplotlib. If you like pretty plots, I would recommend using the seaborn library (an extension of matplotlib).

### Setup

The basic header of our script:

To ‘animate’ our Bayesian posteriors, we can turn on interactive mode in matplotlib:

Some examples of basic kernel functions that we can use:

And now for the three key functions we need to make this work…

The kernel matrix builder:

The posterior prediction function for a new hyperparameter set:

And the posterior covariance-updater, given a new point:

We can create a dummmy ‘machine learning model’ that has a single hyperparameter with a range of possible values from 0 to 10 and a simulated hyperparameter scoring function f. We’ll also assume that we’ve already observed the responses of 4 points:

Here X_tst will be the set of all hyperparameters that we will consequently predict the scores of in our Bayesian optimization process. For now, we set our noise variance to zero.

### Randomly select hyperparameters

The following code will randomly sample points from X_tst, evaluate their scores using f, and update the Gaussian Process posterior one point at a time, visualizing each update with 95% confidence intervals of the surrogate response function for each point.

This looks like:

This shows exactly how the GP models hyperparameter score as a function of the hyperparameter choice. However, we’re again just randomly selecting hyperparameters from the space of choices X without using any acquisition functions to guide us…

### Randomly select hyperparameters with noise

Now, let’s say that we think our machine learning model’s performance isn’t very stable. This is a reasonable assumption since model performance usually differs depending on the data it’s trained on. For a “high-quality” training set, it might perform with 90% accuracy, while for a “lower-quality” sample, it might only yield 70% accuracy. Let’s set our noise variance to an arbitrary value and see how the confidence intervals for the objective function changes:

We now predict hyperparameter scores with a little bit more uncertainty. Pretty cool! But we’re still just stabbing in the dark for hyperparameters. Let’s see what this looks like when we use our acquisition function to tell us which hyperparameter our model should try on next.

### Fix rounds of hyperparameter selection using EI

An implementation of the $EI$ acquisition function:

Now, instead of randomly selecting points from X_tst, we’ll pick the next point to evaluate using the acquisition function like so:

The full code visualizing each update as well as the acquisition score for all hyperparameter choices (not on the same y-axis):

Here’s what that looks like:

Note that each new point (blue dot) is chosen so as to maximize the acquisition function. Our acquistion function is now truly guiding us in finding better hyperparameters!

### Hyperparameter selection using EI threshold

Lastly, we want to completely remove any hyperparameters on our hyperparameter optimizer. So instead of doing this search for some arbitrary number of rounds, let’s perform the search until the EI is below some reasonable threshold (e.g. 1.0):

Our hyperparameter selector picks the optimal hyperparameter set in just two rounds!

# Conclusion

In summary, this post serves as a light, but hearty introduction to Bayesian optimization, a better way to fine-tune your machine learning models than just running grid search and going off and taking a nap!

Putting together all the above code, here’s the full Python implementation of Bayesian optimization. We used just a single hyperparameter dimension, but this would very well scale up to a grid of hyperparameters. Try it out for some dummy model or a machine learning model that you’re trying to tune!

Some stuff I didn’t really talk a lot about was kernel, assumptions behind the Bayesian optimization response surface, and other methods of modelling the hyperparameter performance function. Check out the related papers for more details on those nuances. Cheers!

[1] Rasmussen E., and Williams C. Gaussian Processes for Machine Learning. The MIT Press, 2006. Easily the most thorough introduction to GP applications.

[2] Snoeck J., Larochelle H., and Adams. R. Practical Bayesian Optimization of Machine Learning Algorithms. Connects the dots between theory on GP and ML optimization.