Bayes' Theorem Mapped to 5 Optimization Methods

What exactly gets updated with every trial?

The Template

$P(A \mid B) = \dfrac{P(B \mid A) \;\cdot\; P(A)}{P(B)}$
Engine — drives the update
Fixed — set once, stays
Bookkeeping — just normalizes
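Before mapping these roles onto real methods, the template can be sanity-checked on a toy discrete example (all numbers here are made up purely for illustration):

```python
# Toy discrete Bayes update: one hypothesis A versus its complement.
# All numbers are illustrative, not tied to any method below.
prior_A = 0.3                       # P(A): fixed, set before seeing B
like_B_given_A = 0.8                # P(B | A): the engine
like_B_given_not_A = 0.2            # P(B | not A)

# Evidence P(B) is pure bookkeeping: it only makes the posterior sum to 1.
evidence = like_B_given_A * prior_A + like_B_given_not_A * (1 - prior_A)

posterior_A = like_B_given_A * prior_A / evidence
print(round(posterior_A, 4))        # 0.6316: seeing B raised P(A) from 0.3
```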
Method 1

GP — Gaussian Process

Engine: Likelihood
$P(f \mid D_{1:t}) = \dfrac{P(D_{1:t} \mid f) \;\cdot\; P(f)}{P(D_{1:t})}$

Where $f$ = a candidate function (one possible shape of the objective), and $D_{1:t}$ = all observed data $\{(x_1, y_1), \dots, (x_t, y_t)\}$.

Prior — $P(f)$ — Fixed

GP prior, defined by kernel (e.g. RBF). Chosen once at the start. Says "the function is probably smooth." Does not change across trials.

Likelihood — $P(D_{1:t} \mid f)$ — Engine 🔥

How well does this candidate function pass through all observed points? With 3 data points, many smooth curves could fit. With 20, very few can. Gets more constraining every trial, forcing the posterior to narrow.

Evidence — $P(D_{1:t})$ — Bookkeeping

Normalizing constant. Makes the posterior integrate to 1. Changes numerically but drives nothing.

Posterior $P(f \mid D_{1:t})$ = updated belief about what $f$ looks like everywhere. Gets sharper every trial.
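A minimal numpy sketch of this conditioning, assuming an RBF kernel and nearly noise-free observations — the function names and toy data are mine, not from any particular library:

```python
import numpy as np

def rbf(a, b, length=1.0):
    # RBF kernel: encodes the fixed prior "f is probably smooth".
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    # Condition the GP prior on the observed data D_{1:t}.
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf(x_train, x_query)
    K_ss = rbf(x_query, x_query)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha                            # posterior mean
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)    # posterior covariance
    return mean, np.diag(cov)

x = np.array([0.0, 1.0, 2.0])
y = np.sin(x)
mean, var = gp_posterior(x, y, np.array([1.0, 5.0]))
# Near an observed point the posterior is confident (tiny variance);
# far from the data it reverts toward the prior (variance near 1 here).
```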

Method 2

TPE — Tree-structured Parzen Estimator

Engine: Likelihood
$P(x \in \text{good} \mid D_{1:t}) = \dfrac{P(D_{1:t} \mid x \in \text{good}) \;\cdot\; P(x \in \text{good})}{P(D_{1:t})}$

TPE flips the perspective. Instead of modeling the score given the inputs, $P(y \mid x)$, it models the inputs given the score, $P(x \mid y)$ — the probability of inputs given good or bad scores.

Prior — $P(x \in \text{good})$ — Fixed

Initial belief that any $x$ could be good. Starts uniform. Only truly in effect during startup trials. Fades as data accumulates.

Likelihood — $P(D_{1:t} \mid x \in \text{good})$ = $\ell(x)$ — Engine 🔥

This is the KDE density of good inputs. Similarly $P(D_{1:t} \mid x \in \text{bad}) = g(x)$. Both are rebuilt from scratch every trial with more data. The acquisition function $\ell(x) / g(x)$ is the likelihood ratio — how much more likely is this $x$ to come from the good distribution versus the bad one?

Evidence — $P(D_{1:t})$ — Bookkeeping

Normalizer. Bookkeeping.
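The $\ell(x)/g(x)$ mechanism can be sketched with `scipy`'s Gaussian KDE. The toy objective, the split quantile, and the grid of candidates below are all my own stand-ins, not TPE's actual implementation details:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Pretend history: 30 trials of a 1-D hyperparameter x with scores y.
x_hist = rng.uniform(0, 10, 30)
y_hist = -(x_hist - 3.0) ** 2 + rng.normal(0, 0.5, 30)   # best near x = 3

# Split trials into "good" and "bad" by a score quantile (gamma).
gamma = 0.25
cut = np.quantile(y_hist, 1 - gamma)
good, bad = x_hist[y_hist >= cut], x_hist[y_hist < cut]

l = gaussian_kde(good)   # l(x): density of inputs that scored well
g = gaussian_kde(bad)    # g(x): density of inputs that scored poorly

# Acquisition = likelihood ratio l(x)/g(x); propose the candidate maximizing it.
cand = np.linspace(0, 10, 200)
best = cand[np.argmax(l(cand) / g(cand))]   # lands near x = 3
```

Both KDEs are rebuilt from the full history on every trial, which is exactly why the likelihood is the engine here.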

Method 3

SMAC — Random Forest

Engine: Likelihood
$P(\hat{f} \mid D_{1:t}) = \dfrac{P(D_{1:t} \mid \hat{f}) \;\cdot\; P(\hat{f})}{P(D_{1:t})}$

Where $\hat{f}$ = a random forest model that approximates $f(x)$.

Prior — $P(\hat{f})$ — Fixed

Structural prior: tree depth, number of trees. Set once. The implicit assumption is "the function can be approximated by a piecewise-constant function" (which is what decision trees do).

Likelihood — $P(D_{1:t} \mid \hat{f})$ — Engine 🔥

How well does this forest predict all observed scores? Retrained on all data every trial. With more data, the forest gets better at predicting which regions yield high scores. Variance across trees gives uncertainty estimates for exploration.

Evidence — $P(D_{1:t})$ — Bookkeeping

Normalizer. Bookkeeping.

Posterior $P(\hat{f} \mid D_{1:t})$ = updated belief about $f(x)$, expressed as forest predictions + variance across trees.
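In code, the "variance across trees" is just the spread of per-tree predictions. A sketch with scikit-learn's `RandomForestRegressor` (the toy objective is mine, and SMAC's actual surrogate differs in details):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Observed trials: a 1-D hyperparameter and its noisy score.
X = rng.uniform(0, 10, (40, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 40)

# The surrogate f-hat: retrained from scratch on all data each trial.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

x_query = np.array([[2.5], [7.5]])
per_tree = np.stack([t.predict(x_query) for t in forest.estimators_])
mean = per_tree.mean(axis=0)   # prediction of the piecewise-constant fit
std = per_tree.std(axis=0)     # disagreement across trees ~ uncertainty
```

The `mean`/`std` pair plays the same role as the GP's posterior mean and variance in the acquisition function.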

Method 4

BOHB — Bayesian Optimization + Hyperband

Engine: Likelihood
$P(x \in \text{good} \mid D_{1:t}) = \dfrac{P(D_{1:t} \mid x \in \text{good}) \;\cdot\; P(x \in \text{good})}{P(D_{1:t})}$

All terms identical to TPE. BOHB uses TPE internally for the Bayesian part.

Prior — Same as TPE — Fixed

Uniform prior over inputs.

Likelihood — Same KDE densities as TPE — Engine 🔥

Same $\ell(x)$ and $g(x)$ mechanism as TPE.

Evidence — Same as TPE — Bookkeeping

Normalizer.

The addition of Hyperband is not Bayesian — it's a resource allocation trick that decides how much budget (epochs, time) to give each trial before killing it. It makes the loop faster without changing which Bayes' theorem term is being updated.
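Hyperband's core subroutine, successive halving, is easy to sketch on its own. The configurations and the toy `evaluate` function below are stand-ins; real BOHB draws its configurations from the TPE sampler:

```python
# Successive halving: the resource-allocation half of BOHB (sketch).
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    budget = min_budget
    while len(configs) > 1:
        # Evaluate every surviving config at the current budget...
        scores = [evaluate(c, budget) for c in configs]
        # ...keep the top 1/eta, give the rest no further budget.
        keep = max(1, len(configs) // eta)
        ranked = sorted(zip(scores, configs), reverse=True)
        configs = [c for _, c in ranked[:keep]]
        budget *= eta
    return configs[0]

# Toy: "score" improves with budget; the config closest to 0.6 wins.
best = successive_halving(
    configs=[0.1, 0.3, 0.55, 0.6, 0.9],
    evaluate=lambda c, b: -abs(c - 0.6) * (1 + 1 / b),
)
print(best)  # 0.6
```

Note that nothing here touches the Bayesian update — it only decides which trials get more budget.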

Method 5

CMA-ES — Covariance Matrix Adaptation

Engine: Prior
$P(\mu, C \mid D_{1:t}) = \dfrac{P(D_{1:t} \mid \mu, C) \;\cdot\; P(\mu, C)}{P(D_{1:t})}$

Where $\mu$ = mean of the search distribution, $C$ = covariance matrix of the search distribution.

Prior — $P(\mu, C)$ = Gaussian $N(\mu, C)$ — Engine 🔥

A multivariate bell curve defined by mean $\mu$ and covariance $C$. Unlike in all the other methods, this prior changes every trial: the center ($\mu$) moves toward the best points, and the shape ($C$) stretches along promising directions and shrinks along unpromising ones. The search distribution itself moves and reshapes.

Likelihood — $P(D_{1:t} \mid \mu, C)$ — Supporting role

How well did samples from this Gaussian score? Informs the update but isn't the main thing being refined.

Evidence — $P(D_{1:t})$ — Bookkeeping

Normalizer.

$\mu_{t+1} = \displaystyle\sum_{i=1}^{k} w_i \cdot x_i^{\text{best}}$

$C_{t+1} = \displaystyle\sum_{i=1}^{k} w_i \,\bigl(x_i^{\text{best}} - \mu_t\bigr)\bigl(x_i^{\text{best}} - \mu_t\bigr)^{\top}$ — the weighted covariance of the top-$k$ samples

The mean moves toward winners. The covariance elongates in promising directions.
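A stripped-down numpy sketch of these two updates, using rank-based weights but omitting step-size control and evolution paths (real CMA-ES adds both; the toy objective is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.5, -1.0])          # toy optimum, for illustration only

def objective(X):
    # Toy 2-D objective, maximized at `target`.
    return -np.sum((X - target) ** 2, axis=1)

mu = np.zeros(2)                        # mean of the search distribution
C = np.eye(2)                           # covariance of the search distribution
n_samples, k = 40, 10
w = np.log(k + 0.5) - np.log(np.arange(1, k + 1))
w /= w.sum()                            # rank-based weights, best sample first

for _ in range(40):
    X = rng.multivariate_normal(mu, C, n_samples)  # sample from the prior
    top = X[np.argsort(objective(X))[::-1][:k]]    # top-k samples by score
    d = top - mu                                   # deviations from old mean
    mu = w @ top                                   # mean moves toward winners
    C = (w[:, None] * d).T @ d                     # weighted covariance update

# After the loop, mu should sit near `target` and C should have shrunk.
```

Every iteration rewrites $\mu$ and $C$ — the "prior" is literally the object being optimized, which is the sense in which the prior is the engine here.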

CMA-ES is the least "Bayesian" of the five. Some wouldn't call it Bayesian optimization at all — it's more of an evolutionary method that happens to use a probabilistic model to guide search.

The Pattern

| Method | Prior $P(A)$ | Likelihood $P(B \mid A)$ | Evidence $P(B)$ | Engine |
|---|---|---|---|---|
| GP | Kernel (fixed) | Data fit to function (grows) | Normalizer | Likelihood |
| TPE | Uniform (fixed) | KDE densities $\ell$, $g$ (rebuilt) | Normalizer | Likelihood |
| SMAC | Tree structure (fixed) | Forest fit to data (retrained) | Normalizer | Likelihood |
| BOHB | Same as TPE | Same as TPE | Normalizer | Likelihood |
| CMA-ES | $N(\mu, C)$ (moves!) | Sample scores | Normalizer | Prior |