What exactly gets updated with every trial?
For a Gaussian process (GP), Bayes' theorem reads

$$P(f \mid D_{1:t}) = \frac{P(D_{1:t} \mid f)\, P(f)}{P(D_{1:t})}$$

where $f$ is a candidate function (one possible shape of the objective) and $D_{1:t} = \{(x_1, y_1), \dots, (x_t, y_t)\}$ is all observed data.
- **Prior $P(f)$**: the GP prior, defined by the kernel (e.g. RBF). Chosen once at the start, it says "the function is probably smooth" and does not change across trials.
- **Likelihood $P(D_{1:t} \mid f)$**: how well does this candidate function pass through all observed points? With 3 data points, many smooth curves could fit; with 20, very few can. It gets more constraining every trial, forcing the posterior to narrow.
- **Evidence $P(D_{1:t})$**: the normalizing constant. It makes the posterior integrate to 1; it changes numerically but drives nothing.
- **Posterior $P(f \mid D_{1:t})$**: the updated belief about what $f$ looks like everywhere. It gets sharper every trial.
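A tiny numerical sketch of the narrowing posterior (illustrative only: NumPy, an RBF kernel with unit length scale, and evenly spaced observations on $[0, 2]$ are all assumptions, not any particular library's API). One detail it exposes: for a GP, the posterior variance depends only on *where* you observed, not on the $y$ values.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    """RBF kernel matrix between two 1-D arrays of points."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)

def gp_posterior_var(x_train, x_test, noise=1e-6):
    """Posterior variance of a zero-mean GP at x_test.

    The variance depends only on the observation *locations*,
    which is why no y values appear here.
    """
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_test)
    Kss = rbf(x_test, x_test)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return np.diag(cov)

x_test = np.array([0.5])
variances = {}
for n in (3, 20):                      # more data -> tighter posterior
    x_train = np.linspace(0.0, 2.0, n)
    variances[n] = float(gp_posterior_var(x_train, x_test)[0])

print(variances)   # variance at x=0.5 shrinks sharply from n=3 to n=20
```

The prior (the kernel) never changes across the two runs; only the likelihood term, carried by the growing set of observation locations, squeezes the posterior.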
TPE flips the perspective. Instead of modeling $P(f \mid x)$, it models $P(x \mid y)$ — the probability of inputs given good or bad scores.
- **Prior $P(x)$**: the initial belief that any $x$ could be good. It starts uniform, is only truly in effect during the startup trials, and fades as data accumulates.
- **Likelihood $\ell(x)$, $g(x)$**: $\ell(x) = P(x \mid y \in \text{good})$ is the KDE density of inputs that scored well; $g(x) = P(x \mid y \in \text{bad})$ is the density of the rest. Both are rebuilt from scratch every trial with more data. The acquisition function $\ell(x) / g(x)$ is the likelihood ratio: how much more likely is this $x$ to come from the good distribution than from the bad one?
- **Evidence**: normalizer. Bookkeeping.
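A minimal sketch of the $\ell(x)/g(x)$ mechanism in one dimension, assuming a hand-rolled Gaussian KDE with a fixed bandwidth (real TPE implementations use adaptive bandwidths and handle categorical and log-scale parameters; the data points here are made up):

```python
import math

def kde(points, bandwidth=0.3):
    """Return a Gaussian kernel density estimator built on `points`."""
    def density(x):
        return sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
            / (bandwidth * math.sqrt(2 * math.pi))
            for p in points
        ) / len(points)
    return density

# Pretend trials near x = 1.0 scored well and trials near x = 4.0 scored badly.
good_xs = [0.9, 1.0, 1.1, 1.2]   # inputs from the best fraction of trials
bad_xs  = [3.8, 4.0, 4.1, 4.3]   # inputs from the rest

l = kde(good_xs)   # l(x): density of good inputs
g = kde(bad_xs)    # g(x): density of bad inputs

def acquisition(x):
    """TPE proposes the candidate with the highest l(x)/g(x)."""
    return l(x) / g(x)

print(acquisition(1.0), acquisition(4.0))   # the good-cluster candidate wins
```

Both densities get rebuilt each trial as new $(x, y)$ pairs land in the good or bad bucket; the ratio is the only thing the proposal step looks at.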
For SMAC, the same update runs over a surrogate: $P(\hat{f} \mid D_{1:t}) = \frac{P(D_{1:t} \mid \hat{f})\, P(\hat{f})}{P(D_{1:t})}$, where $\hat{f}$ is a random forest model that approximates $f(x)$.
- **Prior $P(\hat{f})$**: structural, set once via tree depth and number of trees. The implicit assumption is "the function can be approximated by a piecewise-constant function" (which is what decision trees produce).
- **Likelihood $P(D_{1:t} \mid \hat{f})$**: how well does this forest predict all observed scores? The forest is retrained on all data every trial; with more data it gets better at predicting which regions yield high scores, and the variance across trees gives uncertainty estimates for exploration.
- **Evidence $P(D_{1:t})$**: normalizer. Bookkeeping.
- **Posterior $P(\hat{f} \mid D_{1:t})$**: the updated belief about $f(x)$, expressed as forest predictions plus variance across trees.
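The "variance across trees" idea can be sketched with a toy ensemble. This is a sketch, not SMAC's actual model: real SMAC grows full random forests with proper split criteria, while here each tree is a depth-1 stump at a random threshold on a bootstrap resample, and all names are made up.

```python
import random
import statistics

def fit_stump(xs, ys):
    """Depth-1 regression tree: split at a random threshold, predict each side's mean."""
    t = random.uniform(min(xs), max(xs))
    left  = [y for x, y in zip(xs, ys) if x <= t] or ys
    right = [y for x, y in zip(xs, ys) if x >  t] or ys
    ml, mr = statistics.mean(left), statistics.mean(right)
    return lambda x: ml if x <= t else mr

def fit_forest(xs, ys, n_trees=50):
    """Bootstrap-resample the data for each tree (the 'random' in random forest)."""
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(len(xs)) for _ in xs]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict(forest, x):
    """Mean prediction plus variance across trees: the uncertainty SMAC explores with."""
    preds = [tree(x) for tree in forest]
    return statistics.mean(preds), statistics.variance(preds)

random.seed(0)
xs = [i / 10 for i in range(11)]             # observed configs in [0, 1]
ys = [0.0 if x < 0.5 else 1.0 for x in xs]   # scores jump at x = 0.5
forest = fit_forest(xs, ys)

mean_lo, var_lo = predict(forest, 0.1)
mean_hi, var_hi = predict(forest, 0.9)
print((mean_lo, var_lo), (mean_hi, var_hi))
```

Retraining the whole ensemble on the full data each trial is exactly the likelihood term tightening; the per-point disagreement between trees is what stands in for posterior uncertainty.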
All terms are identical to TPE, because BOHB uses TPE internally for the Bayesian part:

- **Prior**: uniform over inputs.
- **Likelihood**: the same $\ell(x)$ and $g(x)$ mechanism as TPE.
- **Evidence**: normalizer.
The addition of Hyperband is not Bayesian — it's a resource allocation trick that decides how much budget (epochs, time) to give each trial before killing it. It makes the loop faster without changing which Bayes' theorem term is being updated.
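Hyperband's core mechanism, successive halving, fits in a few lines. This is illustrative only: the `score` function, the learning-rate configs, and `eta=3` are made-up assumptions, and real Hyperband additionally loops over several brackets with different starting budgets.

```python
def successive_halving(configs, score, min_budget=1, eta=3, rounds=3):
    """One bracket: evaluate survivors, keep the top 1/eta, give them eta x more budget."""
    survivors = list(configs)
    budget = min_budget
    for _ in range(rounds):
        # Evaluate every surviving config at the current budget, best first.
        ranked = sorted(survivors, key=lambda c: score(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors

# Hypothetical setup: a config is a learning rate; the "validation score" after
# `budget` epochs rewards rates near 0.1, with noise omitted for clarity.
def score(lr, budget):
    return -abs(lr - 0.1) * (1 + 1 / budget)   # more budget -> cleaner estimate

configs = [0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
print(successive_halving(configs, score))   # -> [0.1]
```

Note that nothing Bayesian happens here: no prior, likelihood, or posterior is touched. The bracket only decides which trials get to keep spending budget.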
CMA-ES maintains a search distribution $N(\mu, C)$, where $\mu$ is the mean of the search distribution and $C$ is its covariance matrix.
- **Prior $N(\mu, C)$**: a bell curve in multiple dimensions defined by mean $\mu$ and covariance $C$. Unlike all other methods, this changes every trial: the center $\mu$ moves toward the best points, and the shape $C$ stretches in promising directions and shrinks in bad ones. The search distribution itself physically moves and reshapes.
- **Likelihood**: how well did samples from this Gaussian score? It informs the update but isn't the main thing being refined.
- **Evidence**: normalizer.
- **Posterior**: the updated $N(\mu, C)$. The mean moves toward winners; the covariance elongates in promising directions.
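A heavily simplified sketch of the move-and-reshape loop. This is a plain refit-to-the-winners update, not real CMA-ES (which uses evolution paths, rank-$\mu$ updates, and step-size control); the x2 covariance inflation is a crude stand-in for that machinery, and the quadratic objective is made up.

```python
import numpy as np

def es_step(mu, C, objective, pop=40, elite=10, rng=None):
    """One generation: sample from N(mu, C), keep the best, refit mu and C."""
    if rng is None:
        rng = np.random.default_rng()
    xs = rng.multivariate_normal(mu, C, size=pop)
    order = np.argsort([objective(x) for x in xs])   # ascending: best first
    winners = xs[order[:elite]]
    new_mu = winners.mean(axis=0)                    # the mean moves toward winners
    centered = winners - new_mu
    # Refit the covariance to the winners' spread; the x2 inflation is a crude
    # stand-in for CMA-ES step-size control, keeping exploration alive.
    new_C = 2.0 * centered.T @ centered / elite + 1e-8 * np.eye(len(mu))
    return new_mu, new_C

objective = lambda x: (x[0] - 3.0) ** 2 + (x[1] + 2.0) ** 2   # optimum at (3, -2)
rng = np.random.default_rng(0)
mu, C = np.zeros(2), np.eye(2)
for _ in range(30):
    mu, C = es_step(mu, C, objective, rng=rng)

print(mu)   # drifts toward (3, -2) as the distribution moves and reshapes
```

Notice what gets updated: not a model of $f$, but the sampling distribution itself, which is why the "prior" column is the moving part in the table below.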
CMA-ES is the least "Bayesian" of the five. Some wouldn't call it Bayesian optimization at all — it's more of an evolutionary method that happens to use a probabilistic model to guide search.
| Method | Prior $P(A)$ | Likelihood $P(B \mid A)$ | Evidence $P(B)$ | Term that drives the update |
|---|---|---|---|---|
| GP | Kernel (fixed) | Data fit to function (grows) | Normalizer | Likelihood |
| TPE | Uniform (fixed) | KDE densities $\ell$, $g$ (rebuilt) | Normalizer | Likelihood |
| SMAC | Tree structure (fixed) | Forest fit to data (retrained) | Normalizer | Likelihood |
| BOHB | Same as TPE | Same as TPE | Normalizer | Likelihood |
| CMA-ES | $N(\mu, C)$ (moves!) | Sample scores | Normalizer | Prior |