Multi-Fidelity Optimizers#

This section concerns optimizers that utilize Multi-Fidelity information to guide the search process. Multi-Fidelity is explained in detail here.

1 Successive Halving#

Successive Halving/SH (see paper) is a simple but effective Multi-Fidelity algorithm.

It starts with a large number of random configurations and evaluates them at a low fidelity. The best-performing \(1/\eta\) fraction of configurations is then promoted to the next fidelity, where they are evaluated again. This process is repeated until only a few configurations remain, evaluated at the highest fidelity. The scheme allows for broad exploration in the beginning and a focus on the most promising configurations towards the end.
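The core loop can be sketched in a few lines of Python. This is an illustrative sketch, not the NePS implementation; `evaluate` and `sample_config` are hypothetical user-supplied callables.

```python
import random

def successive_halving(evaluate, sample_config, n=27, eta=3, fidelities=(1, 3, 9, 27)):
    """Minimal Successive Halving sketch (illustrative, not the NePS implementation).

    evaluate(config, fidelity) -> loss (lower is better)
    sample_config()            -> one random configuration
    """
    configs = [sample_config() for _ in range(n)]
    for fidelity in fidelities:
        # Evaluate every surviving configuration at the current fidelity
        # and keep only the best 1/eta fraction for the next level.
        ranked = sorted(configs, key=lambda c: evaluate(c, fidelity))
        configs = ranked[: max(1, len(ranked) // eta)]
    return configs[0]  # best configuration at the highest fidelity
```

Note how the per-configuration budget grows by \(\eta\) at each level while the pool shrinks by the same factor, keeping the cost per level roughly constant.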

See the algorithm's implementation details in the api.

Practical Tips
  • For the same total compute, SH outperforms uninformed search algorithms like random search or grid search.
  • Its performance depends strongly on the correlation between lower and higher fidelities. If the correlation is low, SH underperforms.
  • SH has two parameters: \(\eta\) and \(n\), where \(\eta\) is the promotion factor and \(n\) is the number of configurations at the lowest fidelity. After \(r\) promotion steps (from one fidelity level to the next), \(\lfloor n/\eta^r \rfloor\) configurations remain, so roughly \(\log_\eta(n)\) fidelity levels suffice to reduce the pool to a single configuration. For more details, see the api.
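The shrinking pool described above can be computed directly; a small helper (hypothetical name, for illustration only) makes the schedule concrete:

```python
def sh_rung_sizes(n=27, eta=3, r=3):
    """Number of configurations surviving each of r promotion steps.

    Starting from n configurations, each promotion keeps the best 1/eta.
    """
    return [n // eta ** i for i in range(r + 1)]

print(sh_rung_sizes())  # -> [27, 9, 3, 1]
```

With \(n=27\) and \(\eta=3\), three promotions take the pool from 27 configurations down to a single one at the highest fidelity.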

Asynchronous Successive Halving#

Asynchronous Successive Halving/ASHA (see paper) is an asynchronous version of SH that maximizes parallel evaluations.

Instead of waiting for all \(n\) configurations to finish on one fidelity, ASHA promotes the best configuration to the next fidelity as soon as enough evaluations have finished to make a decision, i.e. as soon as the top \(1/\eta\) fraction of finished configurations contains at least one not-yet-promoted candidate. This allows for quicker promotions and earlier high-fidelity results. When there are no promotable configurations, ASHA spawns new configurations at the lowest fidelity, so it always utilizes the available compute and increases exploration compared to SH.
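The asynchronous promotion rule can be sketched as follows. This is a simplified illustration of the idea, not the NePS implementation; the function and argument names are hypothetical:

```python
def asha_next_action(rung_results, promoted, eta=3):
    """Sketch of ASHA's asynchronous promotion rule (illustrative).

    rung_results: {config: loss} of evaluations finished at this rung.
    promoted:     set of configs already promoted from this rung.
    Returns a config to promote, or None (then sample a new config instead).
    """
    # Only the top 1/eta fraction of *finished* evaluations may advance.
    n_promotable = len(rung_results) // eta
    if n_promotable <= len(promoted):
        return None  # not enough evidence yet -> spawn a new low-fidelity config
    ranked = sorted(rung_results, key=rung_results.get)
    for config in ranked[:n_promotable]:
        if config not in promoted:
            return config
    return None
```

A free worker first asks each rung (highest first) for a promotable configuration; only if every rung returns `None` does it start a fresh configuration at the lowest fidelity.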

Prior-extended Successive Halving#

Although not inherently a Prior-optimizer, SH (and ASHA) can make use of Priors. Instead of sampling configurations uniformly, the optimizer can sample directly from the Prior, which results in a more focused search - highly beneficial if the Prior is reliable. Alternatively, SH can bias the promotion of configurations towards the Prior, keeping worse-performing but recommended configurations in the optimization process for longer.
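The first variant (sampling from the Prior instead of uniformly) can be sketched for a single numeric hyperparameter. All parameter names here are hypothetical, chosen for illustration; they are not the NePS API:

```python
import random

def sample_with_prior(prior_mean=0.3, prior_std=0.1, prior_confidence=0.5,
                      low=0.0, high=1.0):
    """Sketch of Prior-based sampling for SH (hypothetical parameter names).

    With probability prior_confidence, sample close to the user's Prior;
    otherwise fall back to uniform sampling so broad exploration remains.
    """
    if random.random() < prior_confidence:
        value = random.gauss(prior_mean, prior_std)
        return min(max(value, low), high)  # clip into the search range
    return random.uniform(low, high)
```

Raising `prior_confidence` concentrates the initial SH pool around the recommended value; keeping it below 1 preserves some uniform exploration in case the Prior is misleading.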

See the algorithm's implementation details in the api.

2 HyperBand#

HyperBand/HB (see paper) is an extension of Successive Halving that runs multiple Successive Halving rounds (brackets) in parallel.

Each of these rounds has a different resource budget and a different number of configurations. This makes HyperBand more flexible and parallelizable than SH.

See the algorithm's implementation details in the api.

Practical Tips
  • HyperBand is a good choice when you have a limited budget and want to parallelize your search.
  • It is more robust than SH when the correlation between lower and higher fidelities is low, since its brackets hedge across different minimum fidelities.
  • Hyperband has two parameters: \(\eta\) (typically 3 or 4) and \(R\), where \(\eta\) is the promotion factor and \(R\) is the maximum budget any single configuration will be trained on. A larger \(R\) yields better but slower results, while a larger \(\eta\) yields faster but noisier, potentially worse results. HB then spawns \(\lfloor \log_\eta(R)\rfloor + 1\) Successive Halving rounds. For more details, see the api.
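The bracket schedule can be computed from \(R\) and \(\eta\) alone. The sketch below follows the formulas in the HyperBand paper and is not the NePS implementation:

```python
from math import ceil

def hyperband_brackets(R=81, eta=3):
    """Starting points of HyperBand's brackets (sketch following the HB paper).

    Returns (n, r) pairs: n configurations, each first evaluated at budget r.
    """
    s_max = 0
    while eta ** (s_max + 1) <= R:  # s_max = floor(log_eta(R)), float-safe
        s_max += 1
    brackets = []
    for s in range(s_max, -1, -1):
        n = ceil((s_max + 1) * eta ** s / (s + 1))  # configs in this bracket
        r = R // eta ** s                           # minimum budget per config
        brackets.append((n, r))
    return brackets

print(hyperband_brackets())  # -> [(81, 1), (34, 3), (15, 9), (8, 27), (5, 81)]
```

The first bracket is the most aggressive (many configurations, tiny starting budget, i.e. plain SH); the last bracket is effectively random search at the full budget \(R\), which is what makes HB robust to low fidelity correlation.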

Info

HyperBand is chosen as the default optimizer in NePS when only Multi-Fidelity information, and no Prior, is available.

3 In-Context Freeze-Thaw Bayesian Optimization#

In-Context Freeze-Thaw Bayesian Optimization/IfBO (see paper) expands on the idea of Freeze-Thaw Bayesian Optimization (FT-BO) by using a Prior-data fitted network (PFN) as a surrogate for the FT-BO.

Standard FT-BO models the performance of a configuration with a Gaussian Process, assuming exponential loss decay. It uses this joint GP to fantasize results and selects the most informative configurations. The Entropy Search acquisition function (see paper) quantifies this information gain: $$ a(\boldsymbol{x}) = \int \left( H\left(P_{\min}\right) - H\left(P^y_{\min}\right) \right) P\left(y \mid { \lbrace (\boldsymbol{x}_n,y_n) \rbrace }^N\right) dy $$ where \(H\) is the entropy, \(P_{\min}\) is the distribution of the minimum value, \(P^y_{\min}\) is the same distribution given a new observation \(y\), and \(P(y \mid \cdot)\) is the probability of this result \(y\) from a configuration \(\boldsymbol{x}\) (given the \(N\) observations \({\lbrace (\boldsymbol{x}_n,y_n) \rbrace}^N\) so far). The acquisition function thus measures the expected entropy reduction of \(P_{\min}\), i.e. the expected information gained about the location of the minimum from evaluating configuration \(\boldsymbol{x}\).
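On a discretized candidate set, this expected entropy reduction can be illustrated numerically. The sketch below assumes the surrogate has already produced \(P_{\min}\), its fantasized updates, and the outcome probabilities; real implementations approximate these with Monte-Carlo sampling:

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution."""
    return -sum(pi * math.log(max(pi, eps)) for pi in p if pi > 0.0)

def information_gain(p_min, p_min_given_y, p_y):
    """Expected entropy reduction a(x) over fantasized outcomes y_k (sketch).

    p_min:            current P_min over a discretized candidate set
    p_min_given_y[k]: P_min updated under the k-th fantasized outcome y_k
    p_y[k]:           model probability of that outcome
    """
    h_now = entropy(p_min)
    return sum(pk * (h_now - entropy(pm)) for pk, pm in zip(p_y, p_min_given_y))
```

For example, an observation that would collapse a uniform \(P_{\min}\) over four candidates to certainty yields a gain of \(\log 4\), while an uninformative observation yields zero.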

Fantasizing
The image shows the fantasizing of exponential loss decay in FT-BO. (Image Source: FT-BO-paper, Jan 27, 2025)

IfBO employs the same concept, but instead of a GP, it uses a PFN to model the performance of configurations. PFNs (see paper) are transformer networks fitted to many (synthetic) runs. They can model the performance of configurations across all fidelities and are used in IfBO to fantasize the outcomes. The deciding advantage is that PFNs can model complex relationships between configurations and fidelities, not just exponential decay. On top of that, PFNs utilize in-context learning to quickly adapt from their general prior to the current optimization process, resulting in better overall performance than GPs.

Lastly, IfBO adapts the FT-BO idea of freezing (pausing training on) configurations that are not informative anymore and thawing (resuming training on) them when they become interesting again. It therefore chooses automatically between starting new configurations or thawing old ones.
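The freeze-thaw decision itself reduces to comparing old and new candidates under one acquisition function. A minimal sketch (illustrative; names and the string return values are hypothetical):

```python
def freeze_thaw_step(paused, fresh, acquisition):
    """Sketch of the freeze-thaw decision in IfBO-style optimization.

    paused:      configurations frozen with a training checkpoint
    fresh:       newly sampled, not-yet-started configurations
    acquisition: maps a configuration to its expected information gain
    Picks the single most informative configuration, old or new.
    """
    best = max(list(paused) + list(fresh), key=acquisition)
    action = "thaw" if best in paused else "start"
    return action, best
```

Because paused and fresh configurations compete under the same acquisition value, the optimizer automatically resumes an old run whenever its continued learning curve promises more information than any new start.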

Freeze-Thaw
The image shows the Freeze-Thaw mechanism, with the colors indicating at which iteration a configuration was evaluated at each fidelity. Note, for example, some yellow configurations being reused much later, ending in red. (Image Source: FT-BO-paper, Jan 27, 2025)

See the algorithm's implementation details in the api.

Practical Tips
  • IfBO is a good choice when the problem allows low-fidelity configurations to be continued later to retrieve high-fidelity results, utilizing NePS's checkpointing feature.

For optimizers using both Priors and Multi-Fidelity, please refer here.